
Building CUDA kernels with debug information causes some CUDA kernel launches to fail due to capacity constraints #160225

Description

@galv

I'm not sure that this one is worth fixing, but I want to document it.

If you compile your CUDA kernels with the `-G` flag, nvcc compiles them in debug mode, which turns off optimizations. Turning off optimizations is normally expected not to change the functionality of your code, but unfortunately with CUDA code it can.

The best way to turn on debug information is by adding `string(APPEND CMAKE_CUDA_FLAGS_DEBUG " -G")` right here:

Turning off optimizations increases the number of registers used by your code, which can prevent certain block sizes from being used: a block of that size would need more registers than the SM has, and the launch fails with `cudaErrorLaunchOutOfResources`.
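As a back-of-the-envelope sketch (my numbers, not from the issue): most recent NVIDIA architectures provide 65,536 32-bit registers per SM, and a block can only launch if block size × registers per thread fits within that budget. Real register allocation has extra granularity rules, so this is just an upper bound, but it shows how a debug build's higher register count shrinks the largest launchable block:

```python
# Sketch: why the extra registers from -G can make a launch fail.
# Assumes 65,536 registers per SM (typical of recent architectures;
# the true limit is device-specific and queryable at runtime) and the
# usual 1024-thread hardware cap on block size.

REGISTERS_PER_SM = 64 * 1024

def max_block_size(regs_per_thread, warp_size=32, hw_max_block=1024):
    """Largest launchable block size, rounded down to a whole warp."""
    threads = min(REGISTERS_PER_SM // regs_per_thread, hw_max_block)
    return (threads // warp_size) * warp_size

# An optimized kernel using 32 registers/thread can still launch
# full 1024-thread blocks:
print(max_block_size(32))   # 1024

# The same kernel built with -G might balloon to 128 registers/thread,
# so a 1024-thread launch now exceeds the SM's register file:
print(max_block_size(128))  # 512
```

Under these assumptions, any call site that hard-codes a 1024-thread block for the second kernel would hit `cudaErrorLaunchOutOfResources` in a debug build even though the release build works.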

Specifically, I have seen that `torch._C._nn.cross_entropy_loss` can fail, though I did not bother to document the sizes and dtypes which cause the failure.

I'm not sure this is worth fixing, since it's hard to fix and `-G` is a rarely used option. But again, I think it is worthwhile to document.

cc @malfet @seemethere @ptrblck @msaroufim @eqy @jerryzh168

Metadata

Labels

    module: build — Build system issues
    module: cuda — Related to torch.cuda, and CUDA support in general
    module: debug-build — Related to building and testing PyTorch in debug mode
    module: third_party
    triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
