
Building CUDA kernels with debug information causes some CUDA kernel launches to fail due to capacity constraints #160225

Description

@galv

I'm not sure that this one is worth fixing, but I want to document it.

If you compile your CUDA kernels with the `-G` flag, nvcc compiles them in debug mode, which turns off optimizations. Turning off optimizations is normally expected not to change the functionality of your code, but unfortunately with CUDA code it can.

The best way to turn on debug information is by adding `string(APPEND CMAKE_CUDA_FLAGS_DEBUG " -G")` right here:

Turning off optimizations increases the number of registers used by your code, which can prevent certain block sizes from being used: a block of that size would need more registers than the SM has, and the launch fails with `cudaErrorLaunchOutOfResources`.
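As a back-of-the-envelope sketch (my numbers, not from the issue): most recent NVIDIA architectures provide 65,536 32-bit registers per SM, and a block can only launch if block size × registers per thread fits within that budget. Real register allocation has extra granularity rules, so this is just an upper bound, but it shows how a debug build's higher register count shrinks the largest launchable block:

```python
# Sketch: why the extra registers from -G can make a launch fail.
# Assumes 65,536 registers per SM (typical of recent architectures;
# the true limit is device-specific and queryable at runtime) and the
# usual 1024-thread hardware cap on block size.

REGISTERS_PER_SM = 64 * 1024

def max_block_size(regs_per_thread, warp_size=32, hw_max_block=1024):
    """Largest launchable block size, rounded down to a whole warp."""
    threads = min(REGISTERS_PER_SM // regs_per_thread, hw_max_block)
    return (threads // warp_size) * warp_size

# An optimized kernel using 32 registers/thread can still launch
# full 1024-thread blocks:
print(max_block_size(32))   # 1024

# The same kernel built with -G might balloon to 128 registers/thread,
# so a 1024-thread launch now exceeds the SM's register file:
print(max_block_size(128))  # 512
```

Under these assumptions, any call site that hard-codes a 1024-thread block for the second kernel would hit `cudaErrorLaunchOutOfResources` in a debug build even though the release build works.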

Specifically, I have seen that `torch._C._nn.cross_entropy_loss` can fail, though I did not bother to document the sizes and dtypes which cause the failure.

I'm not sure this is worth fixing, since it's hard to fix and `-G` is a rarely used option. But again, I think it is worthwhile to document.

cc @malfet @seemethere @ptrblck @msaroufim @eqy @jerryzh168

Metadata

Labels

    module: build — Build system issues
    module: cuda — Related to torch.cuda, and CUDA support in general
    module: debug-build — Related to building and testing PyTorch in debug mode
    module: third_party
    triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
