Pointer passed where number is expected for PYTORCH_CUDA_FUSER_JIT_OPT_LEVEL leading to crash #52147

@Flamefire

Description

🐛 Bug

The CUDA driver API expects a void** array of option values for functions like cuModuleLoadDataEx. The documentation is unclear about what each entry should be, but according to other sources (see below) each value must be cast directly to a void*, not passed as a pointer to the value.
Hence the code at

option_vals.emplace_back(&jit_opt_level);

is wrong and may lead to failed executions or wrong optimization levels.

I've seen this in one of the PyTorch tests (see below) where I get:

======================================================================
ERROR: test_unary_ops (test_jit_cuda_fuser.TestCudaFuser)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/install_pt/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 827, in wrapper
    method(*args, **kwargs)
  File "/dev/shm/s3248973-EasyBuild/PyTorch/1.7.1/fosscuda-2020b/pytorch-1.7.1/test/test_jit_cuda_fuser.py", line 369, in test_unary_ops
    self._unary_test_helper(op)
  File "/dev/shm/s3248973-EasyBuild/PyTorch/1.7.1/fosscuda-2020b/pytorch-1.7.1/test/test_jit_cuda_fuser.py", line 328, in _unary_test_helper
    jit_o = t_jit(x, 2.0)
  File "/tmp/install_pt/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 126, in prof_func_call
    return prof_callable(func_call, *args, **kwargs)
  File "/tmp/install_pt/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 123, in prof_callable
    return callable(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: CUDA driver error: a PTX JIT compilation failed

And to verify I added the following code to torch/csrc/jit/codegen/cuda/executor_utils.cpp above the call to cuModuleLoadDataEx:

  // Ask the driver to write the ptxas error log into our buffer.
  options.push_back(CU_JIT_ERROR_LOG_BUFFER);
  options.push_back(CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES);
  std::string errors(8000, '\0');
  option_vals.push_back((void*) errors.data());
  // The buffer size is itself passed by value, cast to void*.
  option_vals.push_back((void*) errors.size());

When printing this string on failure I got:

ptxas fatal : 32-bit integer value (3849789140) out of range

This value is exactly the address of jit_opt_level, which confirms the above.

PS: It is likely a good idea to include the JIT error buffer in PyTorch and report it on failure.

References:

To Reproduce

Steps to reproduce the behavior:

  1. python test_jit_cuda_fuser_legacy.py -k test_unary_ops

Environment

  • PyTorch Version (e.g., 1.0): 1.7.1, master

cc @gmagogsfm

Labels: oncall: jit