Description
🐛 Describe the bug
import torch
import torch._inductor.config as config
from torch import Tensor

config.cpp_wrapper = True  # use Inductor's C++ wrapper codegen

@torch.compile
def foo(x: Tensor):
    return x.sin()

x = torch.tensor(0.0, device="cuda", requires_grad=True)
foo(x).backward()
print(x.grad)
Running this code TWICE (so that the second run hits the FX graph cache) fails on the second run with:
Traceback (most recent call last):
File "/root/modded-nanogpt/custom_op_cache.py", line 15, in <module>
foo(x).backward()
File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 648, in backward
torch.autograd.backward(
File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
_engine_run_backward(
File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 307, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1958, in backward
return impl_fn()
^^^^^^^^^
File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1944, in impl_fn
out = CompiledFunction._backward_impl(ctx, all_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2079, in _backward_impl
out = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 464, in __call__
return self.current_callable(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 2203, in run
return model(new_inputs)
^^^^^^^^^^^^^^^^^
File "/tmp/torchinductor_root/pw/cpwoz7xtew3ko7zejrn4bsrizhftvllcrykvty7vz5xn6v3zmkbp.py", line 262, in g
output_handles = f(input_handles)
^^^^^^^^^^^^^^^^
RuntimeError: CUDA driver error: invalid device context
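The failure tracks the on-disk cache, not the process. A minimal driver sketch to make the two runs reproducible (assumptions: the repro above is saved as custom_op_cache.py, and TORCHINDUCTOR_CACHE_DIR is used to point Inductor at an isolated cache directory, so the first subprocess run is always a cold-cache pass and the second is always a warm-cache pass):

import os
import subprocess
import sys
import tempfile

SCRIPT = "custom_op_cache.py"  # hypothetical filename for the repro above

with tempfile.TemporaryDirectory() as cache_dir:
    env = dict(os.environ, TORCHINDUCTOR_CACHE_DIR=cache_dir)
    # First run: cold cache, Inductor compiles from scratch, succeeds.
    subprocess.run([sys.executable, SCRIPT], env=env, check=True)
    # Second run: the FX graph cache in cache_dir is now warm; loading
    # the cached cpp-wrapper artifact triggers
    # "RuntimeError: CUDA driver error: invalid device context".
    subprocess.run([sys.executable, SCRIPT], env=env)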
Turning off the FX graph cache fixes it:
import os
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "0"  # must be set before importing torch

import torch
import torch._inductor.config as config
from torch import Tensor

config.cpp_wrapper = True

@torch.compile
def foo(x: Tensor):
    return x.sin()

x = torch.tensor(0.0, device="cuda", requires_grad=True)
foo(x).backward()
print(x.grad)
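The same workaround can presumably be applied in-process through the config flag that backs this env var (assumption: config.fx_graph_cache mirrors TORCHINDUCTOR_FX_GRAPH_CACHE and is honored as long as it is set before the first compile), a sketch:

import torch
import torch._inductor.config as config
from torch import Tensor

config.cpp_wrapper = True
config.fx_graph_cache = False  # assumption: equivalent to TORCHINDUCTOR_FX_GRAPH_CACHE=0

@torch.compile
def foo(x: Tensor):
    return x.sin()

x = torch.tensor(0.0, device="cuda", requires_grad=True)
foo(x).backward()
print(x.grad)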
This bug might be related to #144344.
Versions
Collecting environment information...
PyTorch version: 2.7.0.dev20250110+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.35
Python version: 3.12.8 (main, Dec 19 2024, 14:33:20) [Clang 18.1.8 ] (64-bit runtime)
Python platform: Linux-5.4.250-2-velinux1u1-amd64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.6.77
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 168
On-line CPU(s) list: 0-161
Off-line CPU(s) list: 162-167
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8457C
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 42
Socket(s): 2
Stepping: 8
BogoMIPS: 5199.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3.9 MiB (84 instances)
L1i cache: 2.6 MiB (84 instances)
L2 cache: 168 MiB (84 instances)
L3 cache: 195 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-83
NUMA node1 CPU(s): 84-167
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
Versions of relevant libraries:
[pip3] numpy==2.2.1
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pytorch-triton==3.2.0+git0d4682f0
[pip3] torch==2.7.0.dev20250110+cu126
[conda] Could not collect
cc @ezyang @SherlockNoMad @EikanWang @jgong5 @wenzhe-nrv @chauhang @penguinwu @voznesenskym @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @aakhundov @BoyuanFeng