Inductor C++ Wrapper + autograd cause error in the second run because of FX graph cache

### 🐛 Describe the bug

```python
import torch
import torch._inductor.config as config
from torch import Tensor

config.cpp_wrapper = True


@torch.compile
def foo(x: Tensor):
    return x.sin()


x = torch.tensor(0.0, device="cuda", requires_grad=True)

foo(x).backward()
print(x.grad)
```
run this code __TWICE__ will get an error in the second run:
```
Traceback (most recent call last):
  File "/root/modded-nanogpt/custom_op_cache.py", line 15, in <module>
    foo(x).backward()
  File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 648, in backward
    torch.autograd.backward(
  File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 307, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1958, in backward
    return impl_fn()
           ^^^^^^^^^
  File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1944, in impl_fn
    out = CompiledFunction._backward_impl(ctx, all_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2079, in _backward_impl
    out = call_func_at_runtime_with_args(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 464, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/modded-nanogpt/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 2203, in run
    return model(new_inputs)
           ^^^^^^^^^^^^^^^^^
  File "/tmp/torchinductor_root/pw/cpwoz7xtew3ko7zejrn4bsrizhftvllcrykvty7vz5xn6v3zmkbp.py", line 262, in g
    output_handles = f(input_handles)
                     ^^^^^^^^^^^^^^^^
RuntimeError: CUDA driver error: invalid device context
```
Turn off FX graph cache can fix it:
```python
import os

os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "0"

import torch
import torch._inductor.config as config
from torch import Tensor

config.cpp_wrapper = True


@torch.compile
def foo(x: Tensor):
    return x.sin()


x = torch.tensor(0.0, device="cuda", requires_grad=True)

foo(x).backward()
print(x.grad)
```
This bug might be relevant to https://github.com/pytorch/pytorch/issues/144344

### Versions

Collecting environment information...
PyTorch version: 2.7.0.dev20250110+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.35

Python version: 3.12.8 (main, Dec 19 2024, 14:33:20) [Clang 18.1.8 ] (64-bit runtime)
Python platform: Linux-5.4.250-2-velinux1u1-amd64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.6.77
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          168
On-line CPU(s) list:             0-161
Off-line CPU(s) list:            162-167
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8457C
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              42
Socket(s):                       2
Stepping:                        8
BogoMIPS:                        5199.99
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear arch_capabilities
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       3.9 MiB (84 instances)
L1i cache:                       2.6 MiB (84 instances)
L2 cache:                        168 MiB (84 instances)
L3 cache:                        195 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-83
NUMA node1 CPU(s):               84-167
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Unknown: No mitigations
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled

Versions of relevant libraries:
[pip3] numpy==2.2.1
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pytorch-triton==3.2.0+git0d4682f0
[pip3] torch==2.7.0.dev20250110+cu126
[conda] Could not collect

cc @ezyang @SherlockNoMad @EikanWang @jgong5 @wenzhe-nrv @chauhang @penguinwu @voznesenskym @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @aakhundov @BoyuanFeng

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Inductor C++ Wrapper + autograd cause error in the second run because of FX graph cache #144609

🐛 Describe the bug

Versions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Inductor C++ Wrapper + autograd cause error in the second run because of FX graph cache #144609

Description

🐛 Describe the bug

Versions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions