
Flash Attention 2 + Dynamo + FSDP accelerate plugin + torch.compile error #158186

@martinambrus

🐛 Describe the bug

When using the combination of packages from the title (torch==2.7.1, flash_attn==2.8.1, accelerate==1.8.1), launching with accelerate (accelerate launch ...) with the FSDP and TorchDynamoPlugin plugins enabled, and with Flash Attention 2 enabled in one of the modules, I get the following warning:

lib/python3.13/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin flash_attn_2_cuda.PyCapsule.varlen_fwd. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
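
For reference, a minimal sketch of the kind of setup that triggers the warning. The module, dimensions, and packed-sequence layout below are illustrative placeholders (the original report includes no repro script), and the script assumes it is run under accelerate launch so FSDP can initialize a process group:

```python
import torch
import torch.nn as nn
from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin, TorchDynamoPlugin
from flash_attn import flash_attn_varlen_func  # calls flash_attn_2_cuda.varlen_fwd

class FlashAttnBlock(nn.Module):
    """Toy self-attention block routed through the flash-attn varlen kernel."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)

    def forward(self, x):
        # x: (total_tokens, dim), i.e. sequences packed without padding.
        t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (t, self.heads, d // self.heads)
        cu = torch.tensor([0, t], device=x.device, dtype=torch.int32)
        # Dynamo cannot trace into this pybind C++ extension call.
        out = flash_attn_varlen_func(q.view(shape), k.view(shape), v.view(shape),
                                     cu, cu, t, t)
        return out.reshape(t, d)

accelerator = Accelerator(
    fsdp_plugin=FullyShardedDataParallelPlugin(),
    dynamo_plugin=TorchDynamoPlugin(backend="inductor"),
)
model = accelerator.prepare(FlashAttnBlock().cuda().half())
x = torch.randn(128, 64, device="cuda", dtype=torch.float16)
model(x)  # emits the "Dynamo does not know how to trace ... varlen_fwd" warning
```

The warning itself suggests two workarounds. A minimal sketch of both, assuming the model calls flash_attn_varlen_func; the mylib namespace and the fa2_varlen wrapper are hypothetical names, and a real training integration would also need a backward registration:

```python
import torch
from flash_attn import flash_attn_varlen_func

# Workaround 1: tell Dynamo the Python wrapper may stay in the graph as a
# single opaque call instead of being traced into.
torch.compiler.allow_in_graph(flash_attn_varlen_func)

# Workaround 2: wrap the extension call in a PyTorch custom operator so
# torch.compile treats it as a first-class op ("mylib" is a hypothetical
# namespace; this sketch is forward-only).
@torch.library.custom_op("mylib::fa2_varlen", mutates_args=())
def fa2_varlen(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
               cu_seqlens: torch.Tensor, max_seqlen: int) -> torch.Tensor:
    return flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens,
                                  max_seqlen, max_seqlen)

@fa2_varlen.register_fake
def _(q, k, v, cu_seqlens, max_seqlen):
    # FakeTensor propagation: output has the same shape/dtype as q.
    return torch.empty_like(q)
```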

Versions

Collecting environment information...
PyTorch version: 2.7.1+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.28.0
Libc version: glibc-2.31

Python version: 3.13.2 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:02) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1089-azure-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
GPU 2: NVIDIA A100 80GB PCIe
GPU 3: NVIDIA A100 80GB PCIe

Nvidia driver version: 570.133.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.5.1
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
Address sizes:                        48 bits physical, 48 bits virtual
CPU(s):                               96
On-line CPU(s) list:                  0-95
Thread(s) per core:                   1
Core(s) per socket:                   48
Socket(s):                            2
NUMA node(s):                         4
Vendor ID:                            AuthenticAMD
CPU family:                           25
Model:                                1
Model name:                           AMD EPYC 7V13 64-Core Processor
Stepping:                             1
CPU MHz:                              2445.440
BogoMIPS:                             4890.88
Hypervisor vendor:                    Microsoft
Virtualization type:                  full
L1d cache:                            3 MiB
L1i cache:                            3 MiB
L2 cache:                             48 MiB
L3 cache:                             384 MiB
NUMA node0 CPU(s):                    0-23
NUMA node1 CPU(s):                    24-47
NUMA node2 CPU(s):                    48-71
NUMA node3 CPU(s):                    72-95
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm

Versions of relevant libraries:
[pip3] lion-pytorch==0.2.3
[pip3] numpy==2.2.5
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pytorch-lightning==2.5.1.post0
[pip3] pytorch-metric-learning==2.8.1
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.7.1
[pip3] torch-audiomentations==0.12.0
[pip3] torch-optimizer==0.3.0
[pip3] torch_pitch_shift==1.2.5
[pip3] torchaudio==2.7.1
[pip3] torchcrepe==0.0.24
[pip3] torchmetrics==1.7.1
[pip3] torchvision==0.22.1
[pip3] triton==3.3.1
[conda] blas                      1.0                         mkl  
[conda] cuda-cudart               12.4.127                      0    nvidia
[conda] cuda-cudart-dev           12.4.127                      0    nvidia
[conda] cuda-cudart-static        12.4.127                      0    nvidia
[conda] cuda-cupti                12.4.127                      0    nvidia
[conda] cuda-cupti-static         12.4.127                      0    nvidia
[conda] cuda-libraries            12.4.1               h06a4308_1  
[conda] cuda-libraries-dev        12.4.1               h06a4308_1  
[conda] cuda-libraries-static     12.4.1                        0    nvidia
[conda] cuda-nvrtc                12.4.127                      0    nvidia
[conda] cuda-nvrtc-dev            12.4.127                      0    nvidia
[conda] cuda-nvrtc-static         12.4.127                      0    nvidia
[conda] cuda-nvtx                 12.4.127                      0    nvidia
[conda] cuda-opencl               12.4.127                      0    nvidia
[conda] cuda-opencl-dev           12.4.127                      0    nvidia
[conda] cudatoolkit               11.1.1              hb139c0e_13    conda-forge
[conda] intel-openmp              2023.1.0         hdb19cb5_46306  
[conda] libcublas                 12.4.5.8                      0    nvidia
[conda] libcublas-dev             12.4.5.8                      0    nvidia
[conda] libcublas-static          12.4.5.8                      0    nvidia
[conda] libcufft                  11.2.1.3                      0    nvidia
[conda] libcufft-dev              11.2.1.3                      0    nvidia
[conda] libcufft-static           11.2.1.3                      0    nvidia
[conda] libcurand                 10.3.5.147                    0    nvidia
[conda] libcurand-dev             10.3.5.147                    0    nvidia
[conda] libcurand-static          10.3.5.147                    0    nvidia
[conda] libcusolver               11.6.1.9                      0    nvidia
[conda] libcusolver-dev           11.6.1.9                      0    nvidia
[conda] libcusolver-static        11.6.1.9                      0    nvidia
[conda] libcusparse               12.3.1.170                    0    nvidia
[conda] libcusparse-dev           12.3.1.170                    0    nvidia
[conda] libcusparse-static        12.3.1.170                    0    nvidia
[conda] libnvjitlink              12.4.127                      0    nvidia
[conda] libnvjitlink-dev          12.4.127                      0    nvidia
[conda] mkl                       2023.1.0         h213fc3f_46344  
[conda] mkl-service               2.4.0            py38h5eee18b_1  
[conda] mkl_fft                   1.3.8            py38h5eee18b_0  
[conda] mkl_random                1.2.4            py38hdb19cb5_0  
[conda] numpy                     1.24.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.77                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] pytorch-lightning         2.4.0                    pypi_0    pypi
[conda] pytorch-metric-learning   2.7.0                    pypi_0    pypi
[conda] tbb                       2021.8.0             hdb19cb5_0  
[conda] torch                     2.4.1                    pypi_0    pypi
[conda] torch-audiomentations     0.11.1                   pypi_0    pypi
[conda] torch-pitch-shift         1.2.5                    pypi_0    pypi
[conda] torchaudio                0.8.0                      py38    pytorch
[conda] torchmetrics              1.5.1                    pypi_0    pypi
[conda] torchvision               0.15.2          cpu_py38h83e0c9b_0  
[conda] triton                    3.0.0                    pypi_0    pypi

cc @chauhang @penguinwu @zou3519 @bdhirsh


Labels

module: custom-operators (custom operators, custom ops)
module: pt2-dispatcher (PT2 dispatcher-related issues, e.g. aotdispatch, functionalization, faketensor, custom-op)
module: sdpa (all things related to torch.nn.functional.scaled_dot_product_attention)
oncall: pt2
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
