Describe the bug
When JIT-compiling a Triton kernel (specifically matmul_ogs from triton_kernels), the compiler generates PTX assembly that uses the .tile::gather4 instruction with .shared::cluster as the destination state space.
The NVIDIA ptxas assembler fails to compile this PTX code, reporting that this specific feature is not supported on the target architecture sm_121a. This suggests that Triton's code generation for this architecture is producing an instruction that the hardware/driver toolchain does not support.
The issue occurs during a call to the matmul_ogs kernel. The full PTX code generated by Triton is attached below (full trace in the file), which may help in debugging.
Summary:
Traceback (most recent call last):
File "/REDACTED.py", line 316, in REDACTED
REDACTED = matmul_ogs(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/triton_kernels/matmul_ogs.py", line 601, in matmul_ogs
(kernels._p_matmul_ogs if opt_flags.is_persistent else kernels._matmul_ogs)[(grid,)](
File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 419, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 733, in run
kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 861, in _do_compile
kernel = self.compile(src, target=target, options=options.__dict__)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 320, in compile
next_module = compile_ir(module, metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py", line 520, in <lambda>
stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.target.arch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
raise PTXASError(error)
triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
`ptxas` stderr:
ptxas /tmp/tmpda2tgdg3.ptx, line 4253; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4258; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4262; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4266; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4270; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4274; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4278; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4282; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4286; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4290; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4294; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4298; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4302; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4306; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4310; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4314; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas fatal : Ptx assembly aborted due to errors
Repro command: /usr/local/cuda/bin/ptxas -lineinfo -v --gpu-name=sm_121a /tmp/tmpda2tgdg3.ptx -o /tmp/tmpda2tgdg3.ptx.o
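To triage a log like the one above, it can help to collapse the repeated ptxas diagnostics and look up the PTX lines they point at. The following is a minimal sketch and not part of Triton: parse_ptxas_errors, summarize and offending_lines are hypothetical helper names, and the regular expression assumes the exact message format shown in this report.

```python
import re
from collections import Counter

# Assumed diagnostic format, copied from the log in this report:
#   ptxas <file>, line <N>; error : Feature '<feature>' not supported on .target '<arch>'
PTXAS_ERROR = re.compile(
    r"^ptxas (?P<file>\S+), line (?P<line>\d+); error\s*: "
    r"Feature '(?P<feature>.+)' not supported on \.target '(?P<target>[^']+)'"
)


def parse_ptxas_errors(stderr):
    """Return (ptx_line_number, feature, target) for each matching diagnostic."""
    errors = []
    for raw in stderr.splitlines():
        match = PTXAS_ERROR.match(raw.strip())
        if match:
            errors.append((int(match["line"]), match["feature"], match["target"]))
    return errors


def summarize(errors):
    """Count how often each (feature, target) pair was reported."""
    return Counter((feature, target) for _, feature, target in errors)


def offending_lines(ptx_text, errors):
    """Map each reported 1-based line number to the PTX source line it refers to."""
    lines = ptx_text.splitlines()
    return {n: lines[n - 1].strip() for n, _, _ in errors if 1 <= n <= len(lines)}


if __name__ == "__main__":
    # Two diagnostics taken verbatim from the log above.
    stderr = (
        "ptxas /tmp/tmpda2tgdg3.ptx, line 4253; error : Feature '.tile::gather4 with "
        "destination state space as .shared::cluster' not supported on .target 'sm_121a'\n"
        "ptxas /tmp/tmpda2tgdg3.ptx, line 4258; error : Feature '.tile::gather4 with "
        "destination state space as .shared::cluster' not supported on .target 'sm_121a'\n"
        "ptxas fatal : Ptx assembly aborted due to errors\n"
    )
    for (feature, target), count in summarize(parse_ptxas_errors(stderr)).items():
        print(f"{count} x {feature!r} on {target}")
```

Passing the contents of the attached PTX file to offending_lines, together with the parsed errors, would show the instructions ptxas rejected without scrolling through the whole file.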
Environment details
- Triton version: 3.5.0, coming with PyTorch nightly.
- GPU: DGX Spark, GB10, sm_121a.
- CUDA Toolkit: CUDA 13.0.1
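For anyone trying to confirm they are on a comparable setup, the details above can be gathered programmatically. This is a rough sketch under stated assumptions: collect_env is a hypothetical helper, torch.version.cuda reports the CUDA version PyTorch was built against (which may differ from the toolkit that provides /usr/local/cuda/bin/ptxas), and torch.cuda.get_device_capability() returns only the major and minor numbers, so the architecture-specific "a" suffix in sm_121a does not appear in its output.

```python
import importlib
import platform


def collect_env():
    """Collect version details similar to the 'Environment details' list above."""
    info = {"python": platform.python_version()}

    # Importing can fail for reasons other than absence (for example missing
    # shared libraries), so any exception is treated as "not usable here".
    try:
        triton = importlib.import_module("triton")
        info["triton"] = getattr(triton, "__version__", "unknown")
    except Exception:
        info["triton"] = "not installed"

    try:
        torch = importlib.import_module("torch")
    except Exception:
        info["torch"] = "not installed"
        return info

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built with; None for CPU-only builds.
    info["torch_cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability()
        info["gpu"] = torch.cuda.get_device_name()
        info["compute_capability"] = f"sm_{major}{minor}"
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```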