PTXAS compilation error: '.tile::gather4 with destination state space as .shared::cluster' not supported on target 'sm_121a' #8335

@yvbbrjdr

Describe the bug

When JIT-compiling a Triton kernel (specifically matmul_ogs from triton_kernels), the compiler generates PTX that uses the .tile::gather4 instruction with .shared::cluster as the destination state space.

NVIDIA's ptxas assembler rejects this PTX, reporting that the feature is not supported on the target architecture sm_121a. This suggests that Triton's code generation for sm_121a emits an instruction form that the target does not support.

The error is raised during a call to the matmul_ogs kernel. The attached log contains the full trace and the PTX generated by Triton, which may help in debugging.
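To locate the offending instructions in a dumped PTX file, a small scan along these lines may help. This is only a sketch: the function name `find_gather4_cluster_lines` is hypothetical, and it assumes each instruction sits on a single line, which may not hold for every PTX dump.

```python
def find_gather4_cluster_lines(ptx_text):
    """Return (line_number, code) pairs for lines that mention both
    .tile::gather4 and .shared::cluster.

    Assumption: one instruction per line; multi-line instructions
    would need a real tokenizer instead of a substring check.
    """
    hits = []
    for lineno, line in enumerate(ptx_text.splitlines(), start=1):
        # Drop trailing "//" comments so commented-out text is not counted.
        code = line.split("//", 1)[0]
        if ".tile::gather4" in code and ".shared::cluster" in code:
            hits.append((lineno, code.strip()))
    return hits
```

Running it over the text of the .ptx file that ptxas names in its error output should return line numbers that can be compared against the ones ptxas reports.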

triton.log

Summary:

Traceback (most recent call last):
  File "/REDACTED.py", line 316, in REDACTED
    REDACTED = matmul_ogs(
               ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/triton_kernels/matmul_ogs.py", line 601, in matmul_ogs
    (kernels._p_matmul_ogs if opt_flags.is_persistent else kernels._matmul_ogs)[(grid,)](
  File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 419, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 733, in run
    kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 861, in _do_compile
    kernel = self.compile(src, target=target, options=options.__dict__)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 320, in compile
    next_module = compile_ir(module, metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py", line 520, in <lambda>
    stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.target.arch)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
    raise PTXASError(error)
triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
`ptxas` stderr:
ptxas /tmp/tmpda2tgdg3.ptx, line 4253; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4258; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4262; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4266; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4270; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4274; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4278; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4282; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4286; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4290; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4294; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4298; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4302; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4306; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4310; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas /tmp/tmpda2tgdg3.ptx, line 4314; error   : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_121a'
ptxas fatal   : Ptx assembly aborted due to errors

Repro command: /usr/local/cuda/bin/ptxas -lineinfo -v --gpu-name=sm_121a /tmp/tmpda2tgdg3.ptx -o /tmp/tmpda2tgdg3.ptx.o
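The ptxas output above repeats the same message for many PTX lines. A helper like the following could condense it when triaging. It is a sketch, not part of Triton: the name `summarize_ptxas_errors` is hypothetical, and the regular expression is modeled only on the stderr lines quoted in this report, so other ptxas versions may need a different pattern.

```python
import re
from collections import defaultdict

# Pattern inferred from the stderr lines in this report (an assumption,
# not a documented ptxas output format).
_PTXAS_ERROR = re.compile(
    r"^ptxas (?P<file>\S+), line (?P<line>\d+); error\s*: "
    r"Feature '(?P<feature>[^']+)' not supported on \.target '(?P<target>[^']+)'"
)


def summarize_ptxas_errors(stderr_text):
    """Group 'Feature ... not supported' errors by (feature, target).

    Returns a dict mapping (feature, target) to the list of PTX line
    numbers, in the order they appear. Lines that do not match the
    pattern (for example the final 'ptxas fatal' line) are ignored.
    """
    summary = defaultdict(list)
    for raw in stderr_text.splitlines():
        match = _PTXAS_ERROR.match(raw.strip())
        if match:
            key = (match["feature"], match["target"])
            summary[key].append(int(match["line"]))
    return dict(summary)
```

Applied to the stderr in this report, all sixteen errors should collapse into a single (feature, target) entry, which makes it clear that only one unsupported instruction form is involved.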

Environment details

  • Triton version: 3.5.0 (as bundled with a PyTorch nightly build).
  • GPU: DGX Spark, GB10, sm_121a.
  • CUDA Toolkit: CUDA 13.0.1
