
Conversation

tom-pollak
Contributor

Summary

The documented reference implementation of SDPA doesn't numerically match the MATH backend (SDPBackend.MATH). This causes confusion when testing the numerical accuracy of kernels or other code against MATH.

Changes

Updated the reference implementation to match MATH's actual behavior. The key corrections are:

  • The MATH backend pre-scales both the query and key tensors (each by the square root of the scale factor) before the matmul, for numerical stability, rather than applying the scale after the matmul.
  • The MATH backend internally upcasts fp16/bf16 inputs to float32, then converts back to the original dtype at the end.
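
For illustration, here's a minimal sketch of a reference that follows those two behaviors (the function name and simplifications are mine; masking, dropout, and GQA handling are omitted):

```python
import math
import torch

def sdpa_reference(query, key, value, scale=None):
    # Simplified sketch: no attention mask, dropout, or GQA handling.
    scale_factor = 1 / math.sqrt(query.size(-1)) if scale is None else scale
    orig_dtype = query.dtype
    if orig_dtype in (torch.float16, torch.bfloat16):
        # MATH upcasts half-precision inputs to float32 internally.
        query, key, value = (t.float() for t in (query, key, value))
    # Pre-scale Q and K by sqrt(scale_factor) each, instead of scaling Q @ K^T afterwards.
    query = query * math.sqrt(scale_factor)
    key = key * math.sqrt(scale_factor)
    attn_weight = torch.softmax(query @ key.transpose(-2, -1), dim=-1)
    return (attn_weight @ value).to(orig_dtype)
```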

Added a regression test, test_reference_implementation_bitwise_match_math_backend, which verifies an exact bitwise match (rtol=0, atol=0) between the reference implementation and MATH.
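
In spirit, the comparison in the test looks something like the following (a sketch, not the test's actual code; `sdpa_reference` is the hypothetical helper from the sketch above, and the real reference in the docs includes the masking/dropout/GQA pieces needed for a true bitwise match):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q, k, v = (torch.randn(2, 4, 8, 16, dtype=torch.bfloat16) for _ in range(3))
with sdpa_kernel(SDPBackend.MATH):
    expected = F.scaled_dot_product_attention(q, k, v)
actual = sdpa_reference(q, k, v)
# rtol=0, atol=0 requires an exact bitwise match, not just closeness.
torch.testing.assert_close(actual, expected, rtol=0, atol=0)
```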

MATH backend pre-scales Q and K before the softmax, leading to numerical differences
when comparing the ref impl with MATH.

Now matches SDPBackend.MATH with `rtol=0., atol=0.`

pytorch-bot bot commented Sep 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163508

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 7cb5f4f with merge base 96a3afb:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@tom-pollak
Contributor Author

@pytorchbot label "module: sdpa"

@pytorch-bot pytorch-bot bot added the module: sdpa All things related to torch.nn.functional.scaled_dot_product_attention label Sep 22, 2025
@tom-pollak
Contributor Author

@pytorchbot label "release notes: nn"

@pytorch-bot pytorch-bot bot added the release notes: nn release notes category label Sep 22, 2025
Just whitespace changes, but no longer a copy-paste from the docs
@albanD albanD removed their request for review September 22, 2025 14:13
@jbschlosser jbschlosser requested a review from drisspg September 22, 2025 15:52
Contributor

@jbschlosser jbschlosser left a comment

Thanks for the contribution! I agree with the fixes and it's nice to have bitwise-accurate validation against the math backend via tests.

One concern I have is that the test won't catch updates to the reference implementation in the docs. Running this via the doctest mechanism would address this. cc @svekars for insight on how to ensure this validation happens during doctest time

Don't think dropout is seeded the same way on some archs, leading to
different dropout masks
@tom-pollak
Contributor Author

Thanks! Agree, originally I had a doctest but I thought it might pollute the page since it would have to be immediately below the code block. This way it should still catch regressions to the MATH kernel, which seems more likely to introduce a breaking change. Happy to go either way though.

@jbschlosser
Contributor

Agree, originally I had a doctest but I thought it might pollute the page since it would have to be immediately below the code block.

If it's just a few lines to compare reference vs. math, my opinion is that this doesn't pollute the page too much. In fact, it makes it very clear that the reference in the docs is what users should expect from the math backend.

@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Sep 22, 2025
@tom-pollak
Contributor Author

tom-pollak commented Sep 23, 2025

Looking into this: I'm not sure there's a good way to integrate with doctest. Since SDPA isn't defined in a Python def, xdoctest doesn't discover it when scanning (even with --analysis=dynamic).

The current sdpa examples aren't actually run by xdoctest; you can verify this with:

export XDOCTEST_GLOBAL_EXEC="from torch import nn\nimport torch.nn.functional as F\nimport torch"
xdoctest -m torch.nn.functional --analysis dynamic
<examples are run, but not sdpa>

I could add a "fake" doctest in, but that might be more confusing, or I think we'd need to refactor.

@tom-pollak tom-pollak requested a review from drisspg September 24, 2025 10:21
@tom-pollak
Contributor Author

@drisspg Ok to be merged?

Contributor

@drisspg drisspg left a comment

Looks good, thanks

@tom-pollak
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 2, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 2, 5, linux.g6.4xlarge.experimental.nvidia.gpu)

Details for Dev Infra team: raised by workflow job

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Oct 2, 2025
@tom-pollak
Contributor Author

@drisspg my bad, needed to put temp_mask on the attn_bias device. Fixed now!
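
For context, the fix is of this shape (illustrative sizes and names, not the exact diff): the causal mask has to be created on the same device as `attn_bias`, otherwise the masked_fill can hit a device mismatch on CUDA:

```python
import torch

L, S = 4, 4  # illustrative sizes
attn_bias = torch.zeros(L, S)  # may live on a CUDA device in practice
# Build the causal mask on attn_bias's device to avoid a cross-device masked_fill.
temp_mask = torch.ones(L, S, dtype=torch.bool, device=attn_bias.device).tril(diagonal=0)
attn_bias = attn_bias.masked_fill(temp_mask.logical_not(), float("-inf"))
```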
