[SDPA] Add an optional scale kwarg #95259
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95259
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 2 Pending as of commit ac517f3. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 6f295cc to 143b654
@@ -3,12 +3,13 @@
#include <c10/macros/Export.h>
#include <ATen/native/DispatchStub.h>
#include <ATen/native/transformers/attention.h>
#include "c10/util/Optional.h"
fix
@@ -247,6 +250,8 @@ std::tuple<at::Tensor, at::Tensor, at::Tensor> _efficient_attention_backward(
  p.grad_key_ptr = (scalar_t*)grad_k.data_ptr();
  p.grad_value_ptr = (scalar_t*)grad_v.data_ptr();
  p.delta_ptr = (float*)delta.data_ptr();
  p.scale = scale.has_value() ? 1.0f / scale.value()
It seems easier to not make scale optional than to do this ternary dance everywhere.
Okay, figured out why the dance:
Scale needs to be the last arg, but it would then come after args that already have defaults, so it needs a default of its own; otherwise: `error: missing default argument on parameter 'scale'`
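For reference, here is a minimal Python sketch (not the PR's kernel code; it ignores masking, dropout, and is_causal) of the fallback behavior that the optional-scale ternaries implement: when no scale is passed, the kernels fall back to 1/sqrt(head_dim).

```python
import math
import torch

def sdpa_math_sketch(q, k, v, scale=None):
    # Fallback default: 1/sqrt(head_dim) when no scale is supplied; this is the
    # behavior the has_value() ternaries implement in the C++ kernels.
    scale_factor = scale if scale is not None else 1.0 / math.sqrt(q.size(-1))
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale_factor, dim=-1)
    return attn @ v
```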
Force-pushed from 41fd53d to a7cb685
Make sure that the dynamo graph break mechanism handles this new kwarg gracefully. It should, since it's kwarg-only, but update the tests.
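As a rough illustration of the kind of check being asked for (a sketch that assumes a PyTorch build with torch.compile available; this is not the actual test updated in the PR):

```python
import torch
import torch.nn.functional as F

# Sketch: compile a function that passes the new keyword-only `scale` kwarg and
# check that it runs end to end without dynamo choking on the kwarg.
@torch.compile
def attn(q, k, v):
    return F.scaled_dot_product_attention(q, k, v, scale=0.125)

q = k = v = torch.randn(2, 4, 8, 16)
print(attn(q, k, v).shape)  # torch.Size([2, 4, 8, 16])
```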
Force-pushed from 1f3e274 to b9f32c0
Force-pushed from bcfa9c4 to ce9811b
Force-pushed from 637a503 to cd15ae8
See leftover comments. Otherwise this seems good to go.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
# Summary
This PR adds an optional `scale` kwarg to `torch.nn.functional.scaled_dot_product_attention()`. The new kwarg is a scaling factor that is applied after the `q@k.T` step of the computation. The efficient kernel was updated to support it, and the flash and math kernels were minimally updated to support it as well.
This will reduce the complexity of #94729 and has been asked for by a couple of users.
# Review Highlights
- As far as I know I did this the correct way, and it is both BC and FC compliant. However, I always seem to break internal workloads, so I would love it if someone could advise on whether I did this right.
- I named the optional arg `scale`. This is probably dumb and I should name it `scale_factor`. I will make that change, but it is annoying and will require someone deciding we should rename it.
- `scale` is interpreted as `q@k.T * (scale)`

Pull Request resolved: pytorch/pytorch#95259
Approved by: https://github.com/cpuhrsch
cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire
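To close, a small usage sketch (illustrative only, not part of the PR) showing how the new kwarg is interpreted: the attention weights are computed from `q@k.T * scale`, so an explicit scale equal to the implicit default should match a manual computation.

```python
import math
import torch
import torch.nn.functional as F

# Illustrative usage sketch: an explicit `scale` equal to the implicit default
# 1/sqrt(E) should match a manual softmax(q @ k.T * scale) @ v computation
# with no mask and no dropout.
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))
scale = 1.0 / math.sqrt(q.size(-1))

out = F.scaled_dot_product_attention(q, k, v, scale=scale)
ref = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1) @ v
print(torch.allclose(out, ref, atol=1e-5))  # expected: True (up to kernel numerics)
```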