add syncBN support for custom device #104250
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104250
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f275cc5.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
@pytorchbot label "ciflow/inductor"
@pytorchbot label "ciflow/trunk"
@@ -49,7 +49,8 @@ def forward(self, input, weight, bias, running_mean, running_var, eps, momentum,
     # batch_norm_gather_stats_with_counts calculates global mean & invstd based on
     # all gathered mean, invstd and count.
     # for nccl backend, use the optimized version of all gather.
-    if process_group._get_backend_name() == 'nccl':
+    # The Gloo backend does not support `all_gather_into_tensor`.
+    if process_group._get_backend_name() != "gloo":
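For context, here is a rough sketch (paraphrased, not the exact upstream SyncBatchNorm code) of what the changed condition controls: it only selects between the fused `all_gather_into_tensor` path and the list-based `all_gather` path for exchanging the per-rank statistics.

```python
import torch
import torch.distributed as dist

def gather_stats(combined, process_group, world_size):
    # `combined` holds this rank's flattened [mean, invstd, count] row.
    if process_group._get_backend_name() != "gloo":
        # Fused path: gather every rank's row into one flat tensor,
        # then reshape to (world_size, row_numel).
        combined_flat = torch.empty(
            world_size * combined.numel(),
            dtype=combined.dtype,
            device=combined.device,
        )
        dist.all_gather_into_tensor(combined_flat, combined, group=process_group)
        return combined_flat.view(world_size, -1)
    else:
        # Gloo path: gather into a list of per-rank tensors and stack them.
        combined_list = [torch.empty_like(combined) for _ in range(world_size)]
        dist.all_gather(combined_list, combined, group=process_group)
        return torch.stack(combined_list, dim=0)
```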
cc @H-Huang does that look ok to you?
This would fail for backends like MPI (or other third-party backends) that are not Gloo but also don't support all_gather_into_tensor. Which backend are you targeting for this change, and is there urgency behind this feature?
A better way would be to introduce collective fallbacks, which try a collective on a backend and, if it is not available, fall back to another collective. We want to move away from hard checks for backends in our code and be more backend agnostic. However, this feature is not available yet and still requires discussion/design from the distributed team.
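To illustrate the fallback idea, here is a minimal sketch; the helper name and the exception-based capability detection are assumptions, not an existing PyTorch API.

```python
import torch
import torch.distributed as dist

def _all_gather_with_fallback(tensor, process_group, world_size):
    # Hypothetical helper: try the fused collective first and fall back to the
    # generic one if the backend does not implement it.
    try:
        out = torch.empty(
            world_size * tensor.numel(), dtype=tensor.dtype, device=tensor.device
        )
        dist.all_gather_into_tensor(out, tensor, group=process_group)
        return out.view(world_size, -1)
    except RuntimeError:
        # Backends without all_gather_into_tensor (e.g. Gloo) are expected to
        # raise here; that assumption would need to hold for every backend.
        chunks = [torch.empty_like(tensor) for _ in range(world_size)]
        dist.all_gather(chunks, tensor, group=process_group)
        return torch.stack(chunks, dim=0)
```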
Yeah, I will make that change. We have a custom device using the privateuse1 backend; it is a CUDA-like device, and we also want to support syncBN and this optimization.
Could you approve if all the tests are OK? @albanD
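For reference, using SyncBatchNorm from a custom privateuse1 device would look roughly like the standard recipe below; the backend name "my_backend" and the placeholder rank/world size/device are assumptions, not names from this PR.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

rank, world_size = 0, 1        # placeholders; normally set by the launcher
device = torch.device("cpu")   # placeholder for the custom, CUDA-like device

# "my_backend" stands for whatever process-group backend the custom
# (privateuse1) device registers.
dist.init_process_group(backend="my_backend", rank=rank, world_size=world_size)

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
# convert_sync_batchnorm swaps BatchNorm2d for SyncBatchNorm; with this change
# the forward no longer requires the nccl backend, it only avoids the fused
# all-gather on gloo.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
```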
That sounds pretty safe to me so tentatively accepting!
If @H-Huang says it's not ok, we can revert!
@pytorchbot merge -r
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased a9e02ce to b979ae3 (Compare).
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot merge -r
The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased b979ae3 to f275cc5 (Compare).
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 4 mandatory check(s) failed.
Dig deeper by viewing the failures on hud.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #ISSUE_NUMBER
There are some hard checks for cuda, so I optimized the check so that we can run it on other devices.