add syncBN support for custom device #104250
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104250
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f275cc5.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
@pytorchbot label "ciflow/inductor"
@pytorchbot label "ciflow/trunk"
@@ -49,7 +49,8 @@ def forward(self, input, weight, bias, running_mean, running_var, eps, momentum,
     # batch_norm_gather_stats_with_counts calculates global mean & invstd based on
     # all gathered mean, invstd and count.
     # for nccl backend, use the optimized version of all gather.
-    if process_group._get_backend_name() == 'nccl':
+    # The Gloo backend does not support `all_gather_into_tensor`.
+    if process_group._get_backend_name() != "gloo":
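For context, here is a rough sketch (paraphrased, not the exact upstream SyncBatchNorm code) of what the changed condition controls: it only selects between the fused `all_gather_into_tensor` path and the list-based `all_gather` path for exchanging the per-rank statistics.

```python
import torch
import torch.distributed as dist

def gather_stats(combined, process_group, world_size):
    # `combined` holds this rank's flattened [mean, invstd, count] row.
    if process_group._get_backend_name() != "gloo":
        # Fused path: gather every rank's row into one flat tensor,
        # then reshape to (world_size, row_numel).
        combined_flat = torch.empty(
            world_size * combined.numel(),
            dtype=combined.dtype,
            device=combined.device,
        )
        dist.all_gather_into_tensor(combined_flat, combined, group=process_group)
        return combined_flat.view(world_size, -1)
    else:
        # Gloo path: gather into a list of per-rank tensors and stack them.
        combined_list = [torch.empty_like(combined) for _ in range(world_size)]
        dist.all_gather(combined_list, combined, group=process_group)
        return torch.stack(combined_list, dim=0)
```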
cc @H-Huang does that look ok to you?
This would fail for backends like MPI (or other third-party backends) that are not Gloo but also don't support all_gather_into_tensor. Which backend are you targeting for this change, and is there urgency behind this feature?
A better way would be to introduce collective fallbacks, which try a collective on a backend and, if it is not available, fall back to another collective. We want to move away from hard checks for backends in our code and be more backend agnostic. However, this feature is not available yet and still requires discussion/design from the distributed team.
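To illustrate the fallback idea, here is a minimal sketch; the helper name and the exception-based capability detection are assumptions, not an existing PyTorch API.

```python
import torch
import torch.distributed as dist

def _all_gather_with_fallback(tensor, process_group, world_size):
    # Hypothetical helper: try the fused collective first and fall back to the
    # generic one if the backend does not implement it.
    try:
        out = torch.empty(
            world_size * tensor.numel(), dtype=tensor.dtype, device=tensor.device
        )
        dist.all_gather_into_tensor(out, tensor, group=process_group)
        return out.view(world_size, -1)
    except RuntimeError:
        # Backends without all_gather_into_tensor (e.g. Gloo) are expected to
        # raise here; that assumption would need to hold for every backend.
        chunks = [torch.empty_like(tensor) for _ in range(world_size)]
        dist.all_gather(chunks, tensor, group=process_group)
        return torch.stack(chunks, dim=0)
```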
Yeah, I will make that change. We have a custom device using the privateuse1 backend; it is a CUDA-like device, and we also want to support syncBN and this optimization.
Could you approve if all the tests are OK? @albanD
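For reference, using SyncBatchNorm from a custom privateuse1 device would look roughly like the standard recipe below; the backend name "my_backend" and the placeholder rank/world size/device are assumptions, not names from this PR.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

rank, world_size = 0, 1        # placeholders; normally set by the launcher
device = torch.device("cpu")   # placeholder for the custom, CUDA-like device

# "my_backend" stands for whatever process-group backend the custom
# (privateuse1) device registers.
dist.init_process_group(backend="my_backend", rank=rank, world_size=world_size)

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
# convert_sync_batchnorm swaps BatchNorm2d for SyncBatchNorm; with this change
# the forward no longer requires the nccl backend, it only avoids the fused
# all-gather on gloo.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
```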
That sounds pretty safe to me so tentatively accepting!
If @H-Huang says it's not ok, we can revert!
@pytorchbot merge -r
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased a9e02ce to b979ae3 (Compare).
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot merge -r
The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased b979ae3 to f275cc5 (Compare).
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 4 mandatory check(s) failed.
Dig deeper by viewing the failures on hud.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #ISSUE_NUMBER
There are some hard checks for cuda, so I optimized the check so that we can run it on other devices.