Skip to content

Conversation

heidongxianhua
Copy link
Contributor

@heidongxianhua heidongxianhua commented Jun 27, 2023

Fixes #ISSUE_NUMBER
there are some hard checks for cuda, so I make optimize the check so that we can run it for other device.

@pytorch-bot
Copy link

pytorch-bot bot commented Jun 27, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104250

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f275cc5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@heidongxianhua
Copy link
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jun 27, 2023
@heidongxianhua
Copy link
Contributor Author

@pytorchbot label "ciflow/inductor"

@heidongxianhua
Copy link
Contributor Author

@pytorchbot label "ciflow/trunk"

@pytorch-bot pytorch-bot bot added ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request labels Jun 27, 2023
@@ -49,7 +49,8 @@ def forward(self, input, weight, bias, running_mean, running_var, eps, momentum,
# batch_norm_gather_stats_with_counts calculates global mean & invstd based on
# all gathered mean, invstd and count.
# for nccl backend, use the optimized version of all gather.
if process_group._get_backend_name() == 'nccl':
# The Gloo backend does not support `all_gather_into_tensor`.
if process_group._get_backend_name() != "gloo":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @H-Huang does that look ok to you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@H-Huang @fegin sorry to bother could you have a look?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would fail for backends like MPI (or other 3rd party backends) are not gloo but which also dont support all_gather_into_tensor. Which backend are you targeting for this change and is there urgency behind this feature?

A better way would be to introduce collective fallbacks which try a collective on a backend and if it is not available then fallback to another collective. We want to move away from hard checks for backends in our code to be more backend agnostic. However this feature is not available and still involves discussion/design from the distributed team

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would fail for backends like MPI (or other 3rd party backends) are not gloo but which also dont support all_gather_into_tensor. Which backend are you targeting for this change and is there urgency behind this feature?

A better way would be to introduce collective fallbacks which try a collective on a backend and if it is not available then fallback to another collective. We want to move away from hard checks for backends in our code to be more backend agnostic. However this feature is not available and still involves discussion/design from the distributed team

yeah I will make a change,and we are custom device with privateuse1 backend, it is a cuda-like device, we also want to support syncBN and this optimize.

@malfet malfet added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jun 27, 2023
@heidongxianhua
Copy link
Contributor Author

could you give an approve and all the tests are ok? @albanD

@mikaylagawarecki mikaylagawarecki requested a review from H-Huang July 14, 2023 14:39
Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds pretty safe to me so tentatively accepting!
If @H-Huang says it's not ok, we can revert!

@albanD
Copy link
Collaborator

albanD commented Jul 14, 2023

@pytorchbot merge -r

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased fix_syncbn onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_syncbn && git pull --rebase)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@heidongxianhua
Copy link
Contributor Author

@pytorchbot merge -r

@pytorchmergebot
Copy link
Collaborator

The merge job was canceled. If you believe this is a mistake,then you can re trigger it through pytorch-bot.

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased fix_syncbn onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_syncbn && git pull --rebase)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

@mikaylagawarecki
Copy link
Contributor

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@heidongxianhua heidongxianhua deleted the fix_syncbn branch January 8, 2025 03:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants