[CI] Enable UCC in CI #100395
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100395
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures as of commit 235f0d1.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi, can you please elaborate on what UCC is? I don't seem to know about it.
@DuanBoomer UCC (Unified Collective Communication) is a collective communication library used for distributed parallel training.
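For anyone wanting to check whether their own PyTorch build includes UCC, here is a minimal sketch. It assumes `torch.distributed.is_ucc_available()` is present in the installed version and falls back gracefully when it is not. The helper name `ucc_backend_status` is made up for illustration and is not part of this PR.

```python
import importlib.util


def ucc_backend_status() -> str:
    """Report whether this Python environment can use the UCC backend.

    Returns "no-torch" if PyTorch is not installed, "available" if the
    installed build reports UCC support, and "unavailable" otherwise.
    (Hypothetical helper, for illustration only.)
    """
    if importlib.util.find_spec("torch") is None:
        return "no-torch"

    import torch.distributed as dist

    # Builds without distributed support cannot have UCC either.
    if not dist.is_available():
        return "unavailable"

    # Older releases may lack is_ucc_available(), so look it up defensively.
    is_ucc = getattr(dist, "is_ucc_available", None)
    if callable(is_ucc) and is_ucc():
        return "available"
    return "unavailable"


if __name__ == "__main__":
    print(f"UCC backend: {ucc_backend_status()}")
```

If this reports "available", the backend would normally be selected by passing `backend="ucc"` to `torch.distributed.init_process_group`. Whether a given binary ships with UCC depends on how it was built.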
cc @atalman
LGTM
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed
Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase and merge by leaving the following comment on this PR:
Details for Dev Infra team
Raised by workflow job
@pytorchbot merge -r
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased
d6a21bc to 235f0d1
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed
Reason: 1 job has failed: linux-binary-libtorch-cxx11-abi / libtorch-rocm5_3-static-with-deps-cxx11-abi-test
Details for Dev Infra team
Raised by workflow job
Hmm, looks like ROCm is broken:
Shall I force-merge it?
@pytorchbot merge -f "Rocm failures are not related"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
UCC was temporarily disabled in #98832. This PR re-enables it with the necessary fix.
cc @malfet @seemethere @pytorch/pytorch-dev-infra