Conversation

Aidyn-A
Collaborator

@Aidyn-A Aidyn-A commented May 1, 2023

UCC was temporarily disabled in #98832. This PR re-enables it with the necessary fix.

cc @malfet @seemethere @pytorch/pytorch-dev-infra

@pytorch-bot

pytorch-bot bot commented May 1, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100395

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures

As of commit 235f0d1:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Aidyn-A Aidyn-A requested a review from jeffdaily as a code owner May 1, 2023 19:01
@Aidyn-A Aidyn-A marked this pull request as draft May 1, 2023 19:12
@Aidyn-A Aidyn-A force-pushed the c121_enable_ucc branch from 813f9db to 5c9c1b8 Compare May 1, 2023 19:36
@DuanBoomer

Hi, can you please elaborate on what UCC is? I'm not familiar with it.

@Aidyn-A
Collaborator Author

Aidyn-A commented May 2, 2023

@DuanBoomer UCC is the Unified Collective Communication library, used for distributed parallel training.

@Aidyn-A Aidyn-A marked this pull request as ready for review May 2, 2023 16:58
@Aidyn-A
Collaborator Author

Aidyn-A commented May 2, 2023

cc @atalman

@Aidyn-A Aidyn-A changed the title [WIP] [CI] Enable UCC in CI [CI] Enable UCC in CI May 18, 2023
@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label May 18, 2023
@jbschlosser jbschlosser added module: build Build system issues module: ci Related to continuous integration labels May 19, 2023
@jbschlosser jbschlosser requested a review from atalman May 19, 2023 20:49
@jbschlosser jbschlosser added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 19, 2023
@huydhn huydhn added ciflow/trunk Trigger trunk jobs on your pull request ciflow/unstable Run all experimental or flaky jobs on PyTorch unstable workflow labels May 19, 2023
Contributor

@atalman atalman left a comment

LGTM

@atalman atalman added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Jun 6, 2023
@atalman
Contributor

atalman commented Jun 7, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase and merge by leaving the following comment on this PR:
@pytorchbot merge -r
Or just rebase by leaving an @pytorchbot rebase comment.

Details for Dev Infra team Raised by workflow job

@atalman
Contributor

atalman commented Jun 7, 2023

@pytorchbot merge -r

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased c121_enable_ucc onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout c121_enable_ucc && git pull --rebase)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: linux-binary-libtorch-cxx11-abi / libtorch-rocm5_3-static-with-deps-cxx11-abi-test

@Aidyn-A
Collaborator Author

Aidyn-A commented Jun 8, 2023

Hmm, looks like rocm is broken:

./simple-torch-test: error while loading shared libraries: libhipsolver.so.0: cannot open shared object file: No such file or directory

Shall I force-merge it?

@atalman
Contributor

atalman commented Jun 8, 2023

@pytorchbot merge -f "Rocm failures are not related"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Labels
ciflow/binaries: Trigger all binary build and upload jobs on the PR
ciflow/trunk: Trigger trunk jobs on your pull request
ciflow/unstable: Run all experimental or flaky jobs on PyTorch unstable workflow
Merged
module: build: Build system issues
module: ci: Related to continuous integration
open source
release notes: distributed (c10d): release notes category
triaged: This issue has been looked at a team member, and triaged and prioritized into an appropriate module
7 participants