[CUDA12] set_device change #94864
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/94864. Note: links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 88d7b8b. This comment was automatically generated by Dr. CI and updates every 15 minutes.
From offline discussion: this PR would most likely need to use a lazy loading approach.
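For context on what a lazy approach would avoid, here is a minimal, hypothetical Python sketch (not code from this PR): under CUDA 12, `cudaSetDevice` itself creates the primary context on the target device, so even a bare device switch can reserve per-GPU context memory. `torch._C._cuda_hasPrimaryContext` is a private helper used in PyTorch's own primary-context tests, and the device index assumes a machine with at least two GPUs.

```python
import torch

# Hypothetical illustration, not code from this PR.
# Under CUDA 12, cudaSetDevice eagerly creates the primary context on the
# target device, so even switching devices without allocating a tensor can
# already reserve context memory on that GPU.
torch.cuda.set_device(1)

# Private helper used by PyTorch's own tests to check whether a primary
# context has been created on a given device.
print(torch._C._cuda_hasPrimaryContext(1))  # may already be True under CUDA 12
```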
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: Meta Internal-Only Changes Check. Details for Dev Infra team: raised by workflow job.
I've unlinked the PR internally, so the next merge attempt should succeed, but let's not do it before the weekend.
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: Meta Internal-Only Changes Check. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hi @Aidyn-A! Soon after this PR landed, some long-standing FSDP unit tests became flaky (#99011, #98821). I am not entirely sure of the connection, but we see:
More stacktrace (run on 8 GPUs)
I wonder if there could be any conflict with this change. Perhaps one possible remediation is to revert this PR for now? cc: @ezyang @ngimel
@pytorchbot revert
❌ 🤖 pytorchbot command failed:
Try
@pytorchbot revert -m "causes flaky fsdp failures" -c weird |
@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 94864 failed. Reason: Command
Details for Dev Infra team: raised by workflow job.
This PR adds a workaround for the CUDA 12 `cudaSetDevice` change, which now always creates a primary context on the target device. Without it, operations like the one sketched below would create a primary context on device `cuda:1`, because a tensor is created there, and also on device `cuda:0`, because the destructor of the CUDA device guard calls `cudaSetDevice(0)`. After this PR the CUDA device guard will not call `cudaSetDevice(0)` if a primary context does not exist on `cuda:0`.

cc @ezyang @gchanan
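The snippet referenced in the description is not reproduced on this page, so here is a hedged reconstruction of the scenario it describes (assumes two GPUs; `torch._C._cuda_hasPrimaryContext` is a private helper used in PyTorch's own tests, not part of the public API):

```python
import torch

# Creating a tensor on the second GPU initializes the primary context on
# cuda:1, as expected.
x = torch.zeros(10, device="cuda:1")

# Before this PR, the CUDA device guard used during the allocation restored
# the original device via cudaSetDevice(0) in its destructor, which under
# CUDA 12 also created a primary context on cuda:0. After this PR the guard
# skips that call when cuda:0 has no primary context yet.
print(torch._C._cuda_hasPrimaryContext(1))  # True: the tensor lives here
print(torch._C._cuda_hasPrimaryContext(0))  # expected to stay False after this PR
```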