Use official CUDAToolkit module in CMake #154595
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154595
Note: Links to docs will display an error until the docs builds have been completed.
❌ 5 New Failures, 1 Unrelated Failure as of commit ae40a3d with merge base cf4964b.
NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
Force-pushed e67982f to c3f0359.
Force-pushed 4037656 to 9d5da10.
Force-pushed ed247ae to 22bb4d5.
Sorry for the delay in reviewing this; my review queue has been pretty backed up.
This is AMAZING!!!
The change sounds good to me (even though I'm in no way a CMake expert).
But if CI/CD is happy (including the cpp extensions tests), I think we're good to go.
Let's try to land this as is!
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased
Force-pushed 22bb4d5 to 2594e27.
@ngimel Detection of the native CPU architecture could be changed to […]. One fix is using […].
Fixed, see commit 3f789e9. Also note that the issue existed before this PR but was only revealed by these changes.
Do you know how this "native" option would work later when we check whether the build is OK for the current GPU, so that we can give a clear error message on mismatch?
@ngimel From the nvcc documentation: […]
CMake does little work here; we rely on nvcc. (IMO they don't want to maintain these flags...)
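For context, a minimal sketch of how the "native" value is expressed in CMake (requires CMake 3.24 or newer; the project and target names are illustrative, not taken from this PR):

```cmake
cmake_minimum_required(VERSION 3.24)
project(native_arch_demo LANGUAGES CXX CUDA)

# "native" defers architecture detection to nvcc (its -arch=native mode),
# which probes the GPUs installed on the build machine. CMake itself does
# not resolve the architecture list in this case.
set(CMAKE_CUDA_ARCHITECTURES native)

add_executable(demo demo.cu)
# Equivalent per-target form:
# set_target_properties(demo PROPERTIES CUDA_ARCHITECTURES native)
```

Because the resolution happens inside nvcc, a later runtime mismatch check would have to inspect the compiled fatbin or query the device directly, rather than read a CMake variable.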
This PR tries to do too many things in one go (including renames).
Can it be split into 2-3 PRs, one of which would use the new CUDAToolkit package but define all the aliases the system used to, say set(CUDA_VERSION ${CUDAToolkit_VERSION}), etc.?
Or alternatively, have a baseline PR that changes those in the existing FindCUDA in preparation for the new package version.
It looks like there are some changes to how the nvrtc package is defined before/after this change. In my opinion, it would be good to keep the old definitions in place rather than pushing them to custom copy scripts, which will not be executed for users running outside of CI.
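A sketch of the compatibility shim suggested above, mapping the official CUDAToolkit module's variables back onto the legacy FindCUDA names (the exact set of aliases PyTorch would need is an assumption; only CUDA_VERSION comes from the comment itself):

```cmake
find_package(CUDAToolkit REQUIRED)

# Re-create variables the old FindCUDA module used to define, so that
# downstream CMake code keeps working during the transition.
set(CUDA_VERSION          ${CUDAToolkit_VERSION})
set(CUDA_TOOLKIT_ROOT_DIR ${CUDAToolkit_LIBRARY_ROOT})
set(CUDA_INCLUDE_DIRS     ${CUDAToolkit_INCLUDE_DIRS})
set(CUDA_NVCC_EXECUTABLE  ${CUDAToolkit_NVCC_EXECUTABLE})
```

Such a shim would let the migration land incrementally: consumers of the old variables keep building while targets are moved over to the imported CUDA:: targets one by one.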
```python
os.system(f"unzip {wheel_path} -d {folder}/tmp")
libs_to_copy = [
    "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
    "/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
    # ...
]
```
Why is this change necessary if the goal is just to remove FindCUDA?
Some CI jobs broke because nvperf_host.so could not be found, and nvperf_host.so is indeed required by libcupti.so. If we install libcupti.so, we should also install nvperf_host.so.
> Some CI jobs broke because nvperf_host.so could not be found, and nvperf_host.so is indeed required by libcupti.so.

Could you link the failing jobs? I don't understand why we would need the nvperf_* libs now without changing profiling usage in PyTorch or CUPTI itself. Why and how was profiling working before? The nvperf_* libs are used for PC sampling, PM sampling, SASS metrics, or range profiling, and I don't see any related change in this PR, so are we using these?
Use the CUDA language in CMake and remove the forked FindCUDAToolkit.cmake. Some CUDA targets are also renamed with the torch:: prefix.

cc @albanD
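In broad strokes, the new-style setup replaces the forked FindCUDA machinery with CMake's first-class CUDA language support plus the imported CUDA:: targets from the official FindCUDAToolkit module. A hedged sketch (the torch_cuda_demo target and the torch:: alias name are illustrative, not the PR's actual targets):

```cmake
cmake_minimum_required(VERSION 3.18)
project(torch_cuda_demo LANGUAGES CXX CUDA)

# Imported targets such as CUDA::cudart and CUDA::nvrtc come from the
# FindCUDAToolkit module shipped with CMake itself, so no forked copy
# of the module needs to be maintained in-tree.
find_package(CUDAToolkit REQUIRED)

add_library(torch_cuda_demo demo.cu)
target_link_libraries(torch_cuda_demo PRIVATE CUDA::cudart CUDA::nvrtc)

# Namespaced alias, mirroring the torch:: renames mentioned in the PR.
add_library(torch::cuda_demo ALIAS torch_cuda_demo)
```

Linking against the imported targets also propagates the toolkit's include directories automatically, which is a large part of what the removed FindCUDA glue used to do by hand.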