
Conversation

@wphicks (Contributor) commented Feb 26, 2025

Due to an upstream change, CPU FIL was touching the CUDA context in a way that required a GPU to be present in order for it to be used. Until CPU FIL is actually included in the cuML CPU build (which will avoid this problem for anyone using the cuML CPU package), this change ensures that CPU FIL can still be run even if no GPU is available.
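
For illustration, here is a minimal Python sketch of the kind of guard this implies (not the actual diff; `predict_cpu` and `predict_gpu` are hypothetical names):

```python
# Minimal sketch of guarding CUDA context access: the CPU path must never
# probe or create a GPU context. Uses numba.cuda.is_available(), which
# reports whether a usable CUDA device is present.
from numba import cuda


def _gpu_available() -> bool:
    try:
        return cuda.is_available()
    except Exception:
        # Any failure probing the driver means no usable GPU.
        return False


def predict(model, X, device="cpu"):
    # Only touch the CUDA context when GPU execution is requested
    # and a device actually exists.
    if device == "gpu" and _gpu_available():
        return model.predict_gpu(X)  # hypothetical GPU path
    return model.predict_cpu(X)      # CPU path: no CUDA context created
```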

wphicks requested a review from a team as a code owner on Feb 26, 2025
wphicks requested reviews from betatim and dantegd on Feb 26, 2025
github-actions bot added the Cython / Python label on Feb 26, 2025
wphicks added the bug, non-breaking, and cuml-cpu labels on Feb 26, 2025
@wphicks (Contributor, Author) commented Feb 26, 2025

I'm not sure exactly how we want to go about testing this change (if at all). We only run cuML CPU tests in a GPU-less environment (which makes sense), so until FIL is included in CPU builds, this would be the only thing requiring a test of GPU cuML in a GPU-less environment. Because of this, I would recommend that we just work toward including FIL in the cuML CPU build and not worry about this for now.

@betatim (Member) commented Feb 27, 2025

Is the cause of this that someone upstream changed something, and that change led to the breakage? Was the change a bug/mistake, or was it within the "API spec"?

Basically, how could we have noticed this? Maybe that helps us figure out a way to test this that doesn't require running GPU cuML tests in a CPU-only environment.

@wphicks (Contributor, Author) commented Mar 3, 2025

I'm afraid that until we get FIL into the CPU build, we'd have to run GPU FIL in a GPU-less environment to catch something like this. In a CPU build, this whole behavior would be suppressed at compile time. In a GPU build, we're always going to need the code paths that do GPU-specific things, and the only comprehensive way to ensure they are not touched when we are using the CPU is to actually run them without a GPU and see what breaks. Regardless of where the error originates, I don't see a way around that. Even if we did something clever like intercepting calls to the CUDA API, I imagine that would require a lot of work and still leave me less confident than simply running tests with CUDA_VISIBLE_DEVICES=''.
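
As a concrete illustration of that last point, a minimal sketch of such a GPU-less test run (the test module path is illustrative, not the actual cuML layout):

```python
# Minimal sketch: hide every GPU from the child process, then run the FIL
# tests in a fresh interpreter so no CUDA context leaks in from the parent.
import os
import subprocess
import sys

env = dict(os.environ, CUDA_VISIBLE_DEVICES="")  # empty string hides all GPUs

subprocess.run(
    [sys.executable, "-m", "pytest", "python/cuml/tests/test_fil.py"],  # illustrative path
    env=env,
    check=True,  # fail loudly if any test touches the (absent) GPU
)
```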

I think the best long-term solution here is just to get FIL into the CPU build so we have a way around this problem altogether. When I originally designed FIL CPU, running the GPU build in a GPU-less environment was "out of scope," so it's largely chance that this was supported at all.

@wphicks (Contributor, Author) commented Mar 3, 2025

/merge

rapids-bot merged commit 87ecca0 into rapidsai:branch-25.04 on Mar 3, 2025
66 of 68 checks passed