fix: properly gate usage of torch.cuda.device_count #3358
Conversation
I think we should remove
booxter left a comment:
This seems reasonable to unbreak HPU. Ideally, if we have the time and the will, we should eventually implement an abstract device counter that would handle all kinds of accelerators.
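The abstract device counter suggested above could be sketched roughly as follows. This is a hypothetical illustration, not code from the PR: the function names, the registry layout, and the `habana_frameworks` probe are assumptions about how such an abstraction might look.

```python
from typing import Callable, Dict


def _count_cuda() -> int:
    """Count CUDA devices, returning 0 when torch or CUDA is unavailable."""
    try:
        import torch
        return torch.cuda.device_count() if torch.cuda.is_available() else 0
    except ImportError:
        return 0


def _count_hpu() -> int:
    """Count HPU devices via the Habana plugin, if it is installed."""
    try:
        # Assumes the Habana PyTorch bridge package; absent on most systems.
        import habana_frameworks.torch.hpu as hpu
        return hpu.device_count()
    except ImportError:
        return 0


# Registry mapping an accelerator kind to its counter; new backends
# (e.g. ROCm, which reuses the torch.cuda namespace) would register here.
DEVICE_COUNTERS: Dict[str, Callable[[], int]] = {
    "cuda": _count_cuda,
    "hpu": _count_hpu,
}


def device_count(kind: str) -> int:
    """Return the device count for a given accelerator kind, 0 if unknown."""
    counter = DEVICE_COUNTERS.get(kind)
    return counter() if counter is not None else 0
```

The point of the registry is that callers never touch `torch.cuda` directly, so adding a new accelerator means adding one counter function rather than auditing every call site.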
s-akhtar-baig left a comment:
Hey @cdoern! I have added a minor comment, please take a look when you get a chance. Thanks!
since we support HPU and ROCm for inference, we need to gate torch.cuda.device_count since on HPU specifically this will cause issues. use similar logic as we do in `init.py` to ensure CUDA is available, and HPU/ROCm are not. Signed-off-by: Charlie Doern <[email protected]>
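The gating the commit describes could look roughly like the sketch below. This is a hedged illustration, not the actual diff from the PR: the function name `get_device_count` and the fallback value are assumptions, and the import guard is only there so the example is self-contained.

```python
def get_device_count() -> int:
    """Return the accelerator device count, gating torch.cuda access.

    Only call torch.cuda.device_count() when CUDA is actually available;
    on HPU systems, querying the CUDA runtime unconditionally can fail
    or report misleading results.
    """
    try:
        import torch
    except ImportError:
        # torch absent entirely; treat as a single-device (CPU) setup.
        return 1

    if torch.cuda.is_available():
        return torch.cuda.device_count()

    # Hypothetical fallback for HPU/ROCm/CPU paths where the CUDA
    # query must not run; the real code would branch per backend.
    return 1
```

The essential change is the `torch.cuda.is_available()` check in front of `torch.cuda.device_count()`, mirroring the availability logic the commit says already exists in `init.py`.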
@Mergifyio backport release-v0.26
✅ Backports have been created
…3359) since we support HPU and ROCm for inference, we need to gate torch.cuda.device_count since on HPU specifically this will cause issues. use similar logic as we do in `init.py` to ensure CUDA is available, and HPU/ROCm are not.

**Checklist:**

- [x] **Commit Message Formatting**: Commit titles and messages follow guidelines in the [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary).
- [ ] [Changelog](https://github.com/instructlab/instructlab/blob/main/CHANGELOG.md) updated with breaking and/or notable changes for the next minor release.
- [ ] Documentation has been updated, if necessary.
- [ ] Unit tests have been added, if necessary.
- [ ] Functional tests have been added, if necessary.
- [ ] E2E Workflow tests have been added, if necessary.

This is an automatic backport of pull request #3358 done by [Mergify](https://mergify.com).

Approved-by: booxter
Approved-by: cdoern