
Conversation

@cdoern
Contributor

@cdoern cdoern commented May 5, 2025

Since we support HPU and ROCm for inference, we need to gate `torch.cuda.device_count()`: on HPU specifically, calling it will cause issues.

Use similar logic to what we already do in `init.py` to ensure CUDA is available and HPU/ROCm are not.
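The gating described above could be sketched roughly like this (a hypothetical helper, not the PR's actual code; the ROCm and HPU detection heuristics are assumptions):

```python
import importlib.util


def cuda_device_count() -> int:
    """Return torch.cuda.device_count() only when CUDA is genuinely the backend.

    Returns 0 on ROCm or HPU systems instead of querying the CUDA runtime.
    """
    try:
        import torch
    except ImportError:
        return 0
    # ROCm builds of torch expose torch.version.hip; HPU support ships in the
    # separate habana_frameworks package. In either case, skip the CUDA query.
    if getattr(torch.version, "hip", None):
        return 0
    if importlib.util.find_spec("habana_frameworks") is not None:
        return 0
    if torch.cuda.is_available():
        return torch.cuda.device_count()
    return 0
```

On a machine without torch, or with a ROCm/HPU stack, this returns 0 rather than tripping over the CUDA runtime.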

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the
    conventional commits.
  • Changelog updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Functional tests have been added, if necessary.
  • E2E Workflow tests have been added, if necessary.

@booxter
Contributor

booxter commented May 5, 2025

I think we should remove "Resolves https://github.com/instructlab/instructlab/issues/3357" — otherwise it will close that issue, which would be undesirable. (We probably want to eventually fix it by correctly calculating devices on HPU.)

Contributor

@booxter booxter left a comment


This seems reasonable to unbreak HPU. Ideally, if we have the time and the will, we should eventually implement an abstract device counter that handles all kinds of accelerators.
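Such an abstract device counter might look something like the sketch below (entirely hypothetical; the function names, the registry, and the `habana_frameworks.torch.hpu` usage are assumptions, not instructlab code):

```python
from typing import Callable, Dict


def _cuda_count() -> int:
    # Count CUDA devices, returning 0 when torch or CUDA is unavailable.
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count() if torch.cuda.is_available() else 0


def _hpu_count() -> int:
    # Count Gaudi/HPU devices via the Habana torch plugin, if installed.
    try:
        import habana_frameworks.torch.hpu as hpu
    except ImportError:
        return 0
    return hpu.device_count()


# Registry mapping each accelerator family to its counter; new families
# (ROCm, MPS, ...) would register here instead of branching inline.
_COUNTERS: Dict[str, Callable[[], int]] = {
    "cuda": _cuda_count,
    "hpu": _hpu_count,
}


def device_count(family: str) -> int:
    counter = _COUNTERS.get(family)
    return counter() if counter is not None else 0
```

The point of the registry is that callers ask for a count by family name and never touch a backend-specific API directly, so an HPU box never executes a CUDA call.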

@mergify mergify bot added the one-approval PR has one approval from a maintainer label May 5, 2025

@s-akhtar-baig s-akhtar-baig left a comment


Hey @cdoern! I have added a minor comment, please take a look when you get a chance. Thanks!

Since we support HPU and ROCm for inference, we need to gate `torch.cuda.device_count()`: on HPU specifically, calling it will cause issues.

Use similar logic to what we already do in `init.py` to ensure CUDA is available and HPU/ROCm are not.

Signed-off-by: Charlie Doern <[email protected]>
@cdoern
Contributor Author

cdoern commented May 5, 2025

@Mergifyio backport release-v0.26

@mergify
Contributor

mergify bot commented May 5, 2025

backport release-v0.26

✅ Backports have been created


@mergify mergify bot removed the one-approval PR has one approval from a maintainer label May 5, 2025
@mergify mergify bot merged commit 4e056f4 into instructlab:main May 5, 2025
27 checks passed
mergify bot added a commit that referenced this pull request May 5, 2025
…3359)

Since we support HPU and ROCm for inference, we need to gate `torch.cuda.device_count()`: on HPU specifically, calling it will cause issues.

Use similar logic to what we already do in `init.py` to ensure CUDA is available and HPU/ROCm are not.

**Checklist:**

- [x] **Commit Message Formatting**: Commit titles and messages follow guidelines in the
  [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary).
- [ ] [Changelog](https://github.com/instructlab/instructlab/blob/main/CHANGELOG.md) updated with breaking and/or notable changes for the next minor release.
- [ ] Documentation has been updated, if necessary.
- [ ] Unit tests have been added, if necessary.
- [ ] Functional tests have been added, if necessary.
- [ ] E2E Workflow tests have been added, if necessary.
<hr>This is an automatic backport of pull request #3358 done by [Mergify](https://mergify.com).


Approved-by: booxter

Approved-by: cdoern