
Conversation

@cdoern
Contributor

@cdoern cdoern commented May 5, 2025

Since we support HPU and ROCm for inference, we need to gate `torch.cuda.device_count()`: on HPU specifically, calling it will cause issues.

Use similar logic to what we already do in `init.py` to ensure CUDA is available and HPU/ROCm are not.
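The gating described above could be sketched roughly like this (a hypothetical helper, not the PR's actual code; the ROCm and HPU detection heuristics are assumptions):

```python
import importlib.util


def cuda_device_count() -> int:
    """Return torch.cuda.device_count() only when CUDA is genuinely the backend.

    Returns 0 on ROCm or HPU systems instead of querying the CUDA runtime.
    """
    try:
        import torch
    except ImportError:
        return 0
    # ROCm builds of torch expose torch.version.hip; HPU support ships in the
    # separate habana_frameworks package. In either case, skip the CUDA query.
    if getattr(torch.version, "hip", None):
        return 0
    if importlib.util.find_spec("habana_frameworks") is not None:
        return 0
    if torch.cuda.is_available():
        return torch.cuda.device_count()
    return 0
```

On a machine without torch, or with a ROCm/HPU stack, this returns 0 rather than tripping over the CUDA runtime.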

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the
    conventional commits.
  • Changelog updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Functional tests have been added, if necessary.
  • E2E Workflow tests have been added, if necessary.

@booxter
Contributor

booxter commented May 5, 2025

I think we should remove "Resolves https://github.com/instructlab/instructlab/issues/3357" — otherwise it will close that issue, which would be undesirable. (We probably want to eventually fix it by correctly calculating devices on HPU.)

Contributor

@booxter booxter left a comment


This seems reasonable to unbreak HPU. Ideally, if we have the time and the will, we should eventually implement an abstract device counter that handles all kinds of accelerators.
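Such an abstract device counter might look something like the sketch below (entirely hypothetical; the function names, the registry, and the `habana_frameworks.torch.hpu` usage are assumptions, not instructlab code):

```python
from typing import Callable, Dict


def _cuda_count() -> int:
    # Count CUDA devices, returning 0 when torch or CUDA is unavailable.
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count() if torch.cuda.is_available() else 0


def _hpu_count() -> int:
    # Count Gaudi/HPU devices via the Habana torch plugin, if installed.
    try:
        import habana_frameworks.torch.hpu as hpu
    except ImportError:
        return 0
    return hpu.device_count()


# Registry mapping each accelerator family to its counter; new families
# (ROCm, MPS, ...) would register here instead of branching inline.
_COUNTERS: Dict[str, Callable[[], int]] = {
    "cuda": _cuda_count,
    "hpu": _hpu_count,
}


def device_count(family: str) -> int:
    counter = _COUNTERS.get(family)
    return counter() if counter is not None else 0
```

The point of the registry is that callers ask for a count by family name and never touch a backend-specific API directly, so an HPU box never executes a CUDA call.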

@mergify mergify bot added the one-approval PR has one approval from a maintainer label May 5, 2025

@s-akhtar-baig s-akhtar-baig left a comment


Hey @cdoern! I have added a minor comment, please take a look when you get a chance. Thanks!

Since we support HPU and ROCm for inference, we need to gate `torch.cuda.device_count()`: on HPU specifically, calling it will cause issues.

Use similar logic to what we already do in `init.py` to ensure CUDA is available and HPU/ROCm are not.

Signed-off-by: Charlie Doern <[email protected]>
@cdoern
Contributor Author

cdoern commented May 5, 2025

@Mergifyio backport release-v0.26

@mergify
Contributor

mergify bot commented May 5, 2025

backport release-v0.26

✅ Backports have been created


@mergify mergify bot removed the one-approval PR has one approval from a maintainer label May 5, 2025
@mergify mergify bot merged commit 4e056f4 into instructlab:main May 5, 2025
27 checks passed
mergify bot added a commit that referenced this pull request May 5, 2025
…3359)

Since we support HPU and ROCm for inference, we need to gate `torch.cuda.device_count()`: on HPU specifically, calling it will cause issues.

Use similar logic to what we already do in `init.py` to ensure CUDA is available and HPU/ROCm are not.

**Checklist:**

- [x] **Commit Message Formatting**: Commit titles and messages follow guidelines in the
  [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary).
- [ ] [Changelog](https://github.com/instructlab/instructlab/blob/main/CHANGELOG.md) updated with breaking and/or notable changes for the next minor release.
- [ ] Documentation has been updated, if necessary.
- [ ] Unit tests have been added, if necessary.
- [ ] Functional tests have been added, if necessary.
- [ ] E2E Workflow tests have been added, if necessary.
<hr>This is an automatic backport of pull request #3358 done by [Mergify](https://mergify.com).


Approved-by: booxter

Approved-by: cdoern