Improved support for iluvatar GPUs #1399
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: qiangwei1983. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Thanks for your pull request. Before we can look at it, you'll need to add a DCO signoff to your commits. 📝 Please follow the instructions in the contributing guide to update your commits with the DCO. Full details of the Developer Certificate of Origin can be found at developercertificate.org. The list of commits missing DCO signoff:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Welcome @qiangwei1983! It looks like this is your first PR to Project-HAMi/HAMi 🎉
Summary of Changes
Hello @qiangwei1983, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces comprehensive improvements for Iluvatar GPU integration within the HAMi framework. The core objective is to provide more robust and flexible support for various Iluvatar GPU models, moving from a generic configuration to a model-specific approach. This enhancement allows users to precisely allocate and manage Iluvatar GPU resources in Kubernetes clusters, improving both scheduling-policy compatibility and resource isolation capabilities. The changes span core device logic, Helm chart configurations, and user-facing documentation and examples, ensuring a seamless experience for Iluvatar GPU users.
Highlights
Code Review
This pull request significantly improves support for Iluvatar GPUs by refactoring the device logic to handle multiple GPU models through configuration, rather than hardcoded values. This is a great enhancement for maintainability and extensibility. The associated changes in Helm charts, documentation, and examples are mostly aligned with this new approach.
I have identified a few issues, including some invalid YAML in the documentation and example files which would prevent users from successfully using them. Additionally, the unit tests for the new Iluvatar configuration logic appear to be broken or incomplete, which could allow regressions in the future. I've also noted some minor areas for improvement in configuration files and documentation for better clarity and correctness. Please see my detailed comments for suggestions.
```yaml
image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
    - bash
```
The YAML example for running an Iluvatar job is syntactically incorrect. The `command` key is missing, and `- bash` is incorrectly indented. This will cause a parsing error for users trying to follow the example.
Suggested change:

```yaml
image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
command:
  - bash
```
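For reference, a minimal, syntactically valid version of such a container spec might look like the sketch below. The container name, command arguments, and resource names/values are illustrative assumptions (the resource names mirror the chart's `customresources` list elsewhere in this PR), not text taken from the repository:

```yaml
# Sketch only: names and values are assumptions for illustration.
containers:
  - name: bi-v150-poddemo
    image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
    command:                # the key the review notes is missing
      - bash
      - -c
      - sleep infinity
    resources:
      limits:
        iluvatar.ai/BI-V150-vgpu: 1     # assumed resource name
        iluvatar.ai/BI-V150.vMem: 4096  # assumed resource name; value illustrative
```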
```yaml
image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
    - bash
```
The YAML example for running an Iluvatar job is syntactically incorrect. The `command` key is missing, which will lead to a parsing error for users copying this example.
Suggested change:

```yaml
image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
command:
  - bash
```
```yaml
containers:
  - name: BI-V150-poddemo
    image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
    - bash
```
```go
nvidia.NvidiaGPUDevice:       nvidia.NvidiaGPUCommonWord,
cambricon.CambriconMLUDevice: cambricon.CambriconMLUCommonWord,
hygon.HygonDCUDevice:         hygon.HygonDCUCommonWord,
iluvatar.IluvatarGPUDevice:   iluvatar.IluvatarGPUCommonWord,
```
By removing the Iluvatar device from the `expectedDevices` map in `setupTest`, the test `Test_InitDevicesWithConfig_Success` no longer verifies the initialization of Iluvatar devices. Furthermore, other tests for Iluvatar configuration in this file, like `Test_LoadConfig`, appear to be broken due to the refactoring from a single `iluvatar` config object to a slice of `iluvatars`. The test configuration and assertions should be updated to reflect the new data structure and ensure the new logic is properly tested.
```yaml
- iluvatar.ai/BI-V100.vCore
- iluvatar.ai/BI-V100.vMem
- iluvatar.ai/BI-V100-vgpu
- iluvatar.ai/BI-V150.vCore
- iluvatar.ai/BI-V150.vMem
- iluvatar.ai/BI-V150-vgpu
- iluvatar.ai/MR-V100.vCore
- iluvatar.ai/MR-V100.vMem
- iluvatar.ai/MR-V100-vgpu
- iluvatar.ai/MR-V50-vgpu
- iluvatar.ai/MR-V50.vMem
- iluvatar.ai/MR-V50.vCore
```
For better readability and maintainability, it is good practice to keep lists like this sorted. Please consider sorting the `customresources` for Iluvatar GPUs alphabetically by resource name, grouped by GPU model.
Suggested change:

```yaml
- iluvatar.ai/BI-V100-vgpu
- iluvatar.ai/BI-V100.vCore
- iluvatar.ai/BI-V100.vMem
- iluvatar.ai/BI-V150-vgpu
- iluvatar.ai/BI-V150.vCore
- iluvatar.ai/BI-V150.vMem
- iluvatar.ai/MR-V100-vgpu
- iluvatar.ai/MR-V100.vCore
- iluvatar.ai/MR-V100.vMem
- iluvatar.ai/MR-V50-vgpu
- iluvatar.ai/MR-V50.vCore
- iluvatar.ai/MR-V50.vMem
```
```diff
 3. The `iluvatar.ai/<card-type>.vMem` resource is only effective when `iluvatar.ai/<card-type>-vgpu=1`.
-4. Multi-device requests (`iluvatar.ai/vgpu > 1`) do not support vGPU mode.
+4. Multi-device requests (`iluvatar.ai/<card-type>-vgpu=1 > 1`) do not support vGPU mode.
```
There appears to be a typo in the note about multi-device requests. The expression `iluvatar.ai/<card-type>-vgpu=1 > 1` is confusing. It should likely be `iluvatar.ai/<card-type>-vgpu > 1` to indicate that the number of GPUs requested is greater than one.
Suggested change:

```diff
-4. Multi-device requests (`iluvatar.ai/<card-type>-vgpu=1 > 1`) do not support vGPU mode.
+4. Multi-device requests (`iluvatar.ai/<card-type>-vgpu > 1`) do not support vGPU mode.
```
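To make notes 3 and 4 concrete, the two request shapes might look like the following sketch. Resource names follow the `customresources` list elsewhere in this PR; the memory value is illustrative, not from the source:

```yaml
# vGPU mode: vMem takes effect because exactly one vgpu is requested.
resources:
  limits:
    iluvatar.ai/BI-V100-vgpu: 1
    iluvatar.ai/BI-V100.vMem: 4096   # illustrative value

# Multi-device request: more than one vgpu does not support vGPU mode,
# so a per-card vMem limit would not apply alongside it.
# resources:
#   limits:
#     iluvatar.ai/BI-V100-vgpu: 2
```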
```go
if len(nodedevices) == 0 {
	klog.InfoS("no gpu device found", "node", n.Name, "device annotation", devEncoded)
	return []*device.DeviceInfo{}, errors.New("no gpu found on node")
}
```
When no GPU devices are found on a node, the code logs this at an `Info` level but returns an error. This is inconsistent: if an empty device list is considered an error, it should be logged as a `Warning` or `Error`. Additionally, the error message is generic; it would be more helpful to include the device type and node name in the error message.
Suggested change:

```go
if len(nodedevices) == 0 {
	klog.ErrorS(nil, "no gpu device found", "node", n.Name, "device annotation", devEncoded)
	return []*device.DeviceInfo{}, fmt.Errorf("no %s gpus found on node %s", dev.config.ChipName, n.Name)
}
```

(Note: `klog` has no `WarningS`; `klog.ErrorS` with a `nil` error is the structured-logging way to flag this, and `fmt` must be imported for `fmt.Errorf`.)
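The pattern this suggestion aims for can be sketched self-containedly with the standard library. Here `klog` and the HAMi types are replaced by stand-ins; `fetchDevices`, `chipName`, and `nodeName` are hypothetical names standing in for the real function, `dev.config.ChipName`, and `n.Name`:

```go
package main

import (
	"fmt"
	"log"
)

// fetchDevices sketches the suggested error handling: when the device list
// is empty, log at warning severity and return an error naming both the
// chip type and the node, so the failure is traceable from the message alone.
func fetchDevices(devices []string, chipName, nodeName string) ([]string, error) {
	if len(devices) == 0 {
		log.Printf("warning: no %s gpu device found on node %s", chipName, nodeName)
		return nil, fmt.Errorf("no %s gpus found on node %s", chipName, nodeName)
	}
	return devices, nil
}

func main() {
	if _, err := fetchDevices(nil, "BI-V150", "node-1"); err != nil {
		fmt.Println(err) // no BI-V150 gpus found on node node-1
	}
}
```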
```diff
 CambriconConfig cambricon.CambriconConfig   `yaml:"cambricon"`
 MthreadsConfig  mthreads.MthreadsConfig     `yaml:"mthreads"`
-IluvatarConfig  iluvatar.IluvatarConfig     `yaml:"iluvatar"`
+IluvatarConfig  []iluvatar.IluvatarConfig   `yaml:"iluvatars"`
```
With the move to support multiple Iluvatar device types via the `iluvatars` configuration, the old singular `iluvatar` configuration is now obsolete. To avoid confusion for users and developers, the old `iluvatar` section should be removed from the default configuration string `configMapdata` (lines 292-295).
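A hedged sketch of what the plural configuration might look like after this change. The field names below are assumptions inferred from the diff (`ChipName`) and the resource names elsewhere in this PR, not the actual HAMi schema:

```yaml
# Assumed layout: replaces the old singular `iluvatar:` section.
# Field names are illustrative guesses, not the real schema.
iluvatars:
  - chipName: BI-V100
    resourceCountName: iluvatar.ai/BI-V100-vgpu
    resourceMemoryName: iluvatar.ai/BI-V100.vMem
    resourceCoreName: iluvatar.ai/BI-V100.vCore
  - chipName: BI-V150
    resourceCountName: iluvatar.ai/BI-V150-vgpu
    resourceMemoryName: iluvatar.ai/BI-V150.vMem
    resourceCoreName: iluvatar.ai/BI-V150.vCore
```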
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR improves support for Iluvatar GPUs, including device plugin (gpu-manager) adaptation, scheduling policy compatibility, and resource isolation and sharing capabilities.
Major changes include:
Which issue(s) this PR fixes:
Fixes #
No existing issue
Special notes for your reviewer:
- `pkg/device/iluvatar/`
- `cmd/`
Does this PR introduce a user-facing change?:
Yes, this PR introduces new user-visible features: users can now specify `iluvatar.ai/<card-type>.*` or related resource types (specific names vary by implementation) in Pod resource requests and use them with gpu-manager. See `/examples/iluvatar/` and the updated README for details.