
Conversation

qiangwei1983

What type of PR is this?

/kind feature

What this PR does / why we need it:
This PR improves support for Iluvatar GPUs, including device plugin (gpu-manager) adaptation, scheduling policy compatibility, and resource isolation and sharing capabilities.

Major changes include:

  • Added support for Iluvatar's new gpu-manager version
  • Support for Iluvatar multi-GPU models
  • Updated documentation and sample configurations

Which issue(s) this PR fixes:

Fixes #
No existing issue

Special notes for your reviewer:

  • This feature is compatible with Iluvatar CoreX GPUs and has been preliminarily verified to function correctly in a test cluster.
  • Official or community users of Iluvatar are welcome to participate in testing and provide feedback.
  • Currently, it supports basic device discovery, resource reporting, Pod resource requests, and scheduling; support for advanced features such as dynamic sharing will be added later.
  • Please pay special attention to changes in the following modules:
      • pkg/device/iluvatar/
      • Adjustments to device registration and scheduling parameters in cmd/
      • Updates to sample configurations and Helm values

Does this PR introduce a user-facing change?:
Yes, this PR introduces new user-visible features:

  • Added support for device management of different Iluvatar GPU models. Users can now schedule Iluvatar GPU resources based on GPU resource name using HAMi in Kubernetes clusters.
  • Users can specify iluvatar.ai/<card-type> resource names (specific names vary by implementation) in Pod resource requests and use them with gpu-manager.
  • Related documentation and usage examples have been updated; see /examples/iluvatar/ and the updated README for details.
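As an illustrative sketch only (the pod name and container image are placeholders, and the exact resource names depend on the deployed HAMi configuration and the examples under /examples/iluvatar/), a Pod requesting a slice of a BI-V100 might look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bi-v100-demo            # hypothetical pod name
spec:
  containers:
  - name: demo
    image: ubuntu:22.04         # placeholder; use an Iluvatar SDK image in practice
    command: ["sleep", "infinity"]
    resources:
      limits:
        iluvatar.ai/BI-V100-vgpu: 1    # number of (virtual) BI-V100 devices
        iluvatar.ai/BI-V100.vCore: 50  # core share, per this PR's examples
        iluvatar.ai/BI-V100.vMem: 16   # memory share; unit depends on configuration
```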

@hami-robot
Contributor

hami-robot bot commented Oct 11, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: qiangwei1983
Once this PR has been reviewed and has the lgtm label, please assign wawa0210 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot
Contributor

hami-robot bot commented Oct 11, 2025

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

  • 405256d Improved support for Iulvatar GPUs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hami-robot
Contributor

hami-robot bot commented Oct 11, 2025

Welcome @qiangwei1983! It looks like this is your first PR to Project-HAMi/HAMi 🎉

@github-actions github-actions bot removed the kind/feature new function label Oct 11, 2025
@hami-robot hami-robot bot added the size/XL label Oct 11, 2025

Summary of Changes

Hello @qiangwei1983, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive improvements for Iluvatar GPU integration within the HAMi framework. The core objective is to provide more robust and flexible support for various Iluvatar GPU models, moving from a generic configuration to a model-specific approach. This enhancement allows users to precisely allocate and manage Iluvatar GPU resources in Kubernetes clusters, improving both scheduling policy compatibility and resource isolation capabilities. The changes span core device logic, Helm chart configurations, and user-facing documentation and examples, ensuring a seamless experience for Iluvatar GPU users.

Highlights

  • Enhanced Iluvatar GPU Support: This pull request significantly improves support for Iluvatar GPUs by adapting to new gpu-manager versions and enabling compatibility with multiple Iluvatar GPU models. Users can now specify GPU resources based on specific card types.
  • Dynamic Resource Configuration: The system now dynamically configures Iluvatar GPU resources, including count, memory, and core names, based on the specific chip model. This replaces a previous static configuration approach, allowing for more flexible resource management.
  • Updated Helm Chart and Documentation: The Helm chart has been updated to include new configuration parameters for enabling Iluvatar device support and defining custom resources for various models. Corresponding documentation (both English and Chinese) has been revised to guide users on installation, resource requests, and device UUID annotations for the new multi-model support.
  • New Example Configurations: Several new example YAML files have been added for different Iluvatar GPU models (BI-V100, BI-V150, MR-V100) and usage scenarios, including default single-container, multi-container, and multi-device requests, demonstrating how to utilize the new resource naming conventions.
  • Refactored Device Initialization Logic: The internal device initialization logic for Iluvatar GPUs has been refactored to support a slice of configurations, allowing the system to handle multiple distinct Iluvatar GPU types simultaneously. Node device information is now decoded from annotations rather than inferred from node capacity.
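The annotation-based discovery described in the last highlight can be sketched as follows. This is a minimal illustration, not HAMi's actual code: the `DeviceInfo` fields, the JSON encoding, and the function name are all assumptions made for the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeviceInfo is a simplified stand-in for HAMi's device info struct;
// the real type and annotation encoding in pkg/device/iluvatar differ.
type DeviceInfo struct {
	ID      string `json:"id"`
	Count   int32  `json:"count"`
	Devmem  int32  `json:"devmem"`
	Devcore int32  `json:"devcore"`
	Type    string `json:"type"`
}

// decodeNodeDevices parses the device list carried in a node annotation,
// returning an error when the annotation is malformed or lists no devices.
func decodeNodeDevices(encoded string) ([]*DeviceInfo, error) {
	var devices []*DeviceInfo
	if err := json.Unmarshal([]byte(encoded), &devices); err != nil {
		return nil, fmt.Errorf("malformed device annotation: %w", err)
	}
	if len(devices) == 0 {
		return nil, fmt.Errorf("no gpus found in node annotation")
	}
	return devices, nil
}

func main() {
	encoded := `[{"id":"GPU-0","count":10,"devmem":32768,"devcore":100,"type":"BI-V100"}]`
	devs, err := decodeNodeDevices(encoded)
	if err != nil {
		panic(err)
	}
	fmt.Println(devs[0].Type, devs[0].Devmem) // BI-V100 32768
}
```

Decoding from an annotation, rather than inferring from node capacity, lets the scheduler see per-device attributes such as memory and core counts for each registered model.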

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly improves support for Iluvatar GPUs by refactoring the device logic to handle multiple GPU models through configuration, rather than hardcoded values. This is a great enhancement for maintainability and extensibility. The associated changes in Helm charts, documentation, and examples are mostly aligned with this new approach.

I have identified a few issues, including some invalid YAML in the documentation and example files which would prevent users from successfully using them. Additionally, the unit tests for the new Iluvatar configuration logic appear to be broken or incomplete, which could allow regressions in the future. I've also noted some minor areas for improvement in configuration files and documentation for better clarity and correctness. Please see my detailed comments for suggestions.

Comment on lines +86 to 87:

```yaml
image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
- bash
```

high

The YAML example for running an Iluvatar job is syntactically incorrect. The `command` key is missing, and `- bash` is incorrectly indented. This will cause a parsing error for users trying to follow the example.

Suggested change:

```yaml
image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
command:
- bash
```

Comment on lines +87 to 88:

```yaml
image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
- bash
```

high

The YAML example for running an Iluvatar job is syntactically incorrect. The `command` key is missing, which will lead to a parsing error for users copying this example.

Suggested change:

```yaml
image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
command:
- bash
```

```yaml
containers:
- name: BI-V150-poddemo
  image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
  - bash
```

high

This YAML file is invalid because the `command` key is missing from the container definition. The line `- bash` should be placed under a `command:` key:

```yaml
    command:
    - bash
```

```go
nvidia.NvidiaGPUDevice:       nvidia.NvidiaGPUCommonWord,
cambricon.CambriconMLUDevice: cambricon.CambriconMLUCommonWord,
hygon.HygonDCUDevice:         hygon.HygonDCUCommonWord,
iluvatar.IluvatarGPUDevice:   iluvatar.IluvatarGPUCommonWord,
```

high

By removing the Iluvatar device from the expectedDevices map in setupTest, the test Test_InitDevicesWithConfig_Success no longer verifies the initialization of Iluvatar devices. Furthermore, other tests for Iluvatar configuration in this file, like Test_LoadConfig, appear to be broken due to the refactoring from a single iluvatar config object to a slice of iluvatars. The test configuration and assertions should be updated to reflect the new data structure and ensure the new logic is properly tested.

Comment on lines +395 to +406:

```yaml
- iluvatar.ai/BI-V100.vCore
- iluvatar.ai/BI-V100.vMem
- iluvatar.ai/BI-V100-vgpu
- iluvatar.ai/BI-V150.vCore
- iluvatar.ai/BI-V150.vMem
- iluvatar.ai/BI-V150-vgpu
- iluvatar.ai/MR-V100.vCore
- iluvatar.ai/MR-V100.vMem
- iluvatar.ai/MR-V100-vgpu
- iluvatar.ai/MR-V50-vgpu
- iluvatar.ai/MR-V50.vMem
- iluvatar.ai/MR-V50.vCore
```

medium

For better readability and maintainability, it is good practice to keep lists like this sorted. Please consider sorting the `customresources` for Iluvatar GPUs alphabetically by resource name, grouped by GPU model:

```yaml
      - iluvatar.ai/BI-V100-vgpu
      - iluvatar.ai/BI-V100.vCore
      - iluvatar.ai/BI-V100.vMem
      - iluvatar.ai/BI-V150-vgpu
      - iluvatar.ai/BI-V150.vCore
      - iluvatar.ai/BI-V150.vMem
      - iluvatar.ai/MR-V100-vgpu
      - iluvatar.ai/MR-V100.vCore
      - iluvatar.ai/MR-V100.vMem
      - iluvatar.ai/MR-V50-vgpu
      - iluvatar.ai/MR-V50.vCore
      - iluvatar.ai/MR-V50.vMem
```

```diff
 3. The `iluvatar.ai/<card-type>.vMem` resource is only effective when `iluvatar.ai/<card-type>-vgpu=1`.

-4. Multi-device requests (`iluvatar.ai/vgpu > 1`) do not support vGPU mode.
+4. Multi-device requests (`iluvatar.ai/<card-type>-vgpu=1 > 1`) do not support vGPU mode.
```

medium

There appears to be a typo in the note about multi-device requests. The expression `iluvatar.ai/<card-type>-vgpu=1 > 1` is confusing; it should likely be `iluvatar.ai/<card-type>-vgpu > 1` to indicate that the number of requested GPUs is greater than one.

Suggested change:

```diff
-4. Multi-device requests (`iluvatar.ai/<card-type>-vgpu=1 > 1`) do not support vGPU mode.
+4. Multi-device requests (`iluvatar.ai/<card-type>-vgpu > 1`) do not support vGPU mode.
```

Comment on lines +105 to 108:

```go
if len(nodedevices) == 0 {
	klog.InfoS("no gpu device found", "node", n.Name, "device annotation", devEncoded)
	return []*device.DeviceInfo{}, errors.New("no gpu found on node")
}
```

medium

When no GPU devices are found on a node, the code logs at Info level but returns an error. This is inconsistent: if an empty device list is considered an error, it should be logged at a higher severity. Additionally, the error message is generic; it would be more helpful to include the device type and node name.

Suggested change (klog/v2 has no `WarningS`, so `ErrorS` is used here):

```go
if len(nodedevices) == 0 {
	klog.ErrorS(nil, "no gpu device found", "node", n.Name, "device annotation", devEncoded)
	return []*device.DeviceInfo{}, fmt.Errorf("no %s gpus found on node %s", dev.config.ChipName, n.Name)
}
```

```diff
 CambriconConfig cambricon.CambriconConfig `yaml:"cambricon"`
 MthreadsConfig  mthreads.MthreadsConfig   `yaml:"mthreads"`
-IluvatarConfig  iluvatar.IluvatarConfig   `yaml:"iluvatar"`
+IluvatarConfig  []iluvatar.IluvatarConfig `yaml:"iluvatars"`
```

medium

With the move to support multiple Iluvatar device types via the iluvatars configuration, the old singular iluvatar configuration is now obsolete. To avoid confusion for users and developers, the old iluvatar section should be removed from the default configuration string configMapdata (lines 292-295).
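For illustration only, the new plural configuration might be expressed along these lines. The field names below are inferred from the struct tags in the diff above and the resource names used elsewhere in this PR; the actual schema may differ:

```yaml
iluvatars:
- chipName: BI-V100
  commonWord: BI-V100
  resourceCountName: iluvatar.ai/BI-V100-vgpu
  resourceCoreName: iluvatar.ai/BI-V100.vCore
  resourceMemoryName: iluvatar.ai/BI-V100.vMem
- chipName: MR-V100
  commonWord: MR-V100
  resourceCountName: iluvatar.ai/MR-V100-vgpu
  resourceCoreName: iluvatar.ai/MR-V100.vCore
  resourceMemoryName: iluvatar.ai/MR-V100.vMem
```

A list of per-model entries like this is what lets the device plugin register several distinct Iluvatar GPU types on the same cluster.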

@qiangwei1983 qiangwei1983 changed the title Improved support for Iulvatar GPUs Improved support for iluvatar GPUs Oct 11, 2025