Improved support for iluvatar GPUs #1399

qiangwei1983 · 2025-10-11T03:43:18Z

What type of PR is this?

/kind feature

What this PR does / why we need it:
This PR improves support for Iluvatar GPUs, including device plugin (gpu-manager) adaptation, scheduling policy compatibility, and resource isolation and sharing capabilities.

Major changes include:

Added support for Iluvatar's new gpu-manager version
Support for Iluvatar multi-GPU models
Updated documentation and sample configurations
Which issue(s) this PR fixes:
Fixes #
No existing issue

Special notes for your reviewer:

This feature is compatible with Iluvatar CoreX GPUs and has been preliminarily verified to function correctly in a test cluster.
Official or community users of Iluvatar are welcome to participate in testing and provide feedback.
Currently, it supports basic device discovery, resource reporting, Pod resource request, and scheduling, and will be expanded to support advanced features such as dynamic sharing.
Please pay special attention to changes in the following modules:
pkg/device/iluvatar/
Adjustments to device registration and scheduling parameters in cmd/
Updates to sample configurations and Helm values

Does this PR introduce a user-facing change?:
Yes, this PR introduces new user-visible features:

Added support for device management of different Iluvatar GPU models. Users can now schedule Iluvatar GPU resources based on GPU resource name using HAMi in Kubernetes clusters.
Users can specify iluvatar.ai/<card-type>. or related resource types (specific names vary by implementation) in Pod resource requests and use them with gpu-manager.
Related documentation and usage examples have been updated; see /examples/iluvatar/ and the updated README for details.

hami-robot · 2025-10-11T03:43:21Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: qiangwei1983
Once this PR has been reviewed and has the lgtm label, please assign wawa0210 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hami-robot · 2025-10-11T03:43:21Z

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

405256d Improved support for Iulvatar GPUs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hami-robot · 2025-10-11T03:43:26Z

Welcome @qiangwei1983! It looks like this is your first PR to Project-HAMi/HAMi 🎉

gemini-code-assist · 2025-10-11T03:43:46Z

Summary of Changes

Hello @qiangwei1983, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive improvements for Iluvatar GPU integration within the HAMi framework. The core objective is to provide more robust and flexible support for various Iluvatar GPU models, moving from a generic configuration to a model-specific approach. This enhancement allows users to precisely allocate and manage Iluvatar GPU resources in Kubernetes clusters, improving both scheduling policy compatibility and resource isolation capabilities. The changes span across core device logic, Helm chart configurations, and user-facing documentation and examples, ensuring a seamless experience for Iluvatar GPU users.

Highlights

Enhanced Iluvatar GPU Support: This pull request significantly improves support for Iluvatar GPUs by adapting to new gpu-manager versions and enabling compatibility with multiple Iluvatar GPU models. Users can now specify GPU resources based on specific card types.
Dynamic Resource Configuration: The system now dynamically configures Iluvatar GPU resources, including count, memory, and core names, based on the specific chip model. This replaces a previous static configuration approach, allowing for more flexible resource management.
Updated Helm Chart and Documentation: The Helm chart has been updated to include new configuration parameters for enabling Iluvatar device support and defining custom resources for various models. Corresponding documentation (both English and Chinese) has been revised to guide users on installation, resource requests, and device UUID annotations for the new multi-model support.
New Example Configurations: Several new example YAML files have been added for different Iluvatar GPU models (BI-V100, BI-V150, MR-V100) and usage scenarios, including default single-container, multi-container, and multi-device requests, demonstrating how to utilize the new resource naming conventions.
Refactored Device Initialization Logic: The internal device initialization logic for Iluvatar GPUs has been refactored to support a slice of configurations, allowing the system to handle multiple distinct Iluvatar GPU types simultaneously. Node device information is now decoded from annotations rather than inferred from node capacity.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request significantly improves support for Iluvatar GPUs by refactoring the device logic to handle multiple GPU models through configuration, rather than hardcoded values. This is a great enhancement for maintainability and extensibility. The associated changes in Helm charts, documentation, and examples are mostly aligned with this new approach.

I have identified a few issues, including some invalid YAML in the documentation and example files which would prevent users from successfully using them. Additionally, the unit tests for the new Iluvatar configuration logic appear to be broken or incomplete, which could allow regressions in the future. I've also noted some minor areas for improvement in configuration files and documentation for better clarity and correctness. Please see my detailed comments for suggestions.

gemini-code-assist · 2025-10-11T03:45:37Z

docs/iluvatar-gpu-support.md

+    image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
    - bash


The YAML example for running an Iluvatar job is syntactically incorrect. The command key is missing, and - bash is incorrectly indented. This will cause a parsing error for users trying to follow the example.

Suggested change

image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0

- bash

image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0

command:

gemini-code-assist · 2025-10-11T03:45:37Z

docs/iluvatar-gpu-support_cn.md

+    image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
    - bash


The YAML example for running an Iluvatar job is syntactically incorrect. The command key is missing, which will lead to a parsing error for users copying this example.

Suggested change

image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0

- bash

image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0

command:

gemini-code-assist · 2025-10-11T03:45:37Z

examples/iluvatar/default_use_BI-V150.yaml

+  containers:
+  - name: BI-V150-poddemo
+    image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
+    - bash


This YAML file is invalid because the command key is missing for the container definition. The line - bash should be placed under a command: key.

command: - bash

gemini-code-assist · 2025-10-11T03:45:37Z

pkg/scheduler/config/config_test.go

 		nvidia.NvidiaGPUDevice:       nvidia.NvidiaGPUCommonWord,
 		cambricon.CambriconMLUDevice: cambricon.CambriconMLUCommonWord,
 		hygon.HygonDCUDevice:         hygon.HygonDCUCommonWord,
-		iluvatar.IluvatarGPUDevice:   iluvatar.IluvatarGPUCommonWord,


By removing the Iluvatar device from the expectedDevices map in setupTest, the test Test_InitDevicesWithConfig_Success no longer verifies the initialization of Iluvatar devices. Furthermore, other tests for Iluvatar configuration in this file, like Test_LoadConfig, appear to be broken due to the refactoring from a single iluvatar config object to a slice of iluvatars. The test configuration and assertions should be updated to reflect the new data structure and ensure the new logic is properly tested.

gemini-code-assist · 2025-10-11T03:45:38Z

charts/hami/values.yaml

+      - iluvatar.ai/BI-V100.vCore
+      - iluvatar.ai/BI-V100.vMem
+      - iluvatar.ai/BI-V100-vgpu
+      - iluvatar.ai/BI-V150.vCore
+      - iluvatar.ai/BI-V150.vMem
+      - iluvatar.ai/BI-V150-vgpu
+      - iluvatar.ai/MR-V100.vCore
+      - iluvatar.ai/MR-V100.vMem
+      - iluvatar.ai/MR-V100-vgpu
+      - iluvatar.ai/MR-V50-vgpu
+      - iluvatar.ai/MR-V50.vMem
+      - iluvatar.ai/MR-V50.vCore


For better readability and maintainability, it's a good practice to keep lists like this sorted. Please consider sorting the customresources for Iluvatar GPUs alphabetically by resource name, grouped by GPU model.

- iluvatar.ai/BI-V100-vgpu - iluvatar.ai/BI-V100.vCore - iluvatar.ai/BI-V100.vMem - iluvatar.ai/BI-V150-vgpu - iluvatar.ai/BI-V150.vCore - iluvatar.ai/BI-V150.vMem - iluvatar.ai/MR-V100-vgpu - iluvatar.ai/MR-V100.vCore - iluvatar.ai/MR-V100.vMem - iluvatar.ai/MR-V50-vgpu - iluvatar.ai/MR-V50.vCore - iluvatar.ai/MR-V50.vMem

gemini-code-assist · 2025-10-11T03:45:38Z

docs/iluvatar-gpu-support.md

+3. The `iluvatar.ai/<card-type>.vMem` resource is only effective when `iluvatar.ai/<card-type>-vgpu=1`.

-4. Multi-device requests (`iluvatar.ai/vgpu > 1`) do not support vGPU mode.
+4. Multi-device requests (`iluvatar.ai/<card-type>-vgpu=1 > 1`) do not support vGPU mode.


There appears to be a typo in the note about multi-device requests. The expression iluvatar.ai/<card-type>-vgpu=1 > 1 is confusing. It should likely be iluvatar.ai/<card-type>-vgpu > 1 to indicate that the number of GPUs is greater than one.

Suggested change

4. Multi-device requests (`iluvatar.ai/<card-type>-vgpu=1 > 1`) do not support vGPU mode.

4. Multi-device requests (`iluvatar.ai/<card-type>-vgpu > 1`) do not support vGPU mode.

gemini-code-assist · 2025-10-11T03:45:38Z

pkg/device/iluvatar/device.go

+	if len(nodedevices) == 0 {
+		klog.InfoS("no gpu device found", "node", n.Name, "device annotation", devEncoded)
+		return []*device.DeviceInfo{}, errors.New("no gpu found on node")
 	}


When no GPU devices are found on a node, the code logs this at an Info level but returns an error. This is inconsistent. If an empty device list is considered an error, it should be logged as a Warning or Error. Additionally, the error message is generic. It would be more helpful to include the device type and node name in the error message.

Suggested change

if len(nodedevices) == 0 {

klog.InfoS("no gpu device found", "node", n.Name, "device annotation", devEncoded)

return []*device.DeviceInfo{}, errors.New("no gpu found on node")

}

if len(nodedevices) == 0 {

klog.WarningS("no gpu device found", "node", n.Name, "device annotation", devEncoded)

return []*device.DeviceInfo{}, fmt.Errorf("no %s gpus found on node %s", dev.config.ChipName, n.Name)

}

gemini-code-assist · 2025-10-11T03:45:38Z

pkg/scheduler/config/config.go

 	CambriconConfig cambricon.CambriconConfig `yaml:"cambricon"`
 	MthreadsConfig  mthreads.MthreadsConfig   `yaml:"mthreads"`
-	IluvatarConfig  iluvatar.IluvatarConfig   `yaml:"iluvatar"`
+	IluvatarConfig  []iluvatar.IluvatarConfig `yaml:"iluvatars"`


With the move to support multiple Iluvatar device types via the iluvatars configuration, the old singular iluvatar configuration is now obsolete. To avoid confusion for users and developers, the old iluvatar section should be removed from the default configuration string configMapdata (lines 292-295).

qiangwei1983 added 2 commits October 11, 2025 11:01

Improved support for Iulvatar GPUs

405256d

Merge branch 'master' of https://github.com/qiangwei1983/HAMi

2ddbce8

hami-robot bot added kind/feature new function dco-signoff: no labels Oct 11, 2025

hami-robot bot requested review from chaunceyjiang and wawa0210 October 11, 2025 03:43

github-actions bot removed the kind/feature new function label Oct 11, 2025

hami-robot bot added the size/XL label Oct 11, 2025

gemini-code-assist bot reviewed Oct 11, 2025

View reviewed changes

qiangwei1983 changed the title ~~Improved support for Iulvatar GPUs~~ Improved support for iluvatar GPUs Oct 11, 2025

		image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
		- bash

	4. Multi-device requests (`iluvatar.ai/<card-type>-vgpu=1 > 1`) do not support vGPU mode.
	4. Multi-device requests (`iluvatar.ai/<card-type>-vgpu > 1`) do not support vGPU mode.

Improved support for iluvatar GPUs #1399

Are you sure you want to change the base?

Improved support for iluvatar GPUs #1399

Conversation

qiangwei1983 commented Oct 11, 2025

Uh oh!

hami-robot bot commented Oct 11, 2025

Uh oh!

hami-robot bot commented Oct 11, 2025

Uh oh!

hami-robot bot commented Oct 11, 2025

Uh oh!

gemini-code-assist bot commented Oct 11, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant