
Conversation

@hzdzkjdxyjs commented Nov 28, 2025

  • I have read the CLA Document and I sign the CLA

  • This PR adds initial Intel XPU single-device training support to Ultralytics.

  • It is a minimal, safe, backward-compatible implementation that activates the XPU path only when the installed PyTorch build supports XPU.

🚀 Motivation

  • PyTorch ≥ 2.8.0 now provides official Intel XPU support.
  • Many users on Intel Arc / B60 / Flex want to train YOLO models without CUDA GPUs.

Basic Configuration

  • Operating System: Ubuntu 25.04
  • Kernel: 6.14.0-1006-intel
  • GPU: Blue Ocean or MaxSun B60 Pro.
    In fact, the hardware vendor does not matter, because PyTorch is not tightly bound to a specific GPU model. The only real complexity is driver installation; as long as the drivers install successfully, everything should work normally.
  • Driver + Installation Guide: https://github.com/intel/llm-scaler/blob/main/vllm/README.md/#1-getting-started-and-usage
  • Driver Version: multi-arc-bmg-offline-installer-25.38.4.1

Environment Installation

cd ultralytics
pip install -e .
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/xpu
# The following two are optional because multi-XPU training is not supported yet:
# pip install intel-extension-for-pytorch==2.8.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
# pip install oneccl_bind_pt==2.8.0+xpu --index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
  • Verify successful installation:
(B60) root@b60:~/ultralytics# python
Python 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.version.xpu)
20250101
>>> print(torch.xpu.is_available())
True
>>> print(torch.xpu.get_device_name(0))
Intel(R) Graphics [0xe211]

Avoiding runtime errors in environments where XPU is not supported

  • The goal is to ensure that all XPU-specific logic is executed only when the installed PyTorch build actually supports Intel XPU.
  • If the user installs a PyTorch version without XPU support, the code safely skips the XPU branch and falls back to the existing CUDA/CPU logic without raising errors.
  • This improves compatibility across environments and prevents runtime failures on systems where torch.xpu is not compiled.
if hasattr(torch, "xpu") and torch.xpu.is_available():
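
For instance, this guard could be factored into a small reusable helper; a minimal sketch (the helper name is_xpu_available is hypothetical, not part of this PR):

import torch

def is_xpu_available() -> bool:
    # Safe on any PyTorch build: hasattr() avoids AttributeError on builds without torch.xpu
    return hasattr(torch, "xpu") and torch.xpu.is_available()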

  • ultralytics/ultralytics/utils/torch_utils.py: get_gpu_info()
  • Purpose: Add XPU information parsing.
@functools.lru_cache
def get_gpu_info(index):
    """Return a string with system GPU information, i.e. 'Tesla T4, 15102MiB'."""
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        properties = torch.xpu.get_device_properties(index)
        return f"{properties.name}, {properties.total_memory / (1 << 20):.0f}MiB"
    properties = torch.cuda.get_device_properties(index)
    return f"{properties.name}, {properties.total_memory / (1 << 20):.0f}MiB"
  • Test Case: Same as training code
  • Result: After modification, training outputs XPU device information:
Ultralytics 8.3.231 🚀 Python-3.10.19 torch-2.8.0+xpu XPU:0 (Intel(R) Graphics [0xe211])

  • ultralytics/ultralytics/utils/torch_utils.py: time_sync()
  • Purpose: XPU synchronization
def time_sync():
    """Return PyTorch-accurate time."""
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

  • ultralytics/ultralytics/utils/torch_utils.py: select_device()
  • Purpose: XPU single-device selection support
    elif hasattr(torch, "xpu") and torch.xpu.is_available():
        if device.startswith("xpu"):
            index = int(device.split(":")[1]) if ":" in device else 0
        elif device in {"", "0"}:
            index = 0
        else:
            index = None
        if index is not None:
            if verbose:
                info = get_gpu_info(index)  
                s += f"XPU:{index} ({info})\n"  
                LOGGER.info(s if newline else s.rstrip())
            return torch.device("xpu", index)
from ultralytics import YOLO
model = YOLO("yolo11n.yaml")
model.train(
        data="coco128.yaml",
        epochs=50,
        imgsz=256,
        #device="xpu:0"
        #device="xpu:1"
        device="xpu")

  • ultralytics/ultralytics/utils/checks.py: check_amp()
  • Purpose: Disable AMP on XPU
    if hasattr(torch, "xpu") and torch.xpu.is_available() and device.type == "xpu":
        LOGGER.warning(f"{prefix}Intel XPU detected. AMP is disabled (not supported on XPU).")
        return False
  • Result:
WARNING ⚠️ AMP: Intel XPU detected. AMP is disabled (not supported on XPU).

  • ultralytics/engine/trainer.py: _get_memory()
  • Purpose: Use the correct memory query on XPU
    def _get_memory(self, fraction=False):
        """Get accelerator memory utilization in GB or as a fraction of total memory."""
        memory, total = 0, 0
        if self.device.type == "mps":
            memory = torch.mps.driver_allocated_memory()
            if fraction:
                return __import__("psutil").virtual_memory().percent / 100
        elif self.device.type != "cpu" and hasattr(torch, "xpu") and torch.xpu.is_available():
            memory = torch.xpu.memory_allocated(self.device)
            total = torch.xpu.get_device_properties(self.device).total_memory
            return ((memory / total) if total > 0 else 0) if fraction else (memory / 2**30)
        elif self.device.type != "cpu":
            memory = torch.cuda.memory_reserved()
            if fraction:
                total = torch.cuda.get_device_properties(self.device).total_memory
        return ((memory / total) if total > 0 else 0) if fraction else (memory / 2**30)
(B60) root@b60:~/ultralytics# python
Python 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.xpu.get_device_properties(0).total_memory
24385683456
>>> x = torch.randn((1024, 1024, 256), device="xpu")
>>> torch.xpu.memory_allocated(0)
1073741824

Warning

  • During actual execution, when I run yolov8x with an input size of 640, the reported memory usage is extremely low, even though I believe there is no problem with my code.
  • So I suspect the issue comes not from my implementation but from upstream logic or the runtime environment.
Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
1/50      1.18G      3.651       5.77        4.3        162        640: 100% ━━━━━━━━━━━━ 8/8 3.4s/it 27.1s
Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 75% ━━━━━━━━━─── 3/4 4.0s/it 6.4s<4.0s
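
One detail worth checking here: the XPU branch of _get_memory reports torch.xpu.memory_allocated, while the CUDA branch reports torch.cuda.memory_reserved, and live allocations are typically much smaller than the allocator's reserved cache, which could explain the low GPU_mem readout. A small diagnostic sketch (assuming torch.xpu.memory_reserved exists in this PyTorch 2.8 XPU build and mirrors the torch.cuda API):

import torch

def xpu_mem_report(device: str = "xpu:0") -> None:
    # Compare live tensor allocations with the caching allocator's reservation
    allocated = torch.xpu.memory_allocated(device) / 2**30
    reserved = torch.xpu.memory_reserved(device) / 2**30  # assumption: same semantics as torch.cuda.memory_reserved
    print(f"allocated={allocated:.2f}G reserved={reserved:.2f}G")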

  • ultralytics/engine/trainer.py: _clear_memory()
  • Purpose: Support clearing accelerator memory on XPU
    def _clear_memory(self, threshold: float | None = None):
        """Clear accelerator memory by calling garbage collector and emptying cache."""
        if threshold:
            assert 0 <= threshold <= 1, "Threshold must be between 0 and 1."
            if self._get_memory(fraction=True) <= threshold:
                return
        gc.collect()
        if self.device.type == "mps":
            torch.mps.empty_cache()
        elif self.device.type == "cpu":
            return
        elif hasattr(torch, "xpu") and torch.xpu.is_available():
            torch.xpu.empty_cache()
        else:
            torch.cuda.empty_cache()
(B60) root@b60:~/ultralytics# python
Python 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.xpu.empty_cache()
>>> 

Test case

/root/anaconda3/envs/B60/bin/python -m pytest -s -q test.py
import pytest
import torch
from ultralytics import YOLO

pytestmark = pytest.mark.skipif(
    not hasattr(torch, "xpu") or not torch.xpu.is_available(),
    reason="XPU not available",
)

def test_yolo_xpu_forward():
    model = YOLO("yolo11n.pt") # 填入本地的模型
    model.to("xpu")
    x = torch.rand(1, 3, 64, 64, device="xpu")
    y = model.model(x)
    assert y is not None
    print("\n[XPU Test] YOLO XPU forward passed successfully ✔")
  • Test case results
(B60) root@b60:~/ultralytics# /root/anaconda3/envs/B60/bin/python -m pytest -s -q test.py

[XPU Test] YOLO XPU forward passed successfully ✔
.
=============================================================== slowest 30 durations ================================================================
1.20s call     test.py::test_yolo_xpu_forward

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
1 passed in 28.89s

  • XPU support testing
  • Create a training file in the current directory and modify the device parameter to:
    • xpu
    • xpu:0
    • xpu:1
from ultralytics import YOLO
model = YOLO("yolo10n.yaml")
model.train(
        data="coco128.yaml",
        epochs=50,
        imgsz=256,
        device="xpu:0")
  • For long-duration stability and stress testing, I increased the training schedule to 50 epochs.

  • Unfortunately, when training entirely from scratch using the YAML configuration, the model performance is not ideal.

  • That said, I believe the ecosystem should not remain limited to a single GPU vendor (NVIDIA). Therefore, our priority should be to complete the framework-level adaptation first.

  • Operator-level optimization can come afterward — and at that stage, we will need stronger support and collaboration from the Intel team.

  • When training using pretrained weights only, the results are noticeably better.

The following results are obtained using partially pretrained weights.

epoch  time     train/box_loss  precision(B)  mAP50(B)  val/box_loss
1      16.677   1.57275         0.61519       0.44667   1.1658
2      19.09    1.55932         0.58063       0.44085   1.17004
3      21.5743  1.58362         0.57267       0.44635   1.17207
4      23.9711  1.59142         0.55874       0.44612   1.17544
5      26.2329  1.5043          0.60292       0.44695   1.1745
6      28.9009  1.46778         0.59566       0.45138   1.17976
7      31.3829  1.48921         0.6531        0.45758   1.18727
8      33.6152  1.45977         0.65791       0.47068   1.17514
9      36.1311  1.49054         0.63523       0.47322   1.17033
10     38.761   1.36505         0.69117       0.47324   1.1707
11     41.2902  1.35502         0.71326       0.48303   1.16648
12     43.8088  1.3514          0.66831       0.48799   1.16152
13     46.4099  1.332           0.68353       0.49374   1.14584
14     48.9732  1.37742         0.71426       0.49713   1.13077
15     51.5884  1.38346         0.7076        0.49745   1.13032
...    ...      ...             ...           ...       ...
41     119.006  1.12213         0.71739       0.60639   0.99312
42     121.317  1.15439         0.72032       0.60901   0.99179
43     123.502  1.09189         0.72232       0.61391   0.98833
44     126.063  1.11224         0.81704       0.61871   0.98884
45     128.791  1.18574         0.81506       0.61953   0.99165
46     131.331  1.07524         0.80147       0.62052   0.99334
47     133.926  1.0565          0.79649       0.62348   0.99335
48     136.49   1.09021         0.79817       0.6232    0.99135
49     138.96   1.09611         0.78176       0.6221    0.99239
50     141.386  1.08103         0.79039       0.61891   0.99291

⚠️ Why this PR supports single-XPU only

  • This limitation is intentional.

    • Ultralytics’ training pipeline currently assumes CUDA semantics:
      • the device list comes from CUDA_VISIBLE_DEVICES
      • DDP initialization infers world_size from CUDA-style strings ("0,1,2")
      • backend init tightly couples CUDA → NCCL
    • Multi-XPU requires:
      • backend abstraction
      • a device parser refactor
      • optional oneCCL initialization
      • a distributed launch redesign
  • To keep this PR minimal, safe and upstream-ready, only single-XPU support is implemented.

  • Multi-XPU can be added later, after structural refactoring; a rough sketch of what multi-XPU initialization might involve is shown below.
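
For reference, a sketch of what single-node multi-XPU process-group initialization might look like after such a refactor (hypothetical, not part of this PR; assumes a torchrun-style launcher and that the installed build exposes the xccl or ccl backend):

import os

import torch
import torch.distributed as dist

def init_xpu_ddp() -> None:
    # Hypothetical single-node multi-XPU init under torchrun (RANK/LOCAL_RANK/WORLD_SIZE set by the launcher)
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.xpu.set_device(local_rank)  # bind this process to one XPU
    backend = "xccl" if dist.is_xccl_available() else "ccl"  # ccl requires oneccl_bindings_for_pytorch
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)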

✅ Summary

This PR provides:

  • Full single-device Intel XPU support
  • Zero regression for CUDA, CPU, and MPS
  • Correct device selection, info, sync, and memory reporting
  • Clean, minimal, backwards-compatible patch
  • Verified by both forward and long-duration training tests

It significantly broadens Ultralytics’ ecosystem beyond CUDA-only hardware.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Adds initial Intel XPU support across device selection, memory management, timing, AMP checks, and testing for YOLO11.

📊 Key Changes

  • ➕ Introduces tests/xpu_test.py to validate YOLO11 forward pass on Intel XPU (model.to("xpu")).
  • 🧠 Extends Trainer._get_memory and _clear_memory to correctly report and clear memory on Intel XPU devices.
  • ⚙️ Updates check_amp to detect Intel XPU and explicitly disable AMP on XPU with a clear log warning.
  • 💻 Enhances get_gpu_info to return Intel XPU device name and memory when XPU is available.
  • 🎯 Updates select_device to recognize xpu targets, log XPU device info, and return a proper torch.device("xpu", index).
  • ⏱️ Modifies time_sync to synchronize Intel XPU before timing when available.

🎯 Purpose & Impact

  • ✅ Enables running YOLO11 models on Intel XPU devices with proper device selection and memory handling.
  • 🧪 Improves reliability by adding a dedicated XPU test to ensure forward passes work on Intel hardware.
  • 🔍 Provides clearer logging and behavior for AMP usage on XPU, preventing unsupported configurations.
  • 🚀 Broadens hardware support, making Ultralytics models more accessible to users with Intel XPU accelerators.

@UltralyticsAssistant added the enhancement (New feature or request) and python (Pull requests that update Python code) labels on Nov 28, 2025
@UltralyticsAssistant (Member) commented:
👋 Hello @hzdzkjdxyjs, thank you for submitting an ultralytics/ultralytics 🚀 PR! This is an automated review assistant, and an Ultralytics engineer will be along shortly to help further. To ensure a seamless integration of your work, please review the following checklist:

  • Define a Purpose: Clearly explain the purpose of your Intel XPU support and related changes in your PR description, and link to any relevant issues. Ensure your commit messages are clear, concise, and adhere to the project's conventions.
  • Synchronize with Source: Confirm your PR is synchronized with the ultralytics/ultralytics main branch. If it's behind, update it by clicking the Update branch button or by running git pull and git merge main locally.
  • Ensure CI Checks Pass: Verify all Ultralytics Continuous Integration (CI) checks are passing. If any checks fail (including new XPU tests), please address the issues.
  • Update Documentation: Update the relevant documentation for any new or modified features, especially around Intel XPU support and device selection (e.g., device=xpu usage).
  • Add Tests: If applicable, include or update tests to cover your changes, and confirm that all tests are passing locally and in CI. You’ve added an XPU test already—please ensure it behaves correctly in environments without XPU (e.g., via skips).
  • Sign the CLA: Please ensure you have signed our Contributor License Agreement if this is your first Ultralytics PR by writing I have read the CLA Document and I sign the CLA in a new message.
  • Minimize Changes: Limit your changes to the minimum necessary for your feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." — Bruce Lee

For more guidance, please refer to our Contributing Guide. Don't hesitate to leave a comment if you have any questions. Thank you for contributing to Ultralytics! 🚀

@UltralyticsAssistant (Member) left a comment

🔍 PR Review

Made with ❤️ by Ultralytics Actions

Please fix the XPU selection/logging path so CUDA requests are not intercepted, restore get_gpu_info() to report the correct backend, and ensure _clear_memory() always empties XPU caches when training on XPU devices.

💬 Posted 3 inline comments

mps = device in {"mps", "mps:0"} # Apple Metal Performance Shaders (MPS)
if cpu or mps:
os.environ["CUDA_VISIBLE_DEVICES"] = "" # force torch.cuda.is_available() = False
elif hasattr(torch, "xpu") and torch.xpu.is_available():


CRITICAL: This elif hasattr(torch, "xpu")... block runs any time XPU is available, even when the caller explicitly requested another backend such as device="cuda:0". For those requests index stays None, so the CUDA path below is never reached and the function falls through without returning the desired CUDA device (often defaulting to CPU). Please gate this branch so it only executes when the requested target is actually XPU (or no device was specified) and allow the existing CUDA logic to handle CUDA inputs.

@hzdzkjdxyjs (Author): done

@functools.lru_cache
def get_gpu_info(index):
"""Return a string with system GPU information, i.e. 'Tesla T4, 15102MiB'."""
if hasattr(torch, "xpu") and torch.xpu.is_available():


⚠️ HIGH: get_gpu_info() now unconditionally returns XPU properties whenever torch.xpu.is_available(), so CUDA calls (which still invoke this helper) will try to index XPU devices and either log the wrong hardware or raise when CUDA and XPU device counts differ. Only use the XPU path when the caller is querying an XPU device (or when CUDA is unavailable) and keep the CUDA branch as the default for CUDA indices.

@hzdzkjdxyjs (Author): done

torch.mps.empty_cache()
elif self.device.type == "cpu":
return
elif hasattr(torch, "xpu") and torch.xpu.is_available():


💡 MEDIUM: torch.xpu.empty_cache() is only executed when CUDA is completely unavailable. On systems that expose both CUDA and XPU, running on XPU will skip cache clearing and leak memory because the code falls into the CUDA branch instead. Tie this branch to self.device.type == "xpu" (similar to the MPS branch) so XPU memory is cleared irrespective of CUDA availability.

@hzdzkjdxyjs (Author): done

@hzdzkjdxyjs (Author) commented:

Regarding the CUDA/XPU conflict: I don't think this will be an issue, because you cannot install both the CUDA and XPU builds of torch in the same environment, so the code design does not need to account for a conflict between them.

@glenn-jocher (Member) commented:

Thanks for the very detailed PR and extra note on the CUDA/XPU interaction.

You’re right that with the current Intel wheels you typically don’t get CUDA and XPU in the same environment, but for long‑term robustness it would still be better if the XPU paths only triggered when the selected device is actually XPU, rather than just when torch.xpu.is_available(). For example, in _get_memory and _clear_memory we can key off self.device.type so a hypothetical future build with both backends can’t accidentally route CUDA runs through XPU helpers:

def _get_memory(self, fraction=False):
    memory, total = 0, 0
    if self.device.type == "mps":
        ...
    elif self.device.type == "xpu":
        memory = torch.xpu.memory_allocated(self.device)
        total = torch.xpu.get_device_properties(self.device).total_memory
        return (memory / total if total > 0 else 0) if fraction else (memory / 2**30)
    elif self.device.type != "cpu":
        memory = torch.cuda.memory_reserved()
        if fraction:
            total = torch.cuda.get_device_properties(self.device).total_memory
    return (memory / total if total > 0 else 0) if fraction else (memory / 2**30)

and similarly in _clear_memory only call torch.xpu.empty_cache() when self.device.type == "xpu". In select_device, keeping XPU mapping behind explicit device strings like xpu / xpu:0 and leaving bare numeric strings ("0", "0,1", etc.) for CUDA will also avoid surprises if a mixed backend ever appears.

If you can update those pieces along these lines, the rest of the changes look like a good, minimal first step for single‑XPU support and we can continue the detailed review in this PR.

@codecov bot commented Nov 29, 2025

Codecov Report

❌ Patch coverage is 25.80645% with 23 lines in your changes missing coverage. Please review.

Files with missing lines          Patch %  Lines
ultralytics/utils/torch_utils.py  20.00%   16 Missing ⚠️
ultralytics/engine/trainer.py     16.66%   5 Missing ⚠️
ultralytics/utils/checks.py       60.00%   2 Missing ⚠️


@hzdzkjdxyjs (Author) commented Nov 29, 2025

To better support the scenario you described, I made the following changes to make the framework more robust.

1. Added standalone support for XPU device information queries, separating it from get_gpu_info, and extended system information reporting to include XPU details.

  • Added XPU device information retrieval to collect_system_info() in ultralytics/ultralytics/utils/checks.py
  • Added a new get_xpu_info helper, called from collect_system_info() in ultralytics/ultralytics/utils/checks.py:
@functools.lru_cache
def get_xpu_info(index):
    """Return a string with system GPU information, i.e. 'Tesla T4, 15102MiB'."""
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        properties = torch.xpu.get_device_properties(index)
        return f"{properties.name}, {properties.total_memory / (1 << 20):.0f}MiB"
def collect_system_info():
    """Collect and print relevant system information including OS, Python, RAM, CPU, and CUDA.

    Returns:
        (dict): Dictionary containing system information.
    """
    import psutil  # scoped as slow import

    from ultralytics.utils import ENVIRONMENT  # scope to avoid circular import
    from ultralytics.utils.torch_utils import get_cpu_info, get_gpu_info, get_xpu_info

    gib = 1 << 30  # bytes per GiB
    cuda = torch.cuda.is_available()
    xpu = hasattr(torch, "xpu") and torch.xpu.is_available() 
    check_yolo()
    total, _used, free = shutil.disk_usage("/")

    info_dict = {
        "OS": platform.platform(),
        "Environment": ENVIRONMENT,
        "Python": PYTHON_VERSION,
        "Install": "git" if GIT.is_repo else "pip" if IS_PIP_PACKAGE else "other",
        "Path": str(ROOT),
        "RAM": f"{psutil.virtual_memory().total / gib:.2f} GB",
        "Disk": f"{(total - free) / gib:.1f}/{total / gib:.1f} GB",
        "CPU": get_cpu_info(),
        "CPU count": os.cpu_count(),
        "GPU": get_gpu_info(index=0) if cuda else None,
        "XPU": get_xpu_info() if xpu else None,
        "GPU count": torch.cuda.device_count() if cuda else None,
        "XPU count": torch.xpu.device_count() if xpu else None,
        "CUDA": torch.version.cuda if cuda else None,
    }
  • To verify that this API is present and operational:
(base) root@b60:~# conda activate B60
(B60) root@b60:~# python
Python 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.xpu.device_count()
2

2. Ensure that only explicit xpu device strings (e.g., xpu, xpu:0) enter this branch, preventing numeric CUDA-style indices such as "0" or "1" from being interpreted as XPU devices. The XPU device mapping is preserved strictly as xpu:index.

  • To verify that the model is actually running on the requested XPU device (e.g., xpu:1), I used:
watch -n 0.5 xpu-smi stats -d 1 -j
  • Training is running on xpu:1.
    elif device.startswith("xpu"):  # Intel XPU
        parts = device.split(":")
        index = int(parts[1]) if len(parts) > 1 else 0
        if verbose:
            info = get_xpu_info(index)
            s += f"XPU:{index} ({info})\n"
            LOGGER.info(s if newline else s.rstrip())
        return torch.device("xpu", index)

3. Device-type checks are now used when performing memory queries and related operations.

  • ultralytics/ultralytics/engine/trainer.py: _get_memory() / _clear_memory()
    def _get_memory(self, fraction=False):
        """Get accelerator memory utilization in GB or as a fraction of total memory."""
        memory, total = 0, 0
        if self.device.type == "mps":
            memory = torch.mps.driver_allocated_memory()
            if fraction:
                return __import__("psutil").virtual_memory().percent / 100
        elif self.device.type != "cpu" and self.device.type == "xpu":
            memory = torch.xpu.memory_allocated(self.device)
            total = torch.xpu.get_device_properties(self.device).total_memory
            return ((memory / total) if total > 0 else 0) if fraction else (memory / 2**30)
        elif self.device.type != "cpu":
            memory = torch.cuda.memory_reserved()
            if fraction:
                total = torch.cuda.get_device_properties(self.device).total_memory
        return ((memory / total) if total > 0 else 0) if fraction else (memory / 2**30)
    def _clear_memory(self, threshold: float | None = None):
        """Clear accelerator memory by calling garbage collector and emptying cache."""
        if threshold:
            assert 0 <= threshold <= 1, "Threshold must be between 0 and 1."
            if self._get_memory(fraction=True) <= threshold:
                return
        gc.collect()
        if self.device.type == "mps":
            torch.mps.empty_cache()
        elif self.device.type == "cpu":
            return
        elif self.device.type == "xpu":
            torch.xpu.empty_cache()
        else:
            torch.cuda.empty_cache()

4. Renamed the test to test_xpu to comply with the test script naming conventions.

  • ultralytics/tests/test_xpu.py

5. Lastly, I added an example in the Markdown file demonstrating how to train a model. You may publish it on the website to make it easier for users to follow.

  • Specifying device=xpu will automatically run on xpu:0.
yolo train model=yolo11n.pt data=coco128.yaml epochs=50 imgsz=256 device=xpu
  • If you want to run on a different device, you can explicitly specify:
device=xpu:1
device=xpu:2

@hzdzkjdxyjs (Author) commented:

If we want to go further and support multi-device training, here is what Intel's engineers told me: the following functions are the most important ones for multi-device support.

  • vllm/vllm/platforms/init.py xpu_platform_plugin()

  • vllm/vllm/utils/torch_utils.py

def xpu_platform_plugin() -> str | None:
    is_xpu = False
    logger.debug("Checking if XPU platform is available.")
    try:
        # installed IPEX if the machine has XPUs.
        import intel_extension_for_pytorch  # noqa: F401
        import torch

        if supports_xccl():
            dist_backend = "xccl"
        else:
            dist_backend = "ccl"
            import oneccl_bindings_for_pytorch  # noqa: F401

        if hasattr(torch, "xpu") and torch.xpu.is_available():
            is_xpu = True
            from vllm.platforms.xpu import XPUPlatform

            XPUPlatform.dist_backend = dist_backend
            logger.debug("Confirmed %s backend is available.", XPUPlatform.dist_backend)
            logger.debug("Confirmed XPU platform is available.")
    except Exception as e:
        logger.debug("XPU platform is not available because: %s", str(e))

    return "vllm.platforms.xpu.XPUPlatform" if is_xpu else None
def supports_xccl() -> bool:
    return (
        is_torch_equal_or_newer("2.8.0.dev") and torch.distributed.is_xccl_available()
    )
  • It uses two distributed backends, xccl and ccl.
  • The corresponding package must be imported for the backend to register, because the native backend is not shipped in the base package.
  • I am still researching the remaining details. This is the only multi-device case I can currently run end-to-end; the Intel team does not yet have corresponding training support, so we may be ahead of them here.
  • That is all the information I can provide for now.
  • One more important point: Intel has an environment variable that lets oneAPI control GPU visibility at the driver level, although it does not control torch itself (see the sketch after this list):
  • ZE_AFFINITY_MASK=index0,index1,...
  • https://github.com/hzdzkjdxyjs/How-to-use-llamafactory-at-B60Pro/blob/main/Muliti_GPU_Train.md
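
For example, a minimal sketch of restricting the devices torch can see (an assumption here is that the Level Zero runtime reads this variable during initialization, so it must be set before torch is imported):

import os

os.environ["ZE_AFFINITY_MASK"] = "0,1"  # assumption: expose only devices 0 and 1 to oneAPI/Level Zero

import torch  # must be imported after the mask is set

print(torch.xpu.device_count())  # expected to report 2 if the mask is honored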

My current thinking is to investigate why LLaMA-Factory can support multi-device training; that may be the key. However, LLaMA-Factory itself does not seem to implement multi-device training directly; the support comes from the underlying pip packages, such as transformers.

@glenn-jocher (Member) commented:

Nice follow‑up, the XPU‑specific helpers and stricter device handling make the design more robust, and the renamed test_xpu looks good for targeted coverage.

A couple of small tweaks will help clean things up before we go deeper into review:

  1. It would be better to keep get_xpu_info alongside get_gpu_info in ultralytics/utils/torch_utils.py and import it into collect_system_info(), and to give it a default index so your call in collect_system_info works without arguments, for example:
@functools.lru_cache
def get_xpu_info(index: int = 0):
    """Return a string with system XPU information, i.e. 'Intel(R) Graphics..., 15102MiB'."""
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        props = torch.xpu.get_device_properties(index)
        return f"{props.name}, {props.total_memory / (1 << 20):.0f}MiB"
    return None

and in collect_system_info() something like:

"XPU": get_xpu_info(0) if xpu else None,
"XPU count": torch.xpu.device_count() if xpu else None,
  2. In _get_memory, the condition elif self.device.type != "cpu" and self.device.type == "xpu": can just be elif self.device.type == "xpu": to avoid redundant checks and keep the flow clearly mps → xpu → cuda.

The multi‑XPU notes and the xpu_platform_plugin / xccl hints are very helpful context, but as you said they imply a larger refactor (backend abstraction, DDP launch, optional IPEX/oneCCL integration), so it would be best to keep this PR strictly single‑XPU and minimal, and treat multi‑XPU as a follow‑up design/PR once this path is stable.

The CLI device=xpu example is also useful; once XPU support is merged we can look at adding a short Intel XPU section to the training docs on Ultralytics documentation so users can discover it easily.

@hzdzkjdxyjs (Author) commented:

Thank you very much for your patient guidance. Writing the code this way does make the logic more rigorous, and I have made the changes as you requested. I hope to have the chance to work with your team on multi-device training support, and I will continue exploring related research.

@hzdzkjdxyjs (Author) commented:

First, I want to express my sincere gratitude — thank you for patiently guiding me through the revisions.
Also, this is my very first PR, so I’d like to ask: approximately how long does it usually take for a PR to be reviewed and merged?
I’m really happy to contribute to this open-source project and excited to see my work being helpful.

@hzdzkjdxyjs (Author) commented:

Since this XPU path is single-device only, there are two cases to handle. First, a user may request multi-device training; in this function I added multi-device detection via the "," character. Second, an out-of-range XPU index, for which the error is raised by torch itself.

  • ultralytics/ultralytics/utils/torch_utils.py: select_device()
    elif device.startswith("xpu"):  # Intel XPU
        index_str = device.split(":", 1)[1] if ":" in device else "0"
        if "," in index_str:
            msg = f"Invalid XPU 'device={device}' requested. Use a single index 0-15."
            LOGGER.warning(msg)
            raise ValueError(msg)
        index = int(index_str)
        if verbose:
            info = get_xpu_info(index)
            s += f"XPU:{index} ({info})\n"
            LOGGER.info(s if newline else s.rstrip())
        return torch.device("xpu", index)
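
A small test sketch for this rejection path could look like the following (hypothetical test, reusing the skip condition from the existing XPU test):

import pytest
import torch
from ultralytics.utils.torch_utils import select_device

@pytest.mark.skipif(not hasattr(torch, "xpu") or not torch.xpu.is_available(), reason="XPU not available")
def test_multi_xpu_rejected():
    # Multi-device XPU strings must be rejected with a clear error
    with pytest.raises(ValueError):
        select_device("xpu:0,1")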
  • Behavior of the original function when multi-device training is requested:
(B60) root@b60:~/ultralytics# yolo train model=yolo11n.pt data=coco128.yaml epochs=3 imgsz=256 device=xpu:0,1
Traceback (most recent call last):
  File "/root/anaconda3/envs/B60/bin/yolo", line 7, in <module>
    sys.exit(entrypoint())
  File "/root/ultralytics/ultralytics/cfg/__init__.py", line 985, in entrypoint
    getattr(model, mode)(**overrides)  # default args from model
  File "/root/ultralytics/ultralytics/engine/model.py", line 768, in train
    self.trainer = (trainer or self._smart_load("trainer"))(overrides=args, _callbacks=self.callbacks)
  File "/root/ultralytics/ultralytics/models/yolo/detect/train.py", line 63, in __init__
    super().__init__(cfg, overrides, _callbacks)
  File "/root/ultralytics/ultralytics/engine/trainer.py", line 126, in __init__
    self.device = select_device(self.args.device)
  File "/root/ultralytics/ultralytics/utils/torch_utils.py", line 198, in select_device
    index = int(index_str)
ValueError: invalid literal for int() with base 10: '0,1'
  • With the updated function, requesting more than one device now raises an error from this function itself:
(B60) root@b60:~/ultralytics# yolo train model=yolo11n.pt data=coco128.yaml epochs=3 imgsz=256 device=xpu:0,1
WARNING ⚠️ Invalid XPU 'device=xpu:0,1' requested. Only a single XPU device is supported.
Traceback (most recent call last):
  File "/root/anaconda3/envs/B60/bin/yolo", line 7, in <module>
    sys.exit(entrypoint())
  File "/root/ultralytics/ultralytics/cfg/__init__.py", line 985, in entrypoint
    getattr(model, mode)(**overrides)  # default args from model
  File "/root/ultralytics/ultralytics/engine/model.py", line 768, in train
    self.trainer = (trainer or self._smart_load("trainer"))(overrides=args, _callbacks=self.callbacks)
  File "/root/ultralytics/ultralytics/models/yolo/detect/train.py", line 63, in __init__
    super().__init__(cfg, overrides, _callbacks)
  File "/root/ultralytics/ultralytics/engine/trainer.py", line 126, in __init__
    self.device = select_device(self.args.device)
  File "/root/ultralytics/ultralytics/utils/torch_utils.py", line 198, in select_device
    raise ValueError(msg)
ValueError: Invalid XPU 'device=xpu:0,1' requested. Only a single XPU device is supported.
  • Intel's latest GUNNIR B60 Pro dual-chip cards allow up to 16 per server; both non-integer indices and indices beyond this limit raise errors.
  • An index beyond the maximum device count is reported by torch itself:
(B60) root@b60:~/ultralytics# yolo train model=yolo11n.pt data=coco128.yaml epochs=3 imgsz=256 device=xpu:16
Traceback (most recent call last):
  File "/root/anaconda3/envs/B60/bin/yolo", line 7, in <module>
    sys.exit(entrypoint())
  File "/root/ultralytics/ultralytics/cfg/__init__.py", line 985, in entrypoint
    getattr(model, mode)(**overrides)  # default args from model
  File "/root/ultralytics/ultralytics/engine/model.py", line 768, in train
    self.trainer = (trainer or self._smart_load("trainer"))(overrides=args, _callbacks=self.callbacks)
  File "/root/ultralytics/ultralytics/models/yolo/detect/train.py", line 63, in __init__
    super().__init__(cfg, overrides, _callbacks)
  File "/root/ultralytics/ultralytics/engine/trainer.py", line 126, in __init__
    self.device = select_device(self.args.device)
  File "/root/ultralytics/ultralytics/utils/torch_utils.py", line 206, in select_device
    info = get_xpu_info(index)
  File "/root/ultralytics/ultralytics/utils/torch_utils.py", line 137, in get_xpu_info
    properties = torch.xpu.get_device_properties(index)
  File "/root/anaconda3/envs/B60/lib/python3.10/site-packages/torch/xpu/__init__.py", line 262, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]  # noqa: F821
RuntimeError: The device index is out of range. It must be in [0, 2), but got 16.

@glenn-jocher (Member) commented:

Thanks again for all the careful updates—your latest changes around select_device (explicitly rejecting device=xpu:0,1 and keeping XPU strictly single‑device) plus the XPU‑specific helpers are well aligned with the scope we discussed and will make misconfigurations much clearer for users.

On review/merge timing, we can’t promise a specific timeframe, but this PR is now in a good shape for further maintainer review and we’ll continue any follow‑up discussion here. For multi‑XPU, treating it as a separate design/PR once this single‑XPU path is stable is exactly the right direction, and the notes you’ve gathered about Intel’s backend (xccl/ccl, affinity masks, etc.) will be very useful when we explore that; thanks for helping push the YOLO ecosystem onto more hardware for the whole community.

@hzdzkjdxyjs (Author) commented:

Boss, I've pushed the new multi-XPU training framework; it is now complete in #22850.

@glenn-jocher (Member) commented:

Nice, thanks for splitting multi‑XPU into a separate PR—keeping #22836 focused on single‑XPU support and handling multi‑XPU in #22850 is exactly what we need; we’ll continue any multi‑XPU discussion and review on that new PR.
