
Conversation

@hzdzkjdxyjs commented Nov 28, 2025

  • I have read the CLA Document and I sign the CLA

  • This PR adds initial Intel XPU single-device training support to Ultralytics.

  • It is a minimal, safe, backward-compatible implementation that activates the XPU path only when the installed PyTorch build supports XPU.

🚀 Motivation

  • PyTorch ≥ 2.8.0 now provides official Intel XPU support.
  • Many users on Intel Arc / B60 / Flex want to train YOLO models without CUDA GPUs.

Basic Configuration

  • Operating System: Ubuntu 25.04
  • Kernel: 6.14.0-1006-intel
  • GPU: Blue Ocean or MaxSun B60 Pro.
    In fact, the hardware vendor does not matter, because PyTorch is not tightly bound to a specific GPU model. The only real complexity is driver installation; as long as the drivers install successfully, everything should work normally.
  • Driver + Installation Guide: https://github.com/intel/llm-scaler/blob/main/vllm/README.md/#1-getting-started-and-usage
  • Driver Version: multi-arc-bmg-offline-installer-25.38.4.1

Environment Installation

cd ultralytics
pip install -e .
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/xpu
# The following two are optional because multi-XPU training is not supported yet:
# pip install intel-extension-for-pytorch==2.8.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
# pip install oneccl_bind_pt==2.8.0+xpu --index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
  • Verify successful installation:
(B60) root@b60:~/ultralytics# python
Python 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.version.xpu)
20250101
>>> print(torch.xpu.is_available())
True
>>> print(torch.xpu.get_device_name(0))
Intel(R) Graphics [0xe211]

Avoiding runtime errors in environments where XPU is not supported

  • The goal is to ensure that all XPU-specific logic is executed only when the installed PyTorch build actually supports Intel XPU.
  • If the user installs a PyTorch version without XPU support, the code safely skips the XPU branch and falls back to the existing CUDA/CPU logic without raising errors.
  • This improves compatibility across environments and prevents runtime failures on systems where torch.xpu is not compiled.
if hasattr(torch, "xpu") and torch.xpu.is_available():
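
For instance, this guard could be factored into a small reusable helper; a minimal sketch (the helper name is_xpu_available is hypothetical, not part of this PR):

import torch

def is_xpu_available() -> bool:
    # Safe on any PyTorch build: hasattr() avoids AttributeError on builds without torch.xpu
    return hasattr(torch, "xpu") and torch.xpu.is_available()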

  • ultralytics/ultralytics/utils/torch_utils.py: get_gpu_info()
  • Purpose: Add XPU information parsing.
@functools.lru_cache
def get_gpu_info(index):
    """Return a string with system GPU information, i.e. 'Tesla T4, 15102MiB'."""
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        properties = torch.xpu.get_device_properties(index)
        return f"{properties.name}, {properties.total_memory / (1 << 20):.0f}MiB"
    properties = torch.cuda.get_device_properties(index)
    return f"{properties.name}, {properties.total_memory / (1 << 20):.0f}MiB"
  • Test Case: Same as training code
  • Result: After modification, training outputs XPU device information:
Ultralytics 8.3.231 🚀 Python-3.10.19 torch-2.8.0+xpu XPU:0 (Intel(R) Graphics [0xe211])

  • ultralytics/ultralytics/utils/torch_utils.py: time_sync()
  • Purpose: XPU synchronization
def time_sync():
    """Return PyTorch-accurate time."""
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

  • ultralytics/ultralytics/utils/torch_utils.py: select_device()
  • Purpose: XPU single-device selection support
    elif hasattr(torch, "xpu") and torch.xpu.is_available():
        if device.startswith("xpu"):
            index = int(device.split(":")[1]) if ":" in device else 0
        elif device in {"", "0"}:
            index = 0
        else:
            index = None
        if index is not None:
            if verbose:
                info = get_gpu_info(index)  
                s += f"XPU:{index} ({info})\n"  
                LOGGER.info(s if newline else s.rstrip())
            return torch.device("xpu", index)
from ultralytics import YOLO
model = YOLO("yolo11n.yaml")
model.train(
        data="coco128.yaml",
        epochs=50,
        imgsz=256,
        #device="xpu:0"
        #device="xpu:1"
        device="xpu")

  • ultralytics/ultralytics/utils/checks.py: check_amp()
  • Purpose: Disable AMP on XPU
    if hasattr(torch, "xpu") and torch.xpu.is_available() and device.type == "xpu":
        LOGGER.warning(f"{prefix}Intel XPU detected. AMP is disabled (not supported on XPU).")
        return False
  • Result:
WARNING ⚠️ AMP: Intel XPU detected. AMP is disabled (not supported on XPU).

  • ultralytics/engine/trainer.py: _get_memory()
  • Purpose: Use the correct memory query on XPU
    def _get_memory(self, fraction=False):
        """Get accelerator memory utilization in GB or as a fraction of total memory."""
        memory, total = 0, 0
        if self.device.type == "mps":
            memory = torch.mps.driver_allocated_memory()
            if fraction:
                return __import__("psutil").virtual_memory().percent / 100
        elif self.device.type != "cpu" and hasattr(torch, "xpu") and torch.xpu.is_available():
            memory = torch.xpu.memory_allocated(self.device)
            total = torch.xpu.get_device_properties(self.device).total_memory
            return ((memory / total) if total > 0 else 0) if fraction else (memory / 2**30)
        elif self.device.type != "cpu":
            memory = torch.cuda.memory_reserved()
            if fraction:
                total = torch.cuda.get_device_properties(self.device).total_memory
        return ((memory / total) if total > 0 else 0) if fraction else (memory / 2**30)
(B60) root@b60:~/ultralytics# python
Python 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.xpu.get_device_properties(0).total_memory
24385683456
>>> x = torch.randn((1024, 1024, 256), device="xpu")
>>> torch.xpu.memory_allocated(0)
1073741824

Warning

  • During actual execution, when I run yolov8x with an input size of 640, the reported memory usage is extremely low, even though I believe there is no problem with my code.
  • So I suspect the issue comes not from my implementation but from upstream logic or the runtime environment.
Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
1/50      1.18G      3.651       5.77        4.3        162        640: 100% ━━━━━━━━━━━━ 8/8 3.4s/it 27.1s
Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 75% ━━━━━━━━━─── 3/4 4.0s/it 6.4s<4.0s
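
One detail worth checking here: the XPU branch of _get_memory reports torch.xpu.memory_allocated, while the CUDA branch reports torch.cuda.memory_reserved, and live allocations are typically much smaller than the allocator's reserved cache, which could explain the low GPU_mem readout. A small diagnostic sketch (assuming torch.xpu.memory_reserved exists in this PyTorch 2.8 XPU build and mirrors the torch.cuda API):

import torch

def xpu_mem_report(device: str = "xpu:0") -> None:
    # Compare live tensor allocations with the caching allocator's reservation
    allocated = torch.xpu.memory_allocated(device) / 2**30
    reserved = torch.xpu.memory_reserved(device) / 2**30  # assumption: same semantics as torch.cuda.memory_reserved
    print(f"allocated={allocated:.2f}G reserved={reserved:.2f}G")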

  • ultralytics/engine/trainer.py: _clear_memory()
  • Purpose: Support clearing accelerator memory on XPU
    def _clear_memory(self, threshold: float | None = None):
        """Clear accelerator memory by calling garbage collector and emptying cache."""
        if threshold:
            assert 0 <= threshold <= 1, "Threshold must be between 0 and 1."
            if self._get_memory(fraction=True) <= threshold:
                return
        gc.collect()
        if self.device.type == "mps":
            torch.mps.empty_cache()
        elif self.device.type == "cpu":
            return
        elif hasattr(torch, "xpu") and torch.xpu.is_available():
            torch.xpu.empty_cache()
        else:
            torch.cuda.empty_cache()
(B60) root@b60:~/ultralytics# python
Python 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.xpu.empty_cache()
>>> 

Test case

/root/anaconda3/envs/B60/bin/python -m pytest -s -q test.py
import pytest
import torch
from ultralytics import YOLO

pytestmark = pytest.mark.skipif(
    not hasattr(torch, "xpu") or not torch.xpu.is_available(),
    reason="XPU not available",
)

def test_yolo_xpu_forward():
    model = YOLO("yolo11n.pt") # 填入本地的模型
    model.to("xpu")
    x = torch.rand(1, 3, 64, 64, device="xpu")
    y = model.model(x)
    assert y is not None
    print("\n[XPU Test] YOLO XPU forward passed successfully ✔")
  • Test case results
(B60) root@b60:~/ultralytics# /root/anaconda3/envs/B60/bin/python -m pytest -s -q test.py

[XPU Test] YOLO XPU forward passed successfully ✔
.
=============================================================== slowest 30 durations ================================================================
1.20s call     test.py::test_yolo_xpu_forward

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
1 passed in 28.89s

  • XPU support testing
  • Create a training file in the current directory and modify the device parameter to:
    • xpu
    • xpu:0
    • xpu:1
from ultralytics import YOLO
model = YOLO("yolo10n.yaml")
model.train(
        data="coco128.yaml",
        epochs=50,
        imgsz=256,
        device="xpu:0")
  • For long-duration stability and stress testing, I increased the training schedule to 50 epochs.

  • Unfortunately, when training entirely from scratch using the YAML configuration, the model performance is not ideal.

  • That said, I believe the ecosystem should not remain limited to a single GPU vendor (NVIDIA). Therefore, our priority should be to complete the framework-level adaptation first.

  • Operator-level optimization can come afterward — and at that stage, we will need stronger support and collaboration from the Intel team.

  • When training using pretrained weights only, the results are noticeably better.

The following results are obtained using partially pretrained weights.

epoch  time     train/box_loss  precision(B)  mAP50(B)  val/box_loss
1      16.677   1.57275         0.61519       0.44667   1.1658
2      19.09    1.55932         0.58063       0.44085   1.17004
3      21.5743  1.58362         0.57267       0.44635   1.17207
4      23.9711  1.59142         0.55874       0.44612   1.17544
5      26.2329  1.5043          0.60292       0.44695   1.1745
6      28.9009  1.46778         0.59566       0.45138   1.17976
7      31.3829  1.48921         0.6531        0.45758   1.18727
8      33.6152  1.45977         0.65791       0.47068   1.17514
9      36.1311  1.49054         0.63523       0.47322   1.17033
10     38.761   1.36505         0.69117       0.47324   1.1707
11     41.2902  1.35502         0.71326       0.48303   1.16648
12     43.8088  1.3514          0.66831       0.48799   1.16152
13     46.4099  1.332           0.68353       0.49374   1.14584
14     48.9732  1.37742         0.71426       0.49713   1.13077
15     51.5884  1.38346         0.7076        0.49745   1.13032
...    ...      ...             ...           ...       ...
41     119.006  1.12213         0.71739       0.60639   0.99312
42     121.317  1.15439         0.72032       0.60901   0.99179
43     123.502  1.09189         0.72232       0.61391   0.98833
44     126.063  1.11224         0.81704       0.61871   0.98884
45     128.791  1.18574         0.81506       0.61953   0.99165
46     131.331  1.07524         0.80147       0.62052   0.99334
47     133.926  1.0565          0.79649       0.62348   0.99335
48     136.49   1.09021         0.79817       0.6232    0.99135
49     138.96   1.09611         0.78176       0.6221    0.99239
50     141.386  1.08103         0.79039       0.61891   0.99291

⚠️ Why this PR supports single-XPU only

  • This limitation is intentional.

    • Ultralytics’ training pipeline currently assumes CUDA semantics:
      • the device list comes from CUDA_VISIBLE_DEVICES
      • DDP initialization infers world_size from CUDA-style strings ("0,1,2")
      • backend init tightly couples CUDA → NCCL
    • Multi-XPU requires:
      • backend abstraction
      • a device parser refactor
      • optional oneCCL initialization
      • a distributed launch redesign
  • To keep this PR minimal, safe and upstream-ready, only single-XPU support is implemented.

  • Multi-XPU can be added later, after structural refactoring; a rough sketch of what multi-XPU initialization might involve is shown below.
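
For reference, a sketch of what single-node multi-XPU process-group initialization might look like after such a refactor (hypothetical, not part of this PR; assumes a torchrun-style launcher and that the installed build exposes the xccl or ccl backend):

import os

import torch
import torch.distributed as dist

def init_xpu_ddp() -> None:
    # Hypothetical single-node multi-XPU init under torchrun (RANK/LOCAL_RANK/WORLD_SIZE set by the launcher)
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.xpu.set_device(local_rank)  # bind this process to one XPU
    backend = "xccl" if dist.is_xccl_available() else "ccl"  # ccl requires oneccl_bindings_for_pytorch
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)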

✅ Summary

This PR provides:

  • Full single-device Intel XPU support
  • Zero regression for CUDA, CPU, and MPS
  • Correct device selection, info, sync, and memory reporting
  • Clean, minimal, backwards-compatible patch
  • Verified by both forward and long-duration training tests

It significantly broadens Ultralytics’ ecosystem beyond CUDA-only hardware.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Adds initial Intel XPU support across device selection, memory management, timing, AMP checks, and testing for YOLO11.

📊 Key Changes

  • ➕ Introduces tests/xpu_test.py to validate YOLO11 forward pass on Intel XPU (model.to("xpu")).
  • 🧠 Extends Trainer._get_memory and _clear_memory to correctly report and clear memory on Intel XPU devices.
  • ⚙️ Updates check_amp to detect Intel XPU and explicitly disable AMP on XPU with a clear log warning.
  • 💻 Enhances get_gpu_info to return Intel XPU device name and memory when XPU is available.
  • 🎯 Updates select_device to recognize xpu targets, log XPU device info, and return a proper torch.device("xpu", index).
  • ⏱️ Modifies time_sync to synchronize Intel XPU before timing when available.

🎯 Purpose & Impact

  • ✅ Enables running YOLO11 models on Intel XPU devices with proper device selection and memory handling.
  • 🧪 Improves reliability by adding a dedicated XPU test to ensure forward passes work on Intel hardware.
  • 🔍 Provides clearer logging and behavior for AMP usage on XPU, preventing unsupported configurations.
  • 🚀 Broadens hardware support, making Ultralytics models more accessible to users with Intel XPU accelerators.

@UltralyticsAssistant added the enhancement (New feature or request) and python (Pull requests that update Python code) labels on Nov 28, 2025
@UltralyticsAssistant (Member) commented:
👋 Hello @hzdzkjdxyjs, thank you for submitting an ultralytics/ultralytics 🚀 PR! This is an automated review assistant, and an Ultralytics engineer will be along shortly to help further. To ensure a seamless integration of your work, please review the following checklist:

  • Define a Purpose: Clearly explain the purpose of your Intel XPU support and related changes in your PR description, and link to any relevant issues. Ensure your commit messages are clear, concise, and adhere to the project's conventions.
  • Synchronize with Source: Confirm your PR is synchronized with the ultralytics/ultralytics main branch. If it's behind, update it by clicking the Update branch button or by running git pull and git merge main locally.
  • Ensure CI Checks Pass: Verify all Ultralytics Continuous Integration (CI) checks are passing. If any checks fail (including new XPU tests), please address the issues.
  • Update Documentation: Update the relevant documentation for any new or modified features, especially around Intel XPU support and device selection (e.g., device=xpu usage).
  • Add Tests: If applicable, include or update tests to cover your changes, and confirm that all tests are passing locally and in CI. You’ve added an XPU test already—please ensure it behaves correctly in environments without XPU (e.g., via skips).
  • Sign the CLA: Please ensure you have signed our Contributor License Agreement if this is your first Ultralytics PR by writing I have read the CLA Document and I sign the CLA in a new message.
  • Minimize Changes: Limit your changes to the minimum necessary for your feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." — Bruce Lee

For more guidance, please refer to our Contributing Guide. Don't hesitate to leave a comment if you have any questions. Thank you for contributing to Ultralytics! 🚀

@UltralyticsAssistant (Member) left a comment

🔍 PR Review

Made with ❤️ by Ultralytics Actions

Please fix the XPU selection/logging path so CUDA requests are not intercepted, restore get_gpu_info() to report the correct backend, and ensure _clear_memory() always empties XPU caches when training on XPU devices.

💬 Posted 3 inline comments

mps = device in {"mps", "mps:0"} # Apple Metal Performance Shaders (MPS)
if cpu or mps:
os.environ["CUDA_VISIBLE_DEVICES"] = "" # force torch.cuda.is_available() = False
elif hasattr(torch, "xpu") and torch.xpu.is_available():


CRITICAL: This elif hasattr(torch, "xpu")... block runs any time XPU is available, even when the caller explicitly requested another backend such as device="cuda:0". For those requests index stays None, so the CUDA path below is never reached and the function falls through without returning the desired CUDA device (often defaulting to CPU). Please gate this branch so it only executes when the requested target is actually XPU (or no device was specified) and allow the existing CUDA logic to handle CUDA inputs.

@hzdzkjdxyjs (Author): done

@functools.lru_cache
def get_gpu_info(index):
"""Return a string with system GPU information, i.e. 'Tesla T4, 15102MiB'."""
if hasattr(torch, "xpu") and torch.xpu.is_available():


⚠️ HIGH: get_gpu_info() now unconditionally returns XPU properties whenever torch.xpu.is_available(), so CUDA calls (which still invoke this helper) will try to index XPU devices and either log the wrong hardware or raise when CUDA and XPU device counts differ. Only use the XPU path when the caller is querying an XPU device (or when CUDA is unavailable) and keep the CUDA branch as the default for CUDA indices.

@hzdzkjdxyjs (Author): done

torch.mps.empty_cache()
elif self.device.type == "cpu":
return
elif hasattr(torch, "xpu") and torch.xpu.is_available():


💡 MEDIUM: torch.xpu.empty_cache() is only executed when CUDA is completely unavailable. On systems that expose both CUDA and XPU, running on XPU will skip cache clearing and leak memory because the code falls into the CUDA branch instead. Tie this branch to self.device.type == "xpu" (similar to the MPS branch) so XPU memory is cleared irrespective of CUDA availability.

@hzdzkjdxyjs (Author): done

@hzdzkjdxyjs (Author) commented:

Regarding the CUDA/XPU conflict: I don't think this will be an issue, because you cannot install both the CUDA and XPU builds of torch in the same environment, so the code design does not need to account for a conflict between them.

@glenn-jocher (Member) commented:

Thanks for the very detailed PR and extra note on the CUDA/XPU interaction.

You’re right that with the current Intel wheels you typically don’t get CUDA and XPU in the same environment, but for long‑term robustness it would still be better if the XPU paths only triggered when the selected device is actually XPU, rather than just when torch.xpu.is_available(). For example, in _get_memory and _clear_memory we can key off self.device.type so a hypothetical future build with both backends can’t accidentally route CUDA runs through XPU helpers:

def _get_memory(self, fraction=False):
    memory, total = 0, 0
    if self.device.type == "mps":
        ...
    elif self.device.type == "xpu":
        memory = torch.xpu.memory_allocated(self.device)
        total = torch.xpu.get_device_properties(self.device).total_memory
        return (memory / total if total > 0 else 0) if fraction else (memory / 2**30)
    elif self.device.type != "cpu":
        memory = torch.cuda.memory_reserved()
        if fraction:
            total = torch.cuda.get_device_properties(self.device).total_memory
    return (memory / total if total > 0 else 0) if fraction else (memory / 2**30)

and similarly in _clear_memory only call torch.xpu.empty_cache() when self.device.type == "xpu". In select_device, keeping XPU mapping behind explicit device strings like xpu / xpu:0 and leaving bare numeric strings ("0", "0,1", etc.) for CUDA will also avoid surprises if a mixed backend ever appears.

If you can update those pieces along these lines, the rest of the changes look like a good, minimal first step for single‑XPU support and we can continue the detailed review in this PR.

@codecov bot commented Nov 29, 2025

Codecov Report

❌ Patch coverage is 25.80645% with 23 lines in your changes missing coverage. Please review.

Files with missing lines          Patch %  Lines
ultralytics/utils/torch_utils.py  20.00%   16 Missing ⚠️
ultralytics/engine/trainer.py     16.66%   5 Missing ⚠️
ultralytics/utils/checks.py       60.00%   2 Missing ⚠️


@hzdzkjdxyjs (Author) commented Nov 29, 2025

To better support the scenario you described, I made the following changes to make the framework more robust.

1. Added standalone support for XPU device information queries, separating it from get_gpu_info, and extended system information reporting to include XPU details.

  • Added XPU device information retrieval to collect_system_info() in ultralytics/ultralytics/utils/checks.py
  • Added a new get_xpu_info helper, called from collect_system_info() in ultralytics/ultralytics/utils/checks.py:
@functools.lru_cache
def get_xpu_info(index):
    """Return a string with system GPU information, i.e. 'Tesla T4, 15102MiB'."""
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        properties = torch.xpu.get_device_properties(index)
        return f"{properties.name}, {properties.total_memory / (1 << 20):.0f}MiB"
def collect_system_info():
    """Collect and print relevant system information including OS, Python, RAM, CPU, and CUDA.

    Returns:
        (dict): Dictionary containing system information.
    """
    import psutil  # scoped as slow import

    from ultralytics.utils import ENVIRONMENT  # scope to avoid circular import
    from ultralytics.utils.torch_utils import get_cpu_info, get_gpu_info, get_xpu_info

    gib = 1 << 30  # bytes per GiB
    cuda = torch.cuda.is_available()
    xpu = hasattr(torch, "xpu") and torch.xpu.is_available() 
    check_yolo()
    total, _used, free = shutil.disk_usage("/")

    info_dict = {
        "OS": platform.platform(),
        "Environment": ENVIRONMENT,
        "Python": PYTHON_VERSION,
        "Install": "git" if GIT.is_repo else "pip" if IS_PIP_PACKAGE else "other",
        "Path": str(ROOT),
        "RAM": f"{psutil.virtual_memory().total / gib:.2f} GB",
        "Disk": f"{(total - free) / gib:.1f}/{total / gib:.1f} GB",
        "CPU": get_cpu_info(),
        "CPU count": os.cpu_count(),
        "GPU": get_gpu_info(index=0) if cuda else None,
        "XPU": get_xpu_info() if xpu else None,
        "GPU count": torch.cuda.device_count() if cuda else None,
        "XPU count": torch.xpu.device_count() if xpu else None,
        "CUDA": torch.version.cuda if cuda else None,
    }
  • To verify that this API is present and operational:
(base) root@b60:~# conda activate B60
(B60) root@b60:~# python
Python 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.xpu.device_count()
2

2. Ensure that only explicit xpu device strings (e.g., xpu, xpu:0) enter this branch, preventing numeric CUDA-style indices such as "0" or "1" from being interpreted as XPU devices. The XPU device mapping is preserved strictly as xpu:index.

  • To verify that the model is actually running on the requested XPU device (e.g., xpu:1), I used:
watch -n 0.5 xpu-smi stats -d 1 -j
  • Training is running on xpu:1.
    elif device.startswith("xpu"):  # Intel XPU
        parts = device.split(":")
        index = int(parts[1]) if len(parts) > 1 else 0
        if verbose:
            info = get_xpu_info(index)
            s += f"XPU:{index} ({info})\n"
            LOGGER.info(s if newline else s.rstrip())
        return torch.device("xpu", index)

3. Device-type checks are now used when performing memory queries and related operations.

  • ultralytics/ultralytics/engine/trainer.py: _get_memory() / _clear_memory()
    def _get_memory(self, fraction=False):
        """Get accelerator memory utilization in GB or as a fraction of total memory."""
        memory, total = 0, 0
        if self.device.type == "mps":
            memory = torch.mps.driver_allocated_memory()
            if fraction:
                return __import__("psutil").virtual_memory().percent / 100
        elif self.device.type != "cpu" and self.device.type == "xpu":
            memory = torch.xpu.memory_allocated(self.device)
            total = torch.xpu.get_device_properties(self.device).total_memory
            return ((memory / total) if total > 0 else 0) if fraction else (memory / 2**30)
        elif self.device.type != "cpu":
            memory = torch.cuda.memory_reserved()
            if fraction:
                total = torch.cuda.get_device_properties(self.device).total_memory
        return ((memory / total) if total > 0 else 0) if fraction else (memory / 2**30)
    def _clear_memory(self, threshold: float | None = None):
        """Clear accelerator memory by calling garbage collector and emptying cache."""
        if threshold:
            assert 0 <= threshold <= 1, "Threshold must be between 0 and 1."
            if self._get_memory(fraction=True) <= threshold:
                return
        gc.collect()
        if self.device.type == "mps":
            torch.mps.empty_cache()
        elif self.device.type == "cpu":
            return
        elif self.device.type == "xpu":
            torch.xpu.empty_cache()
        else:
            torch.cuda.empty_cache()

4. Renamed the test to test_xpu to comply with the test script naming conventions.

  • ultralytics/tests/test_xpu.py

5. Lastly, I added an example in the Markdown file demonstrating how to train a model. You may publish it on the website to make it easier for users to follow.

  • Specifying device=xpu will automatically run on xpu:0.
yolo train model=yolo11n.pt data=coco128.yaml epochs=50 imgsz=256 device=xpu
  • If you want to run on a different device, you can explicitly specify:
device=xpu:1
device=xpu:2

@hzdzkjdxyjs (Author) commented:

If we want to go further and support multi-device training, here is what Intel's engineers told me: the following functions are the most important ones for multi-device support.

  • vllm/vllm/platforms/init.py xpu_platform_plugin()

  • vllm/vllm/utils/torch_utils.py

def xpu_platform_plugin() -> str | None:
    is_xpu = False
    logger.debug("Checking if XPU platform is available.")
    try:
        # installed IPEX if the machine has XPUs.
        import intel_extension_for_pytorch  # noqa: F401
        import torch

        if supports_xccl():
            dist_backend = "xccl"
        else:
            dist_backend = "ccl"
            import oneccl_bindings_for_pytorch  # noqa: F401

        if hasattr(torch, "xpu") and torch.xpu.is_available():
            is_xpu = True
            from vllm.platforms.xpu import XPUPlatform

            XPUPlatform.dist_backend = dist_backend
            logger.debug("Confirmed %s backend is available.", XPUPlatform.dist_backend)
            logger.debug("Confirmed XPU platform is available.")
    except Exception as e:
        logger.debug("XPU platform is not available because: %s", str(e))

    return "vllm.platforms.xpu.XPUPlatform" if is_xpu else None
def supports_xccl() -> bool:
    return (
        is_torch_equal_or_newer("2.8.0.dev") and torch.distributed.is_xccl_available()
    )
  • It uses two distributed backends, xccl and ccl.
  • The corresponding package must be imported for the backend to register, because the native backend is not shipped in the base package.
  • I am still researching the remaining details. This is the only multi-device case I can currently run end-to-end; the Intel team does not yet have corresponding training support, so we may be ahead of them here.
  • That is all the information I can provide for now.
  • One more important point: Intel has an environment variable that lets oneAPI control GPU visibility at the driver level, although it does not control torch itself (see the sketch after this list):
  • ZE_AFFINITY_MASK=index0,index1,...
  • https://github.com/hzdzkjdxyjs/How-to-use-llamafactory-at-B60Pro/blob/main/Muliti_GPU_Train.md
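
For example, a minimal sketch of restricting the devices torch can see (an assumption here is that the Level Zero runtime reads this variable during initialization, so it must be set before torch is imported):

import os

os.environ["ZE_AFFINITY_MASK"] = "0,1"  # assumption: expose only devices 0 and 1 to oneAPI/Level Zero

import torch  # must be imported after the mask is set

print(torch.xpu.device_count())  # expected to report 2 if the mask is honored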

My current thinking is to investigate why LLaMA-Factory can support multi-device training; that may be the key. However, LLaMA-Factory itself does not seem to implement multi-device training directly; the support comes from the underlying pip packages, such as transformers.

@glenn-jocher (Member) commented:

Nice follow‑up, the XPU‑specific helpers and stricter device handling make the design more robust, and the renamed test_xpu looks good for targeted coverage.

A couple of small tweaks will help clean things up before we go deeper into review:

  1. It would be better to keep get_xpu_info alongside get_gpu_info in ultralytics/utils/torch_utils.py and import it into collect_system_info(), and to give it a default index so your call in collect_system_info works without arguments, for example:
@functools.lru_cache
def get_xpu_info(index: int = 0):
    """Return a string with system XPU information, i.e. 'Intel(R) Graphics..., 15102MiB'."""
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        props = torch.xpu.get_device_properties(index)
        return f"{props.name}, {props.total_memory / (1 << 20):.0f}MiB"
    return None

and in collect_system_info() something like:

"XPU": get_xpu_info(0) if xpu else None,
"XPU count": torch.xpu.device_count() if xpu else None,
  2. In _get_memory, the condition elif self.device.type != "cpu" and self.device.type == "xpu": can just be elif self.device.type == "xpu": to avoid redundant checks and keep the flow clearly mps → xpu → cuda.

The multi‑XPU notes and the xpu_platform_plugin / xccl hints are very helpful context, but as you said they imply a larger refactor (backend abstraction, DDP launch, optional IPEX/oneCCL integration), so it would be best to keep this PR strictly single‑XPU and minimal, and treat multi‑XPU as a follow‑up design/PR once this path is stable.

The CLI device=xpu example is also useful; once XPU support is merged we can look at adding a short Intel XPU section to the training docs on Ultralytics documentation so users can discover it easily.

@hzdzkjdxyjs (Author) commented:

Thank you very much for your patient guidance. Writing the code this way does make the logic more rigorous, and I have made the changes as you requested. I hope to have the chance to work with your team on multi-device training support, and I will continue exploring related research.

@hzdzkjdxyjs (Author) commented:

First, I want to express my sincere gratitude — thank you for patiently guiding me through the revisions.
Also, this is my very first PR, so I’d like to ask: approximately how long does it usually take for a PR to be reviewed and merged?
I’m really happy to contribute to this open-source project and excited to see my work being helpful.

@hzdzkjdxyjs (Author) commented:

Since this XPU path is single-device only, there are two cases to handle. First, a user may request multi-device training; in this function I added multi-device detection via the "," character. Second, an out-of-range XPU index, for which the error is raised by torch itself.

  • ultralytics/ultralytics/utils/torch_utils.py: select_device()
    elif device.startswith("xpu"):  # Intel XPU
        index_str = device.split(":", 1)[1] if ":" in device else "0"
        if "," in index_str:
            msg = f"Invalid XPU 'device={device}' requested. Use a single index 0-15."
            LOGGER.warning(msg)
            raise ValueError(msg)
        index = int(index_str)
        if verbose:
            info = get_xpu_info(index)
            s += f"XPU:{index} ({info})\n"
            LOGGER.info(s if newline else s.rstrip())
        return torch.device("xpu", index)
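
A small test sketch for this rejection path could look like the following (hypothetical test, reusing the skip condition from the existing XPU test):

import pytest
import torch
from ultralytics.utils.torch_utils import select_device

@pytest.mark.skipif(not hasattr(torch, "xpu") or not torch.xpu.is_available(), reason="XPU not available")
def test_multi_xpu_rejected():
    # Multi-device XPU strings must be rejected with a clear error
    with pytest.raises(ValueError):
        select_device("xpu:0,1")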
  • Behavior of the original function when multi-device training is requested:
(B60) root@b60:~/ultralytics# yolo train model=yolo11n.pt data=coco128.yaml epochs=3 imgsz=256 device=xpu:0,1
Traceback (most recent call last):
  File "/root/anaconda3/envs/B60/bin/yolo", line 7, in <module>
    sys.exit(entrypoint())
  File "/root/ultralytics/ultralytics/cfg/__init__.py", line 985, in entrypoint
    getattr(model, mode)(**overrides)  # default args from model
  File "/root/ultralytics/ultralytics/engine/model.py", line 768, in train
    self.trainer = (trainer or self._smart_load("trainer"))(overrides=args, _callbacks=self.callbacks)
  File "/root/ultralytics/ultralytics/models/yolo/detect/train.py", line 63, in __init__
    super().__init__(cfg, overrides, _callbacks)
  File "/root/ultralytics/ultralytics/engine/trainer.py", line 126, in __init__
    self.device = select_device(self.args.device)
  File "/root/ultralytics/ultralytics/utils/torch_utils.py", line 198, in select_device
    index = int(index_str)
ValueError: invalid literal for int() with base 10: '0,1'
  • With the updated function, requesting more than one device now raises an error from this function itself:
(B60) root@b60:~/ultralytics# yolo train model=yolo11n.pt data=coco128.yaml epochs=3 imgsz=256 device=xpu:0,1
WARNING ⚠️ Invalid XPU 'device=xpu:0,1' requested. Only a single XPU device is supported.
Traceback (most recent call last):
  File "/root/anaconda3/envs/B60/bin/yolo", line 7, in <module>
    sys.exit(entrypoint())
  File "/root/ultralytics/ultralytics/cfg/__init__.py", line 985, in entrypoint
    getattr(model, mode)(**overrides)  # default args from model
  File "/root/ultralytics/ultralytics/engine/model.py", line 768, in train
    self.trainer = (trainer or self._smart_load("trainer"))(overrides=args, _callbacks=self.callbacks)
  File "/root/ultralytics/ultralytics/models/yolo/detect/train.py", line 63, in __init__
    super().__init__(cfg, overrides, _callbacks)
  File "/root/ultralytics/ultralytics/engine/trainer.py", line 126, in __init__
    self.device = select_device(self.args.device)
  File "/root/ultralytics/ultralytics/utils/torch_utils.py", line 198, in select_device
    raise ValueError(msg)
ValueError: Invalid XPU 'device=xpu:0,1' requested. Only a single XPU device is supported.
  • Intel's latest GUNNIR B60 Pro dual-chip cards allow up to 16 per server; both non-integer indices and indices beyond this limit raise errors.
  • An index beyond the maximum device count is reported by torch itself:
(B60) root@b60:~/ultralytics# yolo train model=yolo11n.pt data=coco128.yaml epochs=3 imgsz=256 device=xpu:16
Traceback (most recent call last):
  File "/root/anaconda3/envs/B60/bin/yolo", line 7, in <module>
    sys.exit(entrypoint())
  File "/root/ultralytics/ultralytics/cfg/__init__.py", line 985, in entrypoint
    getattr(model, mode)(**overrides)  # default args from model
  File "/root/ultralytics/ultralytics/engine/model.py", line 768, in train
    self.trainer = (trainer or self._smart_load("trainer"))(overrides=args, _callbacks=self.callbacks)
  File "/root/ultralytics/ultralytics/models/yolo/detect/train.py", line 63, in __init__
    super().__init__(cfg, overrides, _callbacks)
  File "/root/ultralytics/ultralytics/engine/trainer.py", line 126, in __init__
    self.device = select_device(self.args.device)
  File "/root/ultralytics/ultralytics/utils/torch_utils.py", line 206, in select_device
    info = get_xpu_info(index)
  File "/root/ultralytics/ultralytics/utils/torch_utils.py", line 137, in get_xpu_info
    properties = torch.xpu.get_device_properties(index)
  File "/root/anaconda3/envs/B60/lib/python3.10/site-packages/torch/xpu/__init__.py", line 262, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]  # noqa: F821
RuntimeError: The device index is out of range. It must be in [0, 2), but got 16.

@glenn-jocher (Member) commented:

Thanks again for all the careful updates—your latest changes around select_device (explicitly rejecting device=xpu:0,1 and keeping XPU strictly single‑device) plus the XPU‑specific helpers are well aligned with the scope we discussed and will make misconfigurations much clearer for users.

On review/merge timing, we can’t promise a specific timeframe, but this PR is now in a good shape for further maintainer review and we’ll continue any follow‑up discussion here. For multi‑XPU, treating it as a separate design/PR once this single‑XPU path is stable is exactly the right direction, and the notes you’ve gathered about Intel’s backend (xccl/ccl, affinity masks, etc.) will be very useful when we explore that; thanks for helping push the YOLO ecosystem onto more hardware for the whole community.

@hzdzkjdxyjs (Author) commented:

Boss, I've pushed the new multi-XPU training framework; it is now complete in #22850.

@glenn-jocher (Member) commented:

Nice, thanks for splitting multi‑XPU into a separate PR—keeping #22836 focused on single‑XPU support and handling multi‑XPU in #22850 is exactly what we need; we’ll continue any multi‑XPU discussion and review on that new PR.
