Feature support_xpu #22836
Conversation
👋 Hello @hzdzkjdxyjs, thank you for submitting a pull request! For more guidance, please refer to our Contributing Guide. Don't hesitate to leave a comment if you have any questions. Thank you for contributing to Ultralytics! 🚀
UltralyticsAssistant left a comment
🔍 PR Review
Made with ❤️ by Ultralytics Actions
Please fix the XPU selection/logging path so CUDA requests are not intercepted, restore get_gpu_info() to report the correct backend, and ensure _clear_memory() always empties XPU caches when training on XPU devices.
💬 Posted 3 inline comments
ultralytics/utils/torch_utils.py
Outdated
```python
mps = device in {"mps", "mps:0"}  # Apple Metal Performance Shaders (MPS)
if cpu or mps:
    os.environ["CUDA_VISIBLE_DEVICES"] = ""  # force torch.cuda.is_available() = False
elif hasattr(torch, "xpu") and torch.xpu.is_available():
```
❗ CRITICAL: This elif hasattr(torch, "xpu")... block runs any time XPU is available, even when the caller explicitly requested another backend such as device="cuda:0". For those requests index stays None, so the CUDA path below is never reached and the function falls through without returning the desired CUDA device (often defaulting to CPU). Please gate this branch so it only executes when the requested target is actually XPU (or no device was specified) and allow the existing CUDA logic to handle CUDA inputs.
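For illustration, a minimal sketch of the kind of gating being asked for; the helper name `_resolve_xpu_request` is hypothetical and not part of the patch, the real change belongs inside `select_device()`:

```python
import torch


def _resolve_xpu_request(device: str):
    """Illustrative sketch: only take the XPU path when XPU was explicitly requested."""
    if device.startswith("xpu") and hasattr(torch, "xpu") and torch.xpu.is_available():
        index = int(device.split(":", 1)[1]) if ":" in device else 0
        return torch.device("xpu", index)
    return None  # fall back to the existing CUDA/CPU handling for all other requests
```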
done
```python
@functools.lru_cache
def get_gpu_info(index):
    """Return a string with system GPU information, i.e. 'Tesla T4, 15102MiB'."""
    if hasattr(torch, "xpu") and torch.xpu.is_available():
```
get_gpu_info() now unconditionally returns XPU properties whenever torch.xpu.is_available(), so CUDA calls (which still invoke this helper) will try to index XPU devices and either log the wrong hardware or raise when CUDA and XPU device counts differ. Only use the XPU path when the caller is querying an XPU device (or when CUDA is unavailable) and keep the CUDA branch as the default for CUDA indices.
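A rough sketch of the requested behavior; the helper name `_device_info` is illustrative, the real fix belongs inside `get_gpu_info()`:

```python
import torch


def _device_info(device_type: str, index: int) -> str:
    """Illustrative sketch: report properties for the backend that was actually requested."""
    if device_type == "xpu" and hasattr(torch, "xpu") and torch.xpu.is_available():
        props = torch.xpu.get_device_properties(index)
    else:  # keep CUDA as the default path for CUDA indices
        props = torch.cuda.get_device_properties(index)
    return f"{props.name}, {props.total_memory / (1 << 20):.0f}MiB"
```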
done
ultralytics/engine/trainer.py
Outdated
```python
    torch.mps.empty_cache()
elif self.device.type == "cpu":
    return
elif hasattr(torch, "xpu") and torch.xpu.is_available():
```
💡 MEDIUM: torch.xpu.empty_cache() is only executed when CUDA is completely unavailable. On systems that expose both CUDA and XPU, running on XPU will skip cache clearing and leak memory because the code falls into the CUDA branch instead. Tie this branch to self.device.type == "xpu" (similar to the MPS branch) so XPU memory is cleared irrespective of CUDA availability.
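A minimal sketch of the suggested structure; the standalone function is illustrative only, in the PR this logic sits inside `Trainer._clear_memory()`:

```python
import gc

import torch


def _clear_cache_for(device: torch.device) -> None:
    """Illustrative sketch: key cache clearing off the active device type, not global availability."""
    gc.collect()
    if device.type == "mps":
        torch.mps.empty_cache()
    elif device.type == "xpu":
        torch.xpu.empty_cache()  # cleared even when CUDA also happens to be present
    elif device.type == "cuda":
        torch.cuda.empty_cache()
```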
done
Regarding the CUDA/XPU conflict question: I don't think it will be an issue in practice, because you cannot install both a CUDA build and an XPU build of torch in the same environment, so the code design does not need to account for that conflict.
Thanks for the very detailed PR and extra note on the CUDA/XPU interaction. You're right that with the current Intel wheels you typically don't get CUDA and XPU in the same environment, but for long‑term robustness it would still be better if the XPU paths only triggered when the selected device is actually XPU, rather than just when `torch.xpu.is_available()` returns True. For example:

```python
def _get_memory(self, fraction=False):
    memory, total = 0, 0
    if self.device.type == "mps":
        ...
    elif self.device.type == "xpu":
        memory = torch.xpu.memory_allocated(self.device)
        total = torch.xpu.get_device_properties(self.device).total_memory
        return (memory / total if total > 0 else 0) if fraction else (memory / 2**30)
    elif self.device.type != "cpu":
        memory = torch.cuda.memory_reserved()
        if fraction:
            total = torch.cuda.get_device_properties(self.device).total_memory
    return (memory / total if total > 0 else 0) if fraction else (memory / 2**30)
```

and similarly in `_clear_memory()`. If you can update those pieces along these lines, the rest of the changes look like a good, minimal first step for single‑XPU support and we can continue the detailed review in this PR.
Force-pushed from 373bd71 to ed68ccf
Force-pushed from 3176256 to fce4605
If we want to go further and support multi-card training, Intel's engineers told me that the following functions are the most important ones for multi-card support:
```python
def xpu_platform_plugin() -> str | None:
    is_xpu = False
    logger.debug("Checking if XPU platform is available.")
    try:
        # installed IPEX if the machine has XPUs.
        import intel_extension_for_pytorch  # noqa: F401
        import torch

        if supports_xccl():
            dist_backend = "xccl"
        else:
            dist_backend = "ccl"
            import oneccl_bindings_for_pytorch  # noqa: F401

        if hasattr(torch, "xpu") and torch.xpu.is_available():
            is_xpu = True
            from vllm.platforms.xpu import XPUPlatform

            XPUPlatform.dist_backend = dist_backend
            logger.debug("Confirmed %s backend is available.", XPUPlatform.dist_backend)
            logger.debug("Confirmed XPU platform is available.")
    except Exception as e:
        logger.debug("XPU platform is not available because: %s", str(e))
    return "vllm.platforms.xpu.XPUPlatform" if is_xpu else None


def supports_xccl() -> bool:
    return (
        is_torch_equal_or_newer("2.8.0.dev") and torch.distributed.is_xccl_available()
    )
```
My current plan is to study why LLaMA-Factory can support multi-card training, since that may be the key. However, LLaMA-Factory itself does not seem to implement multi-card training; the multi-card support comes from the underlying pip packages such as transformers.
Nice follow‑up, the XPU‑specific helpers and the stricter device handling are a good improvement. A couple of small tweaks will help clean things up before we go deeper into review:

```python
@functools.lru_cache
def get_xpu_info(index: int = 0):
    """Return a string with system XPU information, i.e. 'Intel(R) Graphics..., 15102MiB'."""
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        props = torch.xpu.get_device_properties(index)
        return f"{props.name}, {props.total_memory / (1 << 20):.0f}MiB"
    return None
```

and in the system info report:

```python
"XPU": get_xpu_info(0) if xpu else None,
"XPU count": torch.xpu.device_count() if xpu else None,
```

The multi‑XPU notes and the … The CLI …
Force-pushed from f819fdd to 7d00399
Thank you very much for your patient guidance. Writing the code this way does make the logic more rigorous, and I have made the changes as you requested. I hope to have the opportunity to work with your team on multi-card training support, and I will continue to explore the related research.
First, I want to express my sincere gratitude — thank you for patiently guiding me through the revisions.
Since this XPU path is single-card only, there are two cases to handle. The first is a multi-card request: in this function I added multi-card detection by checking for the "," character. The second is an out-of-range XPU index, and that warning is raised by torch itself.
```python
elif device.startswith("xpu"):  # Intel XPU
    index_str = device.split(":", 1)[1] if ":" in device else "0"
    if "," in index_str:
        msg = f"Invalid XPU 'device={device}' requested. Use a single index 0-15."
        LOGGER.warning(msg)
        raise ValueError(msg)
    index = int(index_str)
    if verbose:
        info = get_xpu_info(index)
        s += f"XPU:{index} ({info})\n"
        LOGGER.info(s if newline else s.rstrip())
    return torch.device("xpu", index)
```
```
(B60) root@b60:~/ultralytics# yolo train model=yolo11n.pt data=coco128.yaml epochs=3 imgsz=256 device=xpu:0,1
Traceback (most recent call last):
  File "/root/anaconda3/envs/B60/bin/yolo", line 7, in <module>
    sys.exit(entrypoint())
  File "/root/ultralytics/ultralytics/cfg/__init__.py", line 985, in entrypoint
    getattr(model, mode)(**overrides) # default args from model
  File "/root/ultralytics/ultralytics/engine/model.py", line 768, in train
    self.trainer = (trainer or self._smart_load("trainer"))(overrides=args, _callbacks=self.callbacks)
  File "/root/ultralytics/ultralytics/models/yolo/detect/train.py", line 63, in __init__
    super().__init__(cfg, overrides, _callbacks)
  File "/root/ultralytics/ultralytics/engine/trainer.py", line 126, in __init__
    self.device = select_device(self.args.device)
  File "/root/ultralytics/ultralytics/utils/torch_utils.py", line 198, in select_device
    index = int(index_str)
ValueError: invalid literal for int() with base 10: '0,1'

(B60) root@b60:~/ultralytics# yolo train model=yolo11n.pt data=coco128.yaml epochs=3 imgsz=256 device=xpu:0,1
WARNING ⚠️ Invalid XPU 'device=xpu:0,1' requested. Only a single XPU device is supported.
Traceback (most recent call last):
  File "/root/anaconda3/envs/B60/bin/yolo", line 7, in <module>
    sys.exit(entrypoint())
  File "/root/ultralytics/ultralytics/cfg/__init__.py", line 985, in entrypoint
    getattr(model, mode)(**overrides) # default args from model
  File "/root/ultralytics/ultralytics/engine/model.py", line 768, in train
    self.trainer = (trainer or self._smart_load("trainer"))(overrides=args, _callbacks=self.callbacks)
  File "/root/ultralytics/ultralytics/models/yolo/detect/train.py", line 63, in __init__
    super().__init__(cfg, overrides, _callbacks)
  File "/root/ultralytics/ultralytics/engine/trainer.py", line 126, in __init__
    self.device = select_device(self.args.device)
  File "/root/ultralytics/ultralytics/utils/torch_utils.py", line 198, in select_device
    raise ValueError(msg)
ValueError: Invalid XPU 'device=xpu:0,1' requested. Only a single XPU device is supported.

(B60) root@b60:~/ultralytics# yolo train model=yolo11n.pt data=coco128.yaml epochs=3 imgsz=256 device=xpu:16
Traceback (most recent call last):
  File "/root/anaconda3/envs/B60/bin/yolo", line 7, in <module>
    sys.exit(entrypoint())
  File "/root/ultralytics/ultralytics/cfg/__init__.py", line 985, in entrypoint
    getattr(model, mode)(**overrides) # default args from model
  File "/root/ultralytics/ultralytics/engine/model.py", line 768, in train
    self.trainer = (trainer or self._smart_load("trainer"))(overrides=args, _callbacks=self.callbacks)
  File "/root/ultralytics/ultralytics/models/yolo/detect/train.py", line 63, in __init__
    super().__init__(cfg, overrides, _callbacks)
  File "/root/ultralytics/ultralytics/engine/trainer.py", line 126, in __init__
    self.device = select_device(self.args.device)
  File "/root/ultralytics/ultralytics/utils/torch_utils.py", line 206, in select_device
    info = get_xpu_info(index)
  File "/root/ultralytics/ultralytics/utils/torch_utils.py", line 137, in get_xpu_info
    properties = torch.xpu.get_device_properties(index)
  File "/root/anaconda3/envs/B60/lib/python3.10/site-packages/torch/xpu/__init__.py", line 262, in get_device_properties
    return _get_device_properties(device) # type: ignore[name-defined] # noqa: F821
RuntimeError: The device index is out of range. It must be in [0, 2), but got 16.
```
Thanks again for all the careful updates; your latest changes around XPU device selection and memory handling address the earlier review points. On review/merge timing, we can't promise a specific timeframe, but this PR is now in good shape for further maintainer review and we'll continue any follow‑up discussion here. For multi‑XPU, treating it as a separate design/PR once this single‑XPU path is stable is exactly the right direction, and the notes you've gathered about Intel's backends (xccl/ccl) will be useful context there.
Boss, I have put together a new multi-XPU training framework; it is now complete in #22850.
I have read the CLA Document and I sign the CLA
This PR adds initial Intel XPU single-device training support to Ultralytics.
It is a minimal, safe, backward-compatible implementation that activates the XPU path only when the installed PyTorch build supports XPU.
🚀 Motivation
Basic Configuration
In fact, the hardware vendor does not matter, because PyTorch is not tightly bound to the GPU model. The only real complexity is the driver installation. As long as your driver installation succeeds, everything should work normally.
Environment Installation
Avoid runtime errors on environments where XPU is not supported.
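As a quick illustration of the guard this relies on (a sketch; `xpu_is_usable` is a hypothetical helper name, the PR itself uses the same `hasattr`/`is_available` pattern inline):

```python
import torch


def xpu_is_usable() -> bool:
    """Return True only when the installed PyTorch build exposes a working XPU device."""
    return hasattr(torch, "xpu") and torch.xpu.is_available()


device = "xpu:0" if xpu_is_usable() else "cpu"  # degrade gracefully when XPU is absent
```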
Warning
This is a warning message — please pay attention to the explanation here.
```
      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       1/50      1.18G      3.651       5.77        4.3        162        640: 100% ━━━━━━━━━━━━ 8/8 3.4s/it 27.1s
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 75% ━━━━━━━━━─── 3/4 4.0s/it 6.4s<4.0s
```
Test case
For long-duration stability and stress testing, I increased the training schedule to 50 epochs.
Unfortunately, when training entirely from scratch using the YAML configuration, the model performance is not ideal.
That said, I believe the ecosystem should not remain limited to a single NVIDIA GPU. Therefore, our priority should be to complete the framework-level adaptation first.
Operator-level optimization can come afterward — and at that stage, we will need stronger support and collaboration from the Intel team.
When training using pretrained weights only, the results are noticeably better.
The following results are obtained using partially pretrained weights.
This limitation is intentional.
To keep this PR minimal, safe and upstream-ready, only single-XPU support is implemented.
Multi-XPU can be added later after structural refactoring.
⸻
✅ Summary
This PR provides initial, backward-compatible single-XPU training support. It significantly broadens Ultralytics’ ecosystem beyond CUDA-only hardware.
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Adds initial Intel XPU support across device selection, memory management, timing, AMP checks, and testing for YOLO11.
📊 Key Changes
- `tests/xpu_test.py` to validate YOLO11 forward pass on Intel XPU (`model.to("xpu")`).
- `Trainer._get_memory` and `_clear_memory` to correctly report and clear memory on Intel XPU devices.
- `check_amp` to detect Intel XPU and explicitly disable AMP on XPU with a clear log warning.
- `get_gpu_info` to return Intel XPU device name and memory when XPU is available.
- `select_device` to recognize `xpu` targets, log XPU device info, and return a proper `torch.device("xpu", index)`.
- `time_sync` to synchronize Intel XPU before timing when available.

🎯 Purpose & Impact