Cross-posting this issue from ipex, in case the torch-ccl team is not aware of it.
Key issues:
- Compute and collective communications do not overlap on Intel GPU devices
- Collectives block the host thread, rather than launching a kernel and returning immediately (as on NVIDIA devices) — see the timing sketch after this list
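For reference, below is a minimal sketch of the kind of host-side timing check that exposes the blocking behaviour. It assumes a standard `intel_extension_for_pytorch` + `oneccl_bindings_for_pytorch` install and a multi-rank launch (e.g. via `torchrun --nproc_per_node=2`); the package, backend, and device names are the usual ones, but the exact launch command and tensor size here are illustrative, not the exact script from the other thread.

```python
import os
import time

import torch
import torch.distributed as dist
import intel_extension_for_pytorch  # noqa: F401  registers the "xpu" device
import oneccl_bindings_for_pytorch  # noqa: F401  registers the "ccl" backend


def main() -> None:
    dist.init_process_group(backend="ccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"xpu:{local_rank}")

    x = torch.randn(64 * 1024 * 1024, device=device)  # ~256 MB of fp32
    torch.xpu.synchronize()

    # Time only the host-side call. With NCCL on CUDA this returns almost
    # immediately because the collective is merely enqueued; the report here
    # is that with ccl on XPU the call takes roughly as long as the
    # collective itself, i.e. the host thread is blocked.
    t0 = time.perf_counter()
    dist.all_reduce(x)
    launch_s = time.perf_counter() - t0

    torch.xpu.synchronize()
    total_s = time.perf_counter() - t0

    if dist.get_rank() == 0:
        print(f"host-side launch: {launch_s * 1e3:.2f} ms, "
              f"end-to-end: {total_s * 1e3:.2f} ms")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```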
The PyTorch profiler traces highlight the issues (copied from the other thread):
- A100 trace (screenshot in the original issue): non-blocking kernel launch, with comms/compute overlap.
- Intel Max 1550 trace (screenshot in the original issue): blocking kernel launch, with no comms/compute overlap.
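For anyone who wants to reproduce comparable traces, a rough sketch using `torch.profiler` is below. It assumes an already-initialized process group (as in the repro above) and a PyTorch build that exposes `ProfilerActivity.XPU`; older ipex releases surface XPU profiling differently, so treat this as illustrative rather than the exact script behind the screenshots.

```python
import torch
import torch.distributed as dist
from torch.profiler import ProfilerActivity, profile


def step(x: torch.Tensor, w: torch.Tensor) -> None:
    y = x @ w              # compute
    dist.all_reduce(y)     # collective
    _ = x @ w              # independent compute that could overlap with the collective


def capture_trace(device: str, path: str) -> None:
    x = torch.randn(4096, 4096, device=device)
    w = torch.randn(4096, 4096, device=device)
    activity = ProfilerActivity.XPU if device.startswith("xpu") else ProfilerActivity.CUDA
    with profile(activities=[ProfilerActivity.CPU, activity]) as prof:
        for _ in range(5):
            step(x, w)
    prof.export_chrome_trace(path)  # view in chrome://tracing or Perfetto


# e.g. capture_trace("xpu:0", "xpu_trace.json") on the Max 1550,
#      capture_trace("cuda:0", "a100_trace.json") on the A100
```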
See the other thread for more details.