Description
🐛 Describe the bug
This isn't exactly a bug, per se, but it is misleading. Thanks to @mikaylagawarecki for pointing out the following phenomenon in a parallel file; I now realize we have the following behavior in torch/headeronly/util/Half.h today.
Consider the following preprocessor guard:
pytorch/torch/headeronly/util/Half.h
Lines 44 to 47 in 6861fa4
#if (defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_AVX512)) && \
    !defined(__APPLE__)
#include <torch/headeronly/cpu/vec/vec_half.h>
#endif
When libtorch compiles Half.h, it properly generates the fast vectorized logic depending on how CPU_CAPABILITY_AVX2 and CPU_CAPABILITY_AVX512 are set. Great. This is expected.
What may be unexpected is that custom ops that include the header-only Half.h will not have CPU_CAPABILITY_AVX2 or CPU_CAPABILITY_AVX512 defined, and so will not get the performant CPU code for float2half_scalar and half2float_scalar in Half.h.
Versions
on main
cc @malfet @seemethere @chauhang @penguinwu @zou3519 @bdhirsh @swolchok