This is a CPU tool for benchmarking the peak performance of floating-point and AI ISAs.
It automatically detects the SIMD and DSA ISAs available on the local machine at compile time.
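
As an illustration of what ISA detection can look like on Linux, the sketch below reads the feature list that the kernel exposes in /proc/cpuinfo (`flags` on x86-64, `Features` on arm64). It is not taken from the cpufp build scripts, which may detect features differently; the helper names `parse_cpu_flags` and `supported` are hypothetical.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the feature names from the first 'flags' or 'Features' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("flags", "Features"):
            return set(value.split())
    return set()


def supported(cpuinfo_text, wanted):
    """Keep only the names in `wanted` that the CPU reports, in order."""
    flags = parse_cpu_flags(cpuinfo_text)
    return [name for name in wanted if name in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    # Names as the Linux kernel spells them on x86-64.
    print(supported(text, ["sse", "sse2", "avx", "fma", "avx512f",
                           "avx512_vnni", "amx_int8", "amx_bf16"]))
```
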

Supported platforms:

| Arch | Linux | macOS | Windows |
|---|---|---|---|
| arm64 | yes | no | no |
| e2k | yes | no | no |
| loongarch64 | yes | no | no |
| riscv64 | yes | no | no |
| x86-64 | yes | no | no |

Supported x86-64 ISAs:

| Arch | ISA | Feature | Data Type | Description |
|---|---|---|---|---|
| SIMD | SSE | Vector | fp32 | Before Sandy Bridge |
| SIMD | SSE2 | Vector | fp64 | Before Sandy Bridge |
| SIMD | AVX | Vector | fp32/fp64 | From Sandy Bridge |
| SIMD | FMA | Vector | fp32/fp64 | From Haswell/Zen |
| SIMD | AVX512f | Vector | fp32/fp64 | From Skylake X/Zen4 |
| SIMD | AVX512_VNNI | Vector | int8/int16 | From Cascade Lake/Ice Lake |
| SIMD | AVX_VNNI | Vector | int8/int16 | From Alder Lake |
| SIMD | AVX512_FP16 | Vector | fp16 | From Intel Sapphire Rapids |
| SIMD | AVX512_BF16 | Vector | bf16 | From AMD Zen4 |
| SIMD | AVX_VNNI_INT8 | Vector | int8 | Unknown |
| DSA | AMX_INT8 | Matrix | int8 | From Intel Sapphire Rapids |
| DSA | AMX_BF16 | Matrix | bf16 | From Intel Sapphire Rapids |

Supported arm64 ISAs:

| Arch | ISA | Feature | Data Type | Description |
|---|---|---|---|---|
| SIMD | asimd | Vector | fp32/fp64 | From Cortex-A57/A53 |
| SIMD | asimd_hp | Vector | fp16 | From Cortex-A75/A55 |
| SIMD | asimd_dp | Vector | int8 | From Cortex-A75/A55 |
| SIMD | bf16 | Matrix | bf16 | From Cortex-X2/A710/A510 |
| SIMD | i8mm | Matrix | int8 | From Cortex-X2/A710/A510 |

Supported riscv64 ISAs:

| Arch | ISA | Feature | Data Type | Description |
|---|---|---|---|---|
| SIMD | V | Vector | fp16/fp32/fp64 | From RISC-V "V" vector extension. Version 1.0 |
| DSA | ime | Matrix | int8 | From SpacemiT-X60 |
NOTE: ime is a SpacemiT custom vendor extension.

Supported loongarch64 ISAs:

| Arch | ISA | Feature | Data Type | Description |
|---|---|---|---|---|
| SIMD | LASX | Vector | fp32/fp64 | From Loongson 3A5000 |
| SIMD | LSX | Vector | fp32/fp64 | From Loongson 3A5000 |
| Scalar | FP | Scalar | fp32/fp64 | From Loongson 3A5000 |

Supported e2k ISAs:

| Arch | ISA | Feature | Vector Width | Data Type | Description |
|---|---|---|---|---|---|
| SIMD | v6 | Vector | 128 | fp32/fp64 | FMA |
| SIMD | v5 | Vector | 128 | fp32/fp64 | Combined operations |
| Scalar | v1-v4 | Scalar | | fp64 | Combined operations |
| SIMD | v1-v4 | Vector | 64 | fp32 | Combined operations |

E2K supports combined instructions that perform two independent operations. The effect is like FMA, but with an additional rounding step, because the two operations are independent.
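
The effect of the extra rounding can be seen with ordinary doubles. The sketch below is an illustration only and does not use any E2K instruction: `mul_add_two_roundings` rounds the product and then the sum, while `mul_add_one_rounding` emulates an FMA by computing exactly with `fractions.Fraction` and rounding once. Both function names and the input values are chosen for this example.

```python
from fractions import Fraction


def mul_add_two_roundings(a, b, c):
    # The product is rounded to double, then the sum is rounded again.
    return a * b + c


def mul_add_one_rounding(a, b, c):
    # Exact rational arithmetic, rounded to double once at the end (FMA-like).
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# a * b is exactly 1 + 2**-26 + 2**-54; the last term is lost when the
# product is rounded to double, so only the single-rounding path keeps it.
a = b = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

print(mul_add_two_roundings(a, b, c))  # 0.0
print(mul_add_one_rounding(a, b, c))   # 2**-54, about 5.55e-17
```
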
fmul_addd src1, src2, src3, dst
Multiply the double-precision (64-bit) floating-point values from src1 and src2,
and add the intermediate result to the value from src3. Store the result in dst.
dst[63:0] := src3[63:0] + src1[63:0] * src2[63:0]
| Architecture | Latency | Throughput (CPI) | ALC |
|---|---|---|---|
| elbrus-v4 | 8 | 0.16 | 012345 |
| elbrus-v1 | 8 | 0.25 | 01-34- |
- ALC (Arithmetic Logic Complex/Channel) is an execution port for RISC-like instructions
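
The throughput column can be turned into a rough peak estimate. The sketch below assumes that each `fmul_addd` counts as two floating-point operations (one multiply and one add) and that the ALC column lists the channels able to issue it, with `-` marking an unavailable channel. The clock frequency and core count are placeholders, not measured values, and the function names are hypothetical.

```python
def alc_count(mask):
    """Count usable channels in an ALC string such as '01-34-'."""
    return sum(1 for ch in mask if ch != "-")


def peak_gflops(cpi, flops_per_instr, freq_ghz, cores=1):
    """Peak GFLOPS = (flops per instruction / CPI) * GHz * cores."""
    return flops_per_instr / cpi * freq_ghz * cores


# elbrus-v1: 4 usable channels, so one instruction every 1/4 = 0.25 cycles.
print(alc_count("01-34-"), 1 / alc_count("01-34-"))

# elbrus-v4: 6 usable channels, 1/6 is about 0.1667 (listed as 0.16).
print(alc_count("012345"), 1 / alc_count("012345"))

# Double-precision flops per cycle per core for elbrus-v1 at CPI 0.25.
print(peak_gflops(0.25, 2, 1.0))
```

Multiplying by a real clock frequency in GHz and the number of cores gives the chip-level figure to compare against the benchmark output.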
build x64 version:
./build_x64.sh
build arm64 version:
./build_arm64.sh
build riscv64 version:
./build_riscv64.sh
build loongarch64 version:
./build_loongarch64.sh
build e2k version:
./build_e2k.sh
build arm64 version (for Apple Silicon M series on Asahi Linux):
./build_applem.sh
clean:
./clean.sh
./cpufp --thread_pool=[xxx] --idle_time=yyy
--thread_pool: [xxx] is the list of CPU threads to benchmark; each benchmark thread is pinned by setting its affinity. Refer to the output of the lstopo command for the thread numbering. For example, [0,3,5-8,13-15].
--idle_time: the idle interval, in seconds, between two consecutive benchmarks. The default is 0.
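
To make the list syntax concrete, the sketch below expands a `--thread_pool` value into individual CPU thread numbers. It is an illustration of the format as described above, not the parser used by cpufp; `parse_thread_pool` is a hypothetical name.

```python
def parse_thread_pool(spec):
    """Expand a list such as '[0,3,5-8,13-15]' into CPU thread numbers."""
    body = spec.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    cpus = []
    for item in body.split(","):
        item = item.strip()
        if not item:
            continue
        if "-" in item:
            # 'a-b' is an inclusive range of thread numbers.
            low, high = item.split("-", 1)
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(item))
    return cpus


print(parse_thread_pool("[0,3,5-8,13-15]"))
# [0, 3, 5, 6, 7, 8, 13, 14, 15]
```

On Linux, a list like this could be passed to `os.sched_setaffinity` to pin a process to the same threads.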
x86-64 cpufp benchmark results
riscv64 cpufp benchmark results
loongarch64 cpufp benchmark results
Add Armv9 (SVE, SVE2 and SME) support.