Release v3.8.0

@sa-faizal sa-faizal released this 14 Oct 20:22
afe01bd

IREE Release v3.8.0

1. Compiler

1.1 Data Tiling & Scaled Matmul

  • Introduced DataTiledScaledMMAAttr and implemented scaled matmul data tiling materialization using new scaled intrinsic attributes for improved codegen flexibility. (#22176, #22189)
  • Added ping-pong ukernel support for FP8 and FP16 data tiling, tuned for LLaMA workloads, delivering up to 30–40% latency reduction vs. non-data-tiled paths. (#21919)
  • Added ROCm encoding specialization via UKernelProviderInterface for data-tiled ukernels. (#21914)
  • Introduced intentional padded configurations for (I)GEMM to improve convolution performance by ~8% with no degradation in backward paths. (#21931)
  • Disabled data-tiling by default for CPU backends due to memory and backend inconsistencies; it’s now opt-in via --iree-opt-data-tiling, with updated CPU docs and tests reflecting the change. (#21935)
  • Published a detailed blog on Data Tiling introducing how operand layouts are transformed to match hardware-preferred formats for better locality and cache efficiency. (https://iree.dev/community/blog/2025-08-25-data-tiling-walkthrough/)
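
As a rough illustration of the layout transformation that data tiling refers to, the sketch below packs a row-major matrix into fixed-size tiles so that each tile's elements sit next to each other in memory. This is a conceptual sketch only, not IREE's implementation: the function names `pack_tiles` and `flatten`, the tile sizes, and the zero padding value are all assumptions made for the example.

```python
def pack_tiles(matrix, tile_rows, tile_cols, pad=0):
    """Pack a row-major 2-D list into a [R/tr][C/tc][tr][tc] tiled layout.

    Hypothetical helper for illustration; edge tiles are filled with `pad`.
    """
    rows, cols = len(matrix), len(matrix[0])
    outer_rows = -(-rows // tile_rows)  # ceiling division
    outer_cols = -(-cols // tile_cols)
    packed = []
    for i in range(outer_rows):
        row_of_tiles = []
        for j in range(outer_cols):
            tile = []
            for r in range(tile_rows):
                src_r = i * tile_rows + r
                tile_row = []
                for c in range(tile_cols):
                    src_c = j * tile_cols + c
                    in_bounds = src_r < rows and src_c < cols
                    tile_row.append(matrix[src_r][src_c] if in_bounds else pad)
                tile.append(tile_row)
            row_of_tiles.append(tile)
        packed.append(row_of_tiles)
    return packed


def flatten(packed):
    """Linearize the tiled layout in the order it would be stored in memory."""
    return [v for row in packed for tile in row for tile_row in tile for v in tile_row]


# A 4x4 matrix holding 0..15 in row-major order, packed into 2x2 tiles.
matrix = [[r * 4 + c for c in range(4)] for r in range(4)]
print(flatten(pack_tiles(matrix, 2, 2)))
# [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

After packing, the four elements of each 2x2 tile are adjacent in the flattened order, which is the locality property the tiled layout is meant to provide. Real tile shapes depend on the target's intrinsics and are chosen by the compiler.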

1.2 Convolution

  • Transposed input backward convolution filter layout from CHWF → FHWC, aligning with matmul_transpose_b and improving performance. (#22100)
  • Reordered iterator dimensions for input backward convolutions to match forward NHWC-FHWC conv layout, simplifying autotuning and shape handling. (#22208)
  • Enabled extract slice propagation during convolution padding to improve fusion opportunities. (#21948)
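
To make the CHWF → FHWC change concrete, the sketch below permutes a filter stored as nested lists from `[C][H][W][F]` to `[F][H][W][C]`. It only illustrates the index permutation, not how the compiler performs it: the helper name `chwf_to_fhwc` and the tiny filter shape are assumptions for the example.

```python
def chwf_to_fhwc(w):
    """Permute a nested-list filter from [C][H][W][F] to [F][H][W][C].

    Hypothetical helper for illustration only.
    """
    C, H, W, F = len(w), len(w[0]), len(w[0][0]), len(w[0][0][0])
    return [
        [[[w[c][h][x][f] for c in range(C)] for x in range(W)] for h in range(H)]
        for f in range(F)
    ]


# Tiny filter with C=2, H=1, W=2, F=3; each value encodes its own (c, h, x, f) index.
C, H, W, F = 2, 1, 2, 3
w_chwf = [
    [[[1000 * c + 100 * h + 10 * x + f for f in range(F)] for x in range(W)] for h in range(H)]
    for c in range(C)
]
w_fhwc = chwf_to_fhwc(w_chwf)

# In FHWC, all weights belonging to one F index are grouped in a single slice.
print(w_fhwc[0])  # [[[0, 1000], [10, 1010]]]
```

With F outermost, each F slice holds its H, W, and C values contiguously, which is analogous to a matmul whose B operand is stored transposed (one row per output column).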

1.3 Matmul & Vector Distribute

  • Removed virtual MMAs from vector distribute matmul/conv pipelines to fix regressions and restore original performance on Punet configurations. (#22202)
  • Added support for distributing subgroups across multiple M dimensions in vector distribute pipelines, improving parallel utilization. (#22000)

1.4 Others

2. Runtime

  • Split hoisted async constant lifetimes to drastically reduce retained memory (e.g., 9 GB → 500 KB in large tiled workloads). (#21995)
  • Added per-entry-point flags and workgroup size emission, preparing for new HAL APIs and better runtime introspection.
    • ⚠️ Breaking change: local executable library format bumped to v0.6. (#21754, #22078, #21950)
  • Updated GPU executable headers for versioning and added a new infer-format call to safely infer executable data format and size.
    • ⚠️ Breaking change: requires GPU executable recompilation. (#21763)
  • CPU matmul configuration switched to linalg::LinalgOp interface for better op fusion and flexibility. (#21954)
  • General Enhancements and Fixes (#22101, #22110, #22102, #22048, #21921, #22075)

Change Log

Full Changelog: v3.7.0...v3.8.0