Stars
verl: Volcano Engine Reinforcement Learning for LLMs
Code for the paper "Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling"
CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA te…
Helpful kernel tutorials and examples for tile-based GPU programming
Accelerating MoE with IO and Tile-aware Optimizations
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
Offline optimization of your disaggregated Dynamo graph
Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.
Efficient implementations of Native Sparse Attention
Distributed Compiler based on Triton for Parallel Systems
FlashInfer: Kernel Library for LLM Serving
Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serving systems.
An efficient implementation of the NSA (Native Sparse Attention) kernel
Perfect virtual display for game streaming
Self-hosted game stream host for Moonlight.
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
MoBA: Mixture of Block Attention for Long-Context LLMs
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention that achieves a 2–5× speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
Ongoing research training transformer models at scale
DeepEP: an efficient expert-parallel communication library
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
FlashMLA: Efficient Multi-head Latent Attention Kernels
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Tile primitives for speedy kernels