Understanding R1-Zero-Like Training: A Critical Perspective
A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
Triton-based collection of sparse and quantized attention kernels
Distributed parallel 3D-Causal-VAE for efficient training and inference
High performance inference engine for diffusion models
Toolchain built around Megatron-LM for distributed training
A curated list of recent papers on efficient video attention for video diffusion models, covering sparsification, quantization, caching, and related techniques.
Paper list for the survey "A Survey on Vision-Language-Action Models: An Action Tokenization Perspective"
Utility scripts for PyTorch (e.g., making Perfetto show kernels that would otherwise disappear, a memory profiler that understands lower-level allocations such as NCCL, ...)
Omni_Infer is a suite of inference accelerators designed for the Ascend NPU platform, offering native support and an expanding feature set.
slime is an LLM post-training framework for RL scaling.
CUDA Matrix Multiplication Optimization
Bridges Megatron-Core to Hugging Face and reinforcement learning frameworks
A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.
Graphic notes on Gilbert Strang's "Linear Algebra for Everyone"