Stars
Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
Tutel MoE: an optimized Mixture-of-Experts library supporting GptOss/DeepSeek/Kimi-K2/Qwen3 with FP8/NVFP4/MXFP4
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Awesome LLM compression research papers and tools.
Understanding Deep Learning - Simon J.D. Prince
This project shares the technical principles behind large language models together with hands-on experience (LLM engineering and bringing LLM applications to production).
Code repo for the paper "SpinQuant: LLM quantization with learned rotations"
My learning notes and code for machine learning systems (ML SYS).
Code for the paper "Evaluating Large Language Models Trained on Code"
A framework for the evaluation of autoregressive code generation language models.
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime (a minimal low-bit quantization sketch follows this list)
A low-latency & high-throughput serving engine for LLMs
A throughput-oriented high-performance serving framework for LLMs
ROCm / flash-attention
Forked from Dao-AILab/flash-attention: fast and memory-efficient exact attention
8-bit CUDA functions for PyTorch
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
Running large language models on a single GPU for throughput-oriented scenarios.
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
Code for the NeurIPS 2024 paper "QuaRot": end-to-end 4-bit inference of large language models (see the rotation sketch after this list).
Model Compression Toolbox for Large Language Models and Diffusion Models
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
[ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling
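Several entries above (the low-bit quantization toolkits, QServe, OmniQuant, and the LLM compression paper lists) revolve around quantizing LLM weights down to a handful of bits. As a point of reference, here is a minimal, generic sketch of symmetric per-channel INT4 weight quantization in PyTorch; it only illustrates the basic round-to-scale idea and is not the API of any library listed above.

```python
# Generic sketch: symmetric per-output-channel INT4 weight quantization.
# Illustrative only; not the API of any of the libraries listed above.
import torch

def quantize_int4(weight: torch.Tensor):
    """Quantize a 2-D weight matrix to signed 4-bit values, one scale per row."""
    # Choose the per-row scale so the largest magnitude maps to the INT4 maximum (7).
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    # Round to the nearest integer and clamp to the signed 4-bit range [-8, 7].
    q = torch.clamp(torch.round(weight / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate floating-point weight for matrix multiplication."""
    return q.to(torch.float32) * scale

if __name__ == "__main__":
    w = torch.randn(4, 8)
    q, s = quantize_int4(w)
    w_hat = dequantize_int4(q, s)
    print("max abs reconstruction error:", (w - w_hat).abs().max().item())
```

Real toolkits add group-wise scales, zero points, outlier handling, and fused low-bit kernels on top of this basic scheme.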
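QuaRot and SpinQuant (both listed above) rotate weights and activations with an orthogonal matrix before quantization so that outliers are spread across channels. The toy sketch below shows why such a rotation preserves the layer output; it uses a random orthogonal matrix for simplicity and is not the Hadamard construction or learned-rotation procedure of either paper.

```python
# Toy sketch of rotation-based outlier smoothing (in the spirit of QuaRot/SpinQuant).
# For an orthogonal R, (x R)(W R)^T = x R R^T W^T = x W^T, so folding R into the
# weights and rotating the activations leaves the linear layer's output unchanged
# while spreading large per-channel values more evenly before quantization.
import torch

def random_orthogonal(n: int) -> torch.Tensor:
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix.
    q, _ = torch.linalg.qr(torch.randn(n, n))
    return q

torch.manual_seed(0)
d_in, d_out = 16, 8
W = torch.randn(d_out, d_in)   # linear layer weight
x = torch.randn(2, d_in)       # a small batch of activations
R = random_orthogonal(d_in)

W_rot = W @ R                  # rotation folded into the weights offline
x_rot = x @ R                  # matching rotation applied to activations at run time

y_ref = x @ W.T
y_rot = x_rot @ W_rot.T        # identical up to floating-point error
print("max output difference:", (y_ref - y_rot).abs().max().item())
```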