Starred repositories
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
Examples of CUDA implementations using CUTLASS CuTe
Examples for Recommenders - easy to train and deploy on accelerated infrastructure.
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
Running large language models on a single GPU for throughput-oriented scenarios.
NEO is an LLM inference engine built to alleviate the GPU memory crisis through CPU offloading
A machine learning compiler for GPUs, CPUs, and ML accelerators
HugeCTR is a high-efficiency GPU framework designed for Click-Through Rate (CTR) estimation training
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
FlashMLA: Efficient Multi-head Latent Attention Kernels
My learning notes and code for ML systems (MLSys).
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Step-by-step optimization of CUDA SGEMM
flash attention tutorial written in python, triton, cuda, cutlass
How to optimize common algorithms in CUDA.
😎 A curated list of Python Asyncio resources, including frameworks, libraries, and software
A high-throughput and memory-efficient inference and serving engine for LLMs
Practice examples for Nsight Compute profiling (command line)
Instructions, Docker images, and examples for Nsight Compute and Nsight Systems
GLake: optimizing GPU memory management and IO transmission.
Applied AI experiments and examples for PyTorch