Stars
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
LLM training parallelisms (DP, FSDP, TP, PP) in pure C
A minimal cache manager for PagedAttention, on top of llama3.
FlashMLA: Efficient Multi-head Latent Attention Kernels
Fully Open Language Models with Stellar Performance
🔥 A minimal training framework for scaling FLA models
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
Machine Learning Engineering Open Book
Fully open reproduction of DeepSeek-R1
This is the homepage of a new book entitled "Mathematical Foundations of Reinforcement Learning."
This repository is a curated collection of resources, tutorials, and practical examples designed to guide you through the journey of mastering CUDA programming. Whether you're just starting or look…
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Unofficial implementation of Titans, SOTA memory for transformers, in PyTorch
Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Inference Speed Benchmark for Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States
A self-adaptation framework 🐙 that adapts LLMs to unseen tasks in real time!
What would you do with 1000 H100s...
Minimalistic 4D-parallelism distributed training framework for educational purposes
A generic, composable multi-dimensional array library.
HunyuanVideo: A Systematic Framework For Large Video Generation Model
A PyTorch library for implementing flow matching algorithms, featuring continuous and discrete flow matching implementations. It includes practical examples for both text and image modalities.
"Deep Generative Modeling": Introductory Examples