Stars
Flash Attention in ~100 lines of CUDA (forward pass only)
Triton implementation of Flash Attention 2.0
[NeurIPS 2025 Spotlight] Reasoning Environments for Reinforcement Learning with Verifiable Rewards
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Implementation of papers in 100 lines of code.
An easy-to-use, scalable and high-performance RLHF framework based on Ray (PPO & GRPO & REINFORCE++ & vLLM & Dynamic Sampling & Async Agentic RL)
Minimal reproduction of DeepSeek R1-Zero
Instead of running one environment at a time or one per thread, run everything in batch using numpy on a single core (see the sketch after this list).
Fully open reproduction of DeepSeek-R1
🚀 Efficient implementations of state-of-the-art linear attention models
A PyTorch native platform for training generative AI models
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…
FlagGems is an operator library for large language models implemented in the Triton Language.
PyTorch native quantization and sparsity for training and inference
Development repository for the Triton language and compiler
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
A very simple shared memory dict implementation
Simple, minimal implementation of the Mamba SSM in one file of PyTorch.
Seamless operability between C++11 and Python
Implementation of Denoising Diffusion Probabilistic Model in Pytorch
Denoising Diffusion Probabilistic Models
A minimal PyTorch implementation of probabilistic diffusion models for 2D datasets.
An open source implementation of CLIP.
PyTorch Implementation of OpenAI's Image GPT
Large Language Model-enhanced Recommender System Papers
An unnecessarily tiny implementation of GPT-2 in NumPy.
Code and model for the paper "Improving Language Understanding by Generative Pre-Training"
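The batched-environments entry above (run everything in batch using numpy on a single core) can be illustrated with a minimal sketch. The toy random-walk environment and all names below are my own assumptions for illustration, not the starred repo's API.

```python
# Minimal sketch of batched environments: the state of N environments lives in
# numpy arrays and every environment is stepped at once with vectorized ops,
# instead of looping over environments or spawning one thread per environment.
import numpy as np

class BatchedRandomWalkEnv:
    """Toy 1-D random walk; an episode ends when |position| reaches `target`."""

    def __init__(self, num_envs: int, target: int = 10, seed: int = 0):
        self.num_envs = num_envs
        self.target = target
        self.rng = np.random.default_rng(seed)
        self.pos = np.zeros(num_envs, dtype=np.int64)

    def reset(self) -> np.ndarray:
        self.pos[:] = 0
        return self.pos.copy()

    def step(self, actions: np.ndarray):
        # actions: (num_envs,) array of 0 (move left) or 1 (move right)
        self.pos += np.where(actions == 1, 1, -1)
        done = np.abs(self.pos) >= self.target
        reward = done.astype(np.float32)   # +1 when a walk reaches the target
        self.pos[done] = 0                 # auto-reset finished environments
        return self.pos.copy(), reward, done

# Step 4096 environments at once on a single core with a random policy.
env = BatchedRandomWalkEnv(num_envs=4096)
obs = env.reset()
for _ in range(100):
    actions = env.rng.integers(0, 2, size=env.num_envs)
    obs, reward, done = env.step(actions)
```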