Starred repositories
Reverse-Engineered Reasoning for Open-Ended Generation
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA (+ more DSLs)
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models (see the INT8 QK^T sketch after this list)
A machine learning compiler for GPUs, CPUs, and ML accelerators
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA 🎉
Collection of benchmarks to measure basic GPU capabilities
Resources on building LLM applications with the RAG pattern
LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek-R1, Qwen3, Gemma 3, TTS 2x faster with 70% less VRAM.
LLM notes, including model inference, Transformer model structure, and LLM framework code analysis
Official code repo for the O'Reilly Book - "Hands-On Large Language Models"
How to learn PyTorch and OneFlow
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
PyTorch native quantization and sparsity for training and inference
Development repository for the Triton language and compiler
A high performance and generic framework for distributed DNN training
YuE: open full-song music generation foundation model, similar to Suno.ai but open source
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference
Ring attention implementation with FlashAttention (see the ring-attention sketch after this list)
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
AIInfra (AI infrastructure) covers the AI systems stack from the underlying hardware, such as chips, up through the software layers that support training and inference of large AI models.
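For the quantized-attention entry above, a minimal sketch of the generic idea: compute the QK^T scores in INT8 with per-row scales and dequantize afterward. This is only an illustration of INT8 score computation, not SageAttention's actual scheme (which also smooths K and handles the PV product with its own low-precision path); `quantize_int8` and `int8_qk_scores` are hypothetical names, assuming PyTorch.

```python
import torch

def quantize_int8(x):
    # Per-row symmetric quantization: map each row's max |value| to 127.
    # Assumes no all-zero rows (the scale would be 0).
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def int8_qk_scores(q, k):
    # Matmul in the integer domain (emulated here in int32), then
    # dequantize with the outer product of the Q and K row scales.
    q8, sq = quantize_int8(q)
    k8, sk = quantize_int8(k)
    s = (q8.to(torch.int32) @ k8.to(torch.int32).T).float()
    return s * (sq * sk.T)

# Quantized scores closely track the FP32 reference.
q, k = torch.randn(8, 16), torch.randn(8, 16)
print((int8_qk_scores(q, k) - q @ k.T).abs().max())  # small quantization error
```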
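And for the ring-attention entry, a single-process simulation of the communication pattern: each "host" keeps one query chunk while key/value chunks rotate around the ring, and per-block partial results are merged with the online-softmax accumulation that FlashAttention uses. Names like `ring_attention_sim` and `num_hosts` are illustrative, not the repo's API; the sequence length is assumed divisible by `num_hosts`.

```python
import torch

def ring_attention_sim(q, k, v, num_hosts):
    # q, k, v: (seq_len, head_dim), split along seq_len into num_hosts chunks.
    d = q.shape[-1]
    q_chunks = q.chunk(num_hosts)
    k_chunks, v_chunks = list(k.chunk(num_hosts)), list(v.chunk(num_hosts))
    outs = []
    for i, qi in enumerate(q_chunks):
        # Running max, normalizer, and weighted-value accumulator
        # for online-softmax merging of per-block results.
        m = torch.full((qi.shape[0], 1), float("-inf"))
        l = torch.zeros(qi.shape[0], 1)
        acc = torch.zeros_like(qi)
        for step in range(num_hosts):
            j = (i + step) % num_hosts  # K/V block arriving on this ring step
            s = qi @ k_chunks[j].T / d ** 0.5
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            p = torch.exp(s - m_new)
            rescale = torch.exp(m - m_new)  # re-normalize earlier blocks
            l = l * rescale + p.sum(dim=-1, keepdim=True)
            acc = acc * rescale + p @ v_chunks[j]
            m = m_new
        outs.append(acc / l)
    return torch.cat(outs)

# Sanity check against full softmax attention.
q, k, v = (torch.randn(8, 4) for _ in range(3))
ref = torch.softmax(q @ k.T / 4 ** 0.5, dim=-1) @ v
assert torch.allclose(ring_attention_sim(q, k, v, num_hosts=4), ref, atol=1e-5)
```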