Stars
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA (+ more DSLs)
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
A Large-Scale Computation Graph Database for Tensor Compiler Research
A Torch model extraction tool that helps build Torch unit-test files.
RbRe145 / vllm
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
A Datacenter Scale Distributed Inference Serving Framework
Minimalistic large language model 3D-parallelism training
Fast and memory-efficient exact attention
My learning notes/codes for ML SYS.
Extends OpenRLHF to support LMM RL training, reproducing DeepSeek-R1 on multimodal tasks.
AIInfra (AI Infrastructure) covers the AI system stack from the underlying hardware, such as chips, up through the software stack that supports training and inference of large AI models.
🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSy…
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Tutorials for writing high-performance GPU operators in AI frameworks.
《Machine Learning Systems: Design and Implementation》- Chinese Version
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
A CUDA tutorial for learning CUDA programming from scratch.
Learning materials for Stanford CS149 : Parallel Computing