Starred repositories
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
Zhejiang University Graduation Thesis LaTeX Template
KV cache store for distributed LLM inference
AIInfra (AI infrastructure) refers to the full AI system stack, from underlying hardware such as chips up through the software layers that support training and inference of large AI models.
ModelScope: bring the notion of Model-as-a-Service to life.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
Tensors and Dynamic neural networks in Python with strong GPU acceleration
DeepEP: an efficient expert-parallel communication library
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Fast and memory-efficient exact attention
A Datacenter Scale Distributed Inference Serving Framework
Fast O(1) offset allocator with minimal fragmentation
Kimi K2 is the large language model series developed by Moonshot AI team
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations