Stars
MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
Dynamic Memory Management for Serving LLMs without PagedAttention
Virtual whiteboard for sketching hand-drawn-style diagrams
MoBA: Mixture of Block Attention for Long-Context LLMs
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
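A minimal usage sketch of LMDeploy's documented `pipeline` entry point; the model ID and prompts below are illustrative:

```python
# Minimal LMDeploy sketch; assumes `pip install lmdeploy` and a CUDA GPU.
# The model ID is illustrative; any supported Hugging Face model works.
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat")
responses = pipe(["Hello, who are you?", "Summarize attention in one sentence."])
for r in responses:
    print(r.text)
```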
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention achieving a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
Convert .ninja_log files to Chrome's about:tracing format.
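The conversion is mechanical; a minimal sketch (not the linked tool's code), assuming the common v5 log layout of tab-separated start_ms, end_ms, mtime, output, command-hash fields:

```python
# Sketch: .ninja_log (v5 layout assumed) -> Chrome about:tracing JSON.
# Not the linked tool's code; real converters also pack jobs onto thread lanes.
import json
import sys

def convert(path):
    events = []
    with open(path) as f:
        for line in f:
            if line.startswith("#"):  # e.g. the "# ninja log v5" header
                continue
            start_ms, end_ms, _mtime, output, _hash = line.rstrip("\n").split("\t")
            events.append({
                "name": output,
                "ph": "X",                                   # "complete" event
                "ts": int(start_ms) * 1000,                  # trace format uses microseconds
                "dur": (int(end_ms) - int(start_ms)) * 1000,
                "pid": 0,
                "tid": 0,
            })
    return events

if __name__ == "__main__":
    json.dump({"traceEvents": convert(sys.argv[1])}, sys.stdout)
```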
Performance benchmarks of the C++ interfaces of FlashAttention and FlashAttention 2 in large language model (LLM) inference scenarios.
Fast and memory-efficient exact attention
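A usage sketch of the package's main Python entry point; shapes follow the documented (batch, seqlen, nheads, headdim) convention in fp16/bf16 on CUDA:

```python
# FlashAttention usage sketch; assumes `pip install flash-attn` and a CUDA GPU.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Exact attention computed tile by tile, so the full seqlen x seqlen score
# matrix is never materialized in GPU memory.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```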
Demo of a 3D convolution implementation in C++ using cuDNN.
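For reference, the same operation expressed in PyTorch, which dispatches to cuDNN on CUDA tensors; the linked demo makes the raw C++ cuDNN calls directly:

```python
# 3D convolution via PyTorch; on CUDA tensors this routes to cuDNN under the
# hood, whereas the linked demo calls the cuDNN C++ API itself.
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 16, 64, 64)   # (N, C, D, H, W): e.g. a short video clip
w = torch.randn(8, 3, 3, 3, 3)      # (out_C, in_C, kD, kH, kW)
y = F.conv3d(x, w, padding=1)       # padding=1 keeps D/H/W with a 3x3x3 kernel
print(y.shape)  # torch.Size([1, 8, 16, 64, 64])
```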
CV-CUDA™ is an open-source, GPU accelerated library for cloud-scale image processing and computer vision.
PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
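A compilation sketch using the project's `torch_tensorrt.compile` entry point; the model and input shape are illustrative:

```python
# Torch-TensorRT sketch; assumes `pip install torch-tensorrt` and an NVIDIA GPU.
# The ResNet model and 1x3x224x224 input shape are illustrative.
import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet18(weights=None).eval().cuda()
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half},  # let TensorRT select FP16 kernels
)
out = trt_model(torch.randn(1, 3, 224, 224, device="cuda"))
```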
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
FlashInfer: Kernel Library for LLM Serving
The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
llama.cpp fork with additional SOTA quants and improved performance
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
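A concept sketch of what "fine-grained scaling" means (one scale per small block of elements rather than per tensor), written in plain PyTorch rather than DeepGEMM's API:

```python
# Fine-grained FP8 scaling, concept only: one scale per 128-element block
# instead of one per tensor. Plain PyTorch; not DeepGEMM's API.
import torch

def quantize_fp8_blockwise(x: torch.Tensor, block: int = 128):
    m, k = x.shape                     # assumes k is divisible by `block`
    xb = x.view(m, k // block, block)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / 448.0               # 448 = largest normal float8_e4m3fn value
    q = (xb / scale).to(torch.float8_e4m3fn)
    return q.view(m, k), scale.squeeze(-1)  # dequantize: q.float() * per-block scale

x = torch.randn(4, 256)
q, s = quantize_fp8_blockwise(x)
print(q.dtype, s.shape)  # torch.float8_e4m3fn torch.Size([4, 2])
```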
FlashMLA: Efficient Multi-head Latent Attention Kernels
The official implementation of the EMNLP 2023 paper LLM-FP4
LLM notes covering model inference, Transformer model structure, and LLM framework code analysis.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
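The one-line description in action: a tensor on the GPU (if available) and a dynamically built autograd graph:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(3, 3, device=device, requires_grad=True)
y = (x ** 2).sum()    # the graph is recorded dynamically as ops execute
y.backward()
print(torch.allclose(x.grad, 2 * x))  # d/dx sum(x^2) = 2x -> True
```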
Tile primitives for speedy kernels
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.