Stars
[NeurIPS 2025] Multipole Attention for Efficient Long Context Reasoning
[Survey] Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)
"AI-Trader: Can AI Beat the Market?" Live Trading Bench: https://ai4trade.ai
Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Open Source Continuous Inference Benchmarking - GB200 NVL72 vs MI355X vs B200 vs H200 vs MI325X & soon™ TPUv6e/v7/Trainium2/3/GB300 NVL72 - DeepSeek 670B MoE, GPTOSS
Disaggregated serving system for Large Language Models (LLMs).
Analyze the inference of Large Language Models (LLMs): computation, storage, transmission, and the hardware roofline model, in a user-friendly interface.
An open protocol enabling communication and interoperability between opaque agentic applications.
The evaluation framework for training-free sparse attention in LLMs
🌟 The Multi-Agent Framework: First AI Software Company, Towards Natural Language Programming
A Datacenter Scale Distributed Inference Serving Framework
A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨
🐳 | Dockerfiles for the RunPod container images used for our official templates.
CHAI is a library for dynamic pruning of attention heads for efficient LLM inference.
Development repository for the Triton language and compiler
VQ-VAEs, Gumbel-Softmaxes and friends
[SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
[ICLR 2025🔥] SVD-LLM & [NAACL 2025🔥] SVD-LLM V2
QLoRA: Efficient Finetuning of Quantized LLMs
VPTQ, a flexible and extreme low-bit quantization algorithm
Awesome LLM compression research papers and tools.