Stars
A Datacenter Scale Distributed Inference Serving Framework
A simple C++11 Thread Pool implementation
User-friendly Desktop Client App for AI Models/LLMs (GPT, Claude, Gemini, Ollama...)
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
FlashInfer: Kernel Library for LLM Serving
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
Accessible large language models via k-bit quantization for PyTorch.
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
Official inference framework for 1-bit LLMs
SGLang is a fast serving framework for large language models and vision language models.
List of papers related to neural network quantization in recent AI conferences and journals.
An experimental CPU backend for Triton (https://github.com/openai/triton)
Development repository for the Triton language and compiler
A high-throughput and memory-efficient inference and serving engine for LLMs
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
PyTorch native quantization and sparsity for training and inference
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. TensorR…
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
A Python package that extends official PyTorch to easily obtain performance gains on Intel platforms