Lists (17)
AI deployment: collect AI deployment toolkit, demo, etc.
AIGC: diffusion model, GAN, VAE, etc.
binary neural network
demoire
huge AI model
image-restoration
knowledge distillation
object tracker
Starred repositories
A unified inference and post-training framework for accelerated video generation.
FlashInfer: Kernel Library for LLM Serving
Trainable fast and memory-efficient sparse attention
If NVINT8 exists, the performance is ...
Triton implementation of FlashAttention2 that adds Custom Masks.
The evaluation framework for training-free sparse attention in LLMs
PaddleFormers is an easy-to-use library of pre-trained large language model zoo based on PaddlePaddle.
Unified KV Cache Compression Methods for Auto-Regressive Models
🚀 Efficient implementations of state-of-the-art linear attention models
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, compute the attention approximately with dynamic sparsity, which reduces inference latency by up to 10x for pre-filli…
This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
The official implementation of the EMNLP 2023 paper LLM-FP4
A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, with easy export to ONNX/ONNX Runtime.
An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs
Fused Qwen3 MoE layer for faster training, compatible with HF Transformers, LoRA, 4-bit quant, Unsloth
An open-source efficient deep learning framework/compiler, written in python.
verl: Volcano Engine Reinforcement Learning for LLMs
Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels
A high-throughput and memory-efficient inference and serving engine for LLMs