
lixcli's starred repositories

A unified inference and post-training framework for accelerated video generation.

Python · 2,870 stars · 229 forks · Updated Dec 28, 2025

Light Video Generation Inference Framework

Python · 1,577 stars · 111 forks · Updated Dec 28, 2025

FlashInfer: Kernel Library for LLM Serving

Python · 4,375 stars · 618 forks · Updated Dec 28, 2025

Trainable fast and memory-efficient sparse attention

Python · 501 stars · 46 forks · Updated Dec 27, 2025

LongBench v2 and LongBench (ACL '25 & '24)

Python · 1,051 stars · 112 forks · Updated Jan 15, 2025

The HELMET Benchmark

Jupyter Notebook · 198 stars · 37 forks · Updated Dec 4, 2025

QeRL enables RL for 32B LLMs on a single H100 GPU.

Python · 470 stars · 46 forks · Updated Nov 27, 2025

If NVINT8 exists, the performance is ...

Jupyter Notebook · 1 star · 1 fork · Updated Oct 27, 2025

Triton implementation of FlashAttention2 that adds Custom Masks.

Python · 157 stars · 15 forks · Updated Aug 14, 2024
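The custom-mask idea above amounts to adding a mask to the attention logits before the softmax. A toy NumPy sketch of masked attention — the math a FlashAttention2 kernel computes, not the Triton kernel itself (function and variable names here are illustrative):

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with an arbitrary boolean mask.

    q, k, v: (seq, dim) arrays; mask: (seq, seq) boolean, True = attend.
    A fused kernel like FlashAttention2 produces the same result without
    materializing the full (seq, seq) score matrix.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (seq, seq) logits
    scores = np.where(mask, scores, -np.inf)      # custom mask blocks pairs
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
causal = np.tril(np.ones((4, 4), dtype=bool))  # causal mask as one example
out = masked_attention(q, k, v, causal)
print(out.shape)  # (4, 8)
```

Any (seq, seq) boolean array works as `mask`, which is exactly the flexibility a "custom masks" variant adds over the built-in causal/sliding-window patterns.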

The evaluation framework for training-free sparse attention in LLMs

Python · 108 stars · 8 forks · Updated Oct 13, 2025

PaddleFormers is an easy-to-use library and zoo of pre-trained large language models, built on PaddlePaddle.

Python · 12,947 stars · 2,161 forks · Updated Dec 27, 2025
Python · 12 stars · 5 forks · Updated Oct 23, 2025

Unified KV Cache Compression Methods for Auto-Regressive Models

Python · 1,295 stars · 159 forks · Updated Jan 4, 2025
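KV cache compression methods of this kind typically score cached tokens (for example by accumulated attention mass) and evict the lowest-scoring ones while always keeping a recent window. A toy Python sketch of such an eviction policy — names and the scoring heuristic are illustrative, not this library's API:

```python
import numpy as np

def evict_kv(keys, values, attn_scores, budget, recent=4):
    """Keep at most `budget` cached tokens: the `recent` newest ones plus
    the highest-scoring older tokens (a heavy-hitter-style policy).

    keys, values: (seq, dim); attn_scores: (seq,) accumulated attention mass.
    Returns compressed keys/values and the kept indices, sorted.
    """
    seq = keys.shape[0]
    if seq <= budget:
        return keys, values, np.arange(seq)
    recent_idx = np.arange(seq - recent, seq)      # always keep recent window
    older = np.arange(seq - recent)
    top = older[np.argsort(attn_scores[older])[::-1][: budget - recent]]
    keep = np.sort(np.concatenate([top, recent_idx]))
    return keys[keep], values[keep], keep

keys = np.arange(10, dtype=float).reshape(10, 1)
values = keys.copy()
scores = np.array([9, 1, 8, 1, 1, 7, 1, 1, 1, 1], dtype=float)
k2, v2, kept = evict_kv(keys, values, scores, budget=6, recent=3)
print(kept)  # [0 2 5 7 8 9]
```

Real methods differ mainly in how `attn_scores` is estimated and whether eviction happens per head or per layer; the budget/recency skeleton is shared.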

🚀 Efficient implementations of state-of-the-art linear attention models

Python · 4,133 stars · 341 forks · Updated Dec 28, 2025
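Linear attention replaces the softmax with a kernel feature map φ so that attention can be computed in O(n) by associating (φ(Q)φ(K)ᵀ)V as φ(Q)(φ(K)ᵀV). A minimal NumPy sketch of that identity, with an assumed ReLU-style feature map (real models use learned or more elaborate maps):

```python
import numpy as np

def linear_attention(q, k, v, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """O(n*d^2) linear attention: compute phi(k)^T v once, never forming
    the (n, n) attention matrix."""
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                 # (d, dv) summary of all keys/values
    z = qf @ kf.sum(axis=0)       # (n,) per-query normalizer
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(2)
q, k, v = (rng.standard_normal((5, 4)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # (5, 4)
```

Because `kv` and the normalizer are running sums over keys, the same math supports recurrent, constant-memory decoding — the property efficient implementations exploit.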

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Python · 1,477 stars · 125 forks · Updated Dec 27, 2025
Python · 1,373 stars · 120 forks · Updated Sep 12, 2025

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, attention is computed with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x.

Python · 1,170 stars · 73 forks · Updated Sep 30, 2025
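Dynamic sparse attention of the kind described above decides, per query, which keys matter and computes only those. A toy dense-then-top-k sketch of the idea — a real implementation estimates the sparsity pattern cheaply and skips the dense pass entirely:

```python
import numpy as np

def topk_sparse_attention(q, k, v, topk=2):
    """For each query, attend only to its top-k highest-scoring keys.

    Toy version: computes dense scores and then masks the rest; practical
    dynamic sparse attention avoids ever forming the full score matrix.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (nq, nk)
    kth = np.sort(scores, axis=-1)[:, -topk][:, None]  # per-row threshold
    scores = np.where(scores >= kth, scores, -np.inf)  # drop non-top-k keys
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((6, 16)) for _ in range(3))
out = topk_sparse_attention(q, k, v, topk=2)
print(out.shape)  # (6, 16)
```

With `topk` equal to the number of keys the result reduces to ordinary dense softmax attention, which is a handy correctness check.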

This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?

Python · 1,405 stars · 117 forks · Updated Nov 13, 2025

The official implementation of the EMNLP 2023 paper LLM-FP4

Python · 219 stars · 22 forks · Updated Dec 15, 2023

A general 2-8-bit quantization toolbox supporting GPTQ/AWQ/HQQ/VPTQ, with easy export to ONNX/ONNX Runtime.

Python · 184 stars · 19 forks · Updated Apr 2, 2025
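The baseline that such low-bit toolboxes build on is symmetric round-to-nearest quantization; methods like GPTQ and AWQ add error compensation and activation-aware scaling on top. A self-contained sketch of the baseline (per-tensor scaling for brevity; real toolboxes quantize per channel or per group):

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization with a single scale.

    Maps weights to integers in [-(2^(bits-1)-1), 2^(bits-1)-1];
    returns (integer codes, float scale).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.9, -0.44, 0.1, -0.02], dtype=np.float32)
q, s = quantize_rtn(w, bits=4)
w_hat = dequantize(q, s)
print(q)      # 4-bit integer codes in [-7, 7]
print(w_hat)  # reconstruction; RTN error is at most half a quantization step
```

Lower `bits` shrinks storage 4-8x versus fp16 but widens the step size `scale`, which is exactly the accuracy/size trade-off the fancier calibration methods try to soften.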

An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs

Python · 609 stars · 66 forks · Updated Dec 26, 2025

Fused Qwen3 MoE layer for faster training, compatible with HF Transformers, LoRA, 4-bit quant, Unsloth

Python · 217 stars · 10 forks · Updated Dec 27, 2025
Python · 575 stars · 50 forks · Updated Oct 29, 2024
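A fused MoE layer combines per-token top-k expert routing with the expert feed-forward networks in one kernel. A toy, unfused reference of the routing math such kernels accelerate — the gating scheme and single-matrix "experts" here are simplifications, not Qwen3's actual layer:

```python
import numpy as np

def moe_forward(x, gate_w, experts, topk=2):
    """Route each token to its top-k experts and mix their outputs by
    softmax-renormalized gate scores (slow reference implementation)."""
    logits = x @ gate_w                           # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -topk:]  # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        g = logits[t, top[t]]
        g = np.exp(g - g.max())
        g /= g.sum()                              # renormalize over top-k only
        for weight, e in zip(g, top[t]):
            # toy expert: one ReLU-activated square matrix per expert
            out[t] += weight * np.maximum(x[t] @ experts[e], 0)
    return out

rng = np.random.default_rng(3)
x = rng.standard_normal((5, 8))
gate_w = rng.standard_normal((8, 4))
experts = rng.standard_normal((4, 8, 8))
out = moe_forward(x, gate_w, experts, topk=2)
print(out.shape)  # (5, 8)
```

The per-token Python loop is what fused kernels eliminate: they group tokens by expert and run each expert's GEMM once over its batch.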

An open-source, efficient deep learning framework/compiler, written in Python.

Python · 737 stars · 68 forks · Updated Sep 4, 2025
Python · 160 stars · 17 forks · Updated Jun 22, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python · 17,845 stars · 2,919 forks · Updated Dec 28, 2025

A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels.

C++ · 4,317 stars · 359 forks · Updated Dec 27, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python · 66,329 stars · 12,240 forks · Updated Dec 28, 2025