Stars
MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
Dynamic Memory Management for Serving LLMs without PagedAttention
Virtual whiteboard for sketching hand-drawn-style diagrams
MoBA: Mixture of Block Attention for Long-Context LLMs
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
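A minimal usage sketch of LMDeploy's documented `pipeline` entry point; the model ID and prompts below are illustrative:

```python
# Minimal LMDeploy sketch; assumes `pip install lmdeploy` and a CUDA GPU.
# The model ID is illustrative; any supported Hugging Face model works.
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat")
responses = pipe(["Hello, who are you?", "Summarize attention in one sentence."])
for r in responses:
    print(r.text)
```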
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention achieving a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
Convert .ninja_log files to Chrome's about:tracing format.
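The conversion is mechanical; a minimal sketch (not the linked tool's code), assuming the common v5 log layout of tab-separated start_ms, end_ms, mtime, output, command-hash fields:

```python
# Sketch: .ninja_log (v5 layout assumed) -> Chrome about:tracing JSON.
# Not the linked tool's code; real converters also pack jobs onto thread lanes.
import json
import sys

def convert(path):
    events = []
    with open(path) as f:
        for line in f:
            if line.startswith("#"):  # e.g. the "# ninja log v5" header
                continue
            start_ms, end_ms, _mtime, output, _hash = line.rstrip("\n").split("\t")
            events.append({
                "name": output,
                "ph": "X",                                   # "complete" event
                "ts": int(start_ms) * 1000,                  # trace format uses microseconds
                "dur": (int(end_ms) - int(start_ms)) * 1000,
                "pid": 0,
                "tid": 0,
            })
    return events

if __name__ == "__main__":
    json.dump({"traceEvents": convert(sys.argv[1])}, sys.stdout)
```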
Performance benchmarks of the C++ interfaces of FlashAttention and FlashAttention 2 in large language model (LLM) inference scenarios.
Fast and memory-efficient exact attention
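A usage sketch of the package's main Python entry point; shapes follow the documented (batch, seqlen, nheads, headdim) convention in fp16/bf16 on CUDA:

```python
# FlashAttention usage sketch; assumes `pip install flash-attn` and a CUDA GPU.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Exact attention computed tile by tile, so the full seqlen x seqlen score
# matrix is never materialized in GPU memory.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```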
Demo of a 3D convolution implementation in C++ using cuDNN.
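For reference, the same operation expressed in PyTorch, which dispatches to cuDNN on CUDA tensors; the linked demo makes the raw C++ cuDNN calls directly:

```python
# 3D convolution via PyTorch; on CUDA tensors this routes to cuDNN under the
# hood, whereas the linked demo calls the cuDNN C++ API itself.
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 16, 64, 64)   # (N, C, D, H, W): e.g. a short video clip
w = torch.randn(8, 3, 3, 3, 3)      # (out_C, in_C, kD, kH, kW)
y = F.conv3d(x, w, padding=1)       # padding=1 keeps D/H/W with a 3x3x3 kernel
print(y.shape)  # torch.Size([1, 8, 16, 64, 64])
```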
CV-CUDA™ is an open-source, GPU accelerated library for cloud-scale image processing and computer vision.
PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
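A compilation sketch using the project's `torch_tensorrt.compile` entry point; the model and input shape are illustrative:

```python
# Torch-TensorRT sketch; assumes `pip install torch-tensorrt` and an NVIDIA GPU.
# The ResNet model and 1x3x224x224 input shape are illustrative.
import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet18(weights=None).eval().cuda()
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half},  # let TensorRT select FP16 kernels
)
out = trt_model(torch.randn(1, 3, 224, 224, device="cuda"))
```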
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
FlashInfer: Kernel Library for LLM Serving
The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
llama.cpp fork with additional SOTA quants and improved performance
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
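A concept sketch of what "fine-grained scaling" means (one scale per small block of elements rather than per tensor), written in plain PyTorch rather than DeepGEMM's API:

```python
# Fine-grained FP8 scaling, concept only: one scale per 128-element block
# instead of one per tensor. Plain PyTorch; not DeepGEMM's API.
import torch

def quantize_fp8_blockwise(x: torch.Tensor, block: int = 128):
    m, k = x.shape                     # assumes k is divisible by `block`
    xb = x.view(m, k // block, block)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / 448.0               # 448 = largest normal float8_e4m3fn value
    q = (xb / scale).to(torch.float8_e4m3fn)
    return q.view(m, k), scale.squeeze(-1)  # dequantize: q.float() * per-block scale

x = torch.randn(4, 256)
q, s = quantize_fp8_blockwise(x)
print(q.dtype, s.shape)  # torch.float8_e4m3fn torch.Size([4, 2])
```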
FlashMLA: Efficient Multi-head Latent Attention Kernels
The official implementation of the EMNLP 2023 paper LLM-FP4
LLM notes covering model inference, Transformer model structure, and LLM framework code analysis.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
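The one-line description in action: a tensor on the GPU (if available) and a dynamically built autograd graph:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(3, 3, device=device, requires_grad=True)
y = (x ** 2).sum()    # the graph is recorded dynamically as ops execute
y.backward()
print(torch.allclose(x.grad, 2 * x))  # d/dx sum(x^2) = 2x -> True
```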
Tile primitives for speedy kernels
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.