Starred repositories

A unified inference and post-training framework for accelerated video generation.

Python · 2,834 stars · 226 forks · Updated Dec 20, 2025

Light Video Generation Inference Framework

Python · 1,240 stars · 78 forks · Updated Dec 19, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda · 4,312 stars · 606 forks · Updated Dec 20, 2025

Trainable fast and memory-efficient sparse attention

Python · 482 stars · 46 forks · Updated Dec 19, 2025

LongBench v2 and LongBench (ACL '25 & '24)

Python · 1,046 stars · 111 forks · Updated Jan 15, 2025

The HELMET Benchmark

Jupyter Notebook · 195 stars · 36 forks · Updated Dec 4, 2025

QeRL enables RL for 32B LLMs on a single H100 GPU.

Python · 468 stars · 44 forks · Updated Nov 27, 2025

If NVINT8 exists, the performance is ...

Jupyter Notebook · 1 star · 1 fork · Updated Oct 27, 2025

Triton implementation of FlashAttention-2 that adds support for custom masks.

Python · 157 stars · 15 forks · Updated Aug 14, 2024

The evaluation framework for training-free sparse attention in LLMs

Python · 106 stars · 8 forks · Updated Oct 13, 2025

PaddleFormers is an easy-to-use library of pre-trained large language models built on PaddlePaddle.

Python · 12,947 stars · 2,159 forks · Updated Dec 19, 2025
Python · 12 stars · 5 forks · Updated Oct 23, 2025

Unified KV Cache Compression Methods for Auto-Regressive Models

Python · 1,292 stars · 159 forks · Updated Jan 4, 2025

🚀 Efficient implementations of state-of-the-art linear attention models

Python · 4,087 stars · 333 forks · Updated Dec 20, 2025

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Python · 1,439 stars · 121 forks · Updated Dec 20, 2025
Python · 1,367 stars · 120 forks · Updated Sep 12, 2025

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing inference latency by up to 10x for pre-filling.

Python · 1,166 stars · 72 forks · Updated Sep 30, 2025

This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?

Python · 1,396 stars · 118 forks · Updated Nov 13, 2025

The official implementation of the EMNLP 2023 paper LLM-FP4

Python · 219 stars · 22 forks · Updated Dec 15, 2023

A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ and easy export to ONNX / ONNX Runtime.

Python · 184 stars · 19 forks · Updated Apr 2, 2025

An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs

Python · 604 stars · 65 forks · Updated Dec 13, 2025

Fused Qwen3 MoE layer for faster training, compatible with HF Transformers, LoRA, 4-bit quantization, and Unsloth.

Python · 217 stars · 10 forks · Updated Nov 6, 2025
Python · 574 stars · 50 forks · Updated Oct 29, 2024

An open-source, efficient deep learning framework and compiler, written in Python.

Python · 737 stars · 68 forks · Updated Sep 4, 2025
Python · 159 stars · 17 forks · Updated Jun 22, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python · 17,640 stars · 2,856 forks · Updated Dec 20, 2025

A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels.

C++ · 4,262 stars · 350 forks · Updated Dec 19, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python · 65,817 stars · 12,086 forks · Updated Dec 20, 2025