
lixcli's starred repositories

A unified inference and post-training framework for accelerated video generation.

Python · 2,870 stars · 229 forks · Updated Dec 28, 2025

Light Video Generation Inference Framework

Python · 1,577 stars · 111 forks · Updated Dec 28, 2025

FlashInfer: Kernel Library for LLM Serving

Python · 4,375 stars · 618 forks · Updated Dec 28, 2025

Trainable fast and memory-efficient sparse attention

Python · 501 stars · 46 forks · Updated Dec 27, 2025

LongBench v2 and LongBench (ACL '25 & '24)

Python · 1,051 stars · 112 forks · Updated Jan 15, 2025

The HELMET Benchmark

Jupyter Notebook · 198 stars · 37 forks · Updated Dec 4, 2025

QeRL enables RL for 32B LLMs on a single H100 GPU.

Python · 470 stars · 46 forks · Updated Nov 27, 2025

If NVINT8 exists, the performance is ...

Jupyter Notebook · 1 star · 1 fork · Updated Oct 27, 2025

Triton implementation of FlashAttention2 that adds Custom Masks.

Python · 157 stars · 15 forks · Updated Aug 14, 2024
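The custom-mask idea above amounts to adding a mask to the attention logits before the softmax. A toy NumPy sketch of masked attention — the math a FlashAttention2 kernel computes, not the Triton kernel itself (function and variable names here are illustrative):

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with an arbitrary boolean mask.

    q, k, v: (seq, dim) arrays; mask: (seq, seq) boolean, True = attend.
    A fused kernel like FlashAttention2 produces the same result without
    materializing the full (seq, seq) score matrix.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (seq, seq) logits
    scores = np.where(mask, scores, -np.inf)      # custom mask blocks pairs
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
causal = np.tril(np.ones((4, 4), dtype=bool))  # causal mask as one example
out = masked_attention(q, k, v, causal)
print(out.shape)  # (4, 8)
```

Any (seq, seq) boolean array works as `mask`, which is exactly the flexibility a "custom masks" variant adds over the built-in causal/sliding-window patterns.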

The evaluation framework for training-free sparse attention in LLMs

Python · 108 stars · 8 forks · Updated Oct 13, 2025

PaddleFormers is an easy-to-use library and zoo of pre-trained large language models, built on PaddlePaddle.

Python · 12,947 stars · 2,161 forks · Updated Dec 27, 2025
Python · 12 stars · 5 forks · Updated Oct 23, 2025

Unified KV Cache Compression Methods for Auto-Regressive Models

Python · 1,295 stars · 159 forks · Updated Jan 4, 2025
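KV cache compression methods of this kind typically score cached tokens (for example by accumulated attention mass) and evict the lowest-scoring ones while always keeping a recent window. A toy Python sketch of such an eviction policy — names and the scoring heuristic are illustrative, not this library's API:

```python
import numpy as np

def evict_kv(keys, values, attn_scores, budget, recent=4):
    """Keep at most `budget` cached tokens: the `recent` newest ones plus
    the highest-scoring older tokens (a heavy-hitter-style policy).

    keys, values: (seq, dim); attn_scores: (seq,) accumulated attention mass.
    Returns compressed keys/values and the kept indices, sorted.
    """
    seq = keys.shape[0]
    if seq <= budget:
        return keys, values, np.arange(seq)
    recent_idx = np.arange(seq - recent, seq)      # always keep recent window
    older = np.arange(seq - recent)
    top = older[np.argsort(attn_scores[older])[::-1][: budget - recent]]
    keep = np.sort(np.concatenate([top, recent_idx]))
    return keys[keep], values[keep], keep

keys = np.arange(10, dtype=float).reshape(10, 1)
values = keys.copy()
scores = np.array([9, 1, 8, 1, 1, 7, 1, 1, 1, 1], dtype=float)
k2, v2, kept = evict_kv(keys, values, scores, budget=6, recent=3)
print(kept)  # [0 2 5 7 8 9]
```

Real methods differ mainly in how `attn_scores` is estimated and whether eviction happens per head or per layer; the budget/recency skeleton is shared.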

🚀 Efficient implementations of state-of-the-art linear attention models

Python · 4,133 stars · 341 forks · Updated Dec 28, 2025
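Linear attention replaces the softmax with a kernel feature map φ so that attention can be computed in O(n) by associating (φ(Q)φ(K)ᵀ)V as φ(Q)(φ(K)ᵀV). A minimal NumPy sketch of that identity, with an assumed ReLU-style feature map (real models use learned or more elaborate maps):

```python
import numpy as np

def linear_attention(q, k, v, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """O(n*d^2) linear attention: compute phi(k)^T v once, never forming
    the (n, n) attention matrix."""
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                 # (d, dv) summary of all keys/values
    z = qf @ kf.sum(axis=0)       # (n,) per-query normalizer
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(2)
q, k, v = (rng.standard_normal((5, 4)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # (5, 4)
```

Because `kv` and the normalizer are running sums over keys, the same math supports recurrent, constant-memory decoding — the property efficient implementations exploit.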

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Python · 1,477 stars · 125 forks · Updated Dec 27, 2025
Python · 1,373 stars · 120 forks · Updated Sep 12, 2025

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, attention is computed with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x.

Python · 1,170 stars · 73 forks · Updated Sep 30, 2025
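Dynamic sparse attention of the kind described above decides, per query, which keys matter and computes only those. A toy dense-then-top-k sketch of the idea — a real implementation estimates the sparsity pattern cheaply and skips the dense pass entirely:

```python
import numpy as np

def topk_sparse_attention(q, k, v, topk=2):
    """For each query, attend only to its top-k highest-scoring keys.

    Toy version: computes dense scores and then masks the rest; practical
    dynamic sparse attention avoids ever forming the full score matrix.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (nq, nk)
    kth = np.sort(scores, axis=-1)[:, -topk][:, None]  # per-row threshold
    scores = np.where(scores >= kth, scores, -np.inf)  # drop non-top-k keys
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((6, 16)) for _ in range(3))
out = topk_sparse_attention(q, k, v, topk=2)
print(out.shape)  # (6, 16)
```

With `topk` equal to the number of keys the result reduces to ordinary dense softmax attention, which is a handy correctness check.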

This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?

Python · 1,405 stars · 117 forks · Updated Nov 13, 2025

The official implementation of the EMNLP 2023 paper LLM-FP4

Python · 219 stars · 22 forks · Updated Dec 15, 2023

A general 2-8-bit quantization toolbox supporting GPTQ/AWQ/HQQ/VPTQ, with easy export to ONNX/ONNX Runtime.

Python · 184 stars · 19 forks · Updated Apr 2, 2025
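The baseline that such low-bit toolboxes build on is symmetric round-to-nearest quantization; methods like GPTQ and AWQ add error compensation and activation-aware scaling on top. A self-contained sketch of the baseline (per-tensor scaling for brevity; real toolboxes quantize per channel or per group):

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization with a single scale.

    Maps weights to integers in [-(2^(bits-1)-1), 2^(bits-1)-1];
    returns (integer codes, float scale).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.9, -0.44, 0.1, -0.02], dtype=np.float32)
q, s = quantize_rtn(w, bits=4)
w_hat = dequantize(q, s)
print(q)      # 4-bit integer codes in [-7, 7]
print(w_hat)  # reconstruction; RTN error is at most half a quantization step
```

Lower `bits` shrinks storage 4-8x versus fp16 but widens the step size `scale`, which is exactly the accuracy/size trade-off the fancier calibration methods try to soften.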

An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs

Python · 609 stars · 66 forks · Updated Dec 26, 2025

Fused Qwen3 MoE layer for faster training, compatible with HF Transformers, LoRA, 4-bit quant, Unsloth

Python · 217 stars · 10 forks · Updated Dec 27, 2025
Python · 575 stars · 50 forks · Updated Oct 29, 2024
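A fused MoE layer combines per-token top-k expert routing with the expert feed-forward networks in one kernel. A toy, unfused reference of the routing math such kernels accelerate — the gating scheme and single-matrix "experts" here are simplifications, not Qwen3's actual layer:

```python
import numpy as np

def moe_forward(x, gate_w, experts, topk=2):
    """Route each token to its top-k experts and mix their outputs by
    softmax-renormalized gate scores (slow reference implementation)."""
    logits = x @ gate_w                           # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -topk:]  # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        g = logits[t, top[t]]
        g = np.exp(g - g.max())
        g /= g.sum()                              # renormalize over top-k only
        for weight, e in zip(g, top[t]):
            # toy expert: one ReLU-activated square matrix per expert
            out[t] += weight * np.maximum(x[t] @ experts[e], 0)
    return out

rng = np.random.default_rng(3)
x = rng.standard_normal((5, 8))
gate_w = rng.standard_normal((8, 4))
experts = rng.standard_normal((4, 8, 8))
out = moe_forward(x, gate_w, experts, topk=2)
print(out.shape)  # (5, 8)
```

The per-token Python loop is what fused kernels eliminate: they group tokens by expert and run each expert's GEMM once over its batch.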

An open-source, efficient deep learning framework/compiler, written in Python.

Python · 737 stars · 68 forks · Updated Sep 4, 2025
Python · 160 stars · 17 forks · Updated Jun 22, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python · 17,845 stars · 2,919 forks · Updated Dec 28, 2025

A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels.

C++ · 4,317 stars · 359 forks · Updated Dec 27, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python · 66,329 stars · 12,240 forks · Updated Dec 28, 2025