
Starred repositories

A unified inference and post-training framework for accelerated video generation.

Python 2,880 230 Updated Dec 30, 2025

Light Image Video Generation Inference Framework

Python 1,614 114 Updated Dec 30, 2025

FlashInfer: Kernel Library for LLM Serving

Python 4,390 618 Updated Dec 30, 2025

Trainable fast and memory-efficient sparse attention

Python 505 47 Updated Dec 27, 2025

LongBench v2 and LongBench (ACL '25 & '24)

Python 1,052 114 Updated Jan 15, 2025

The HELMET Benchmark

Jupyter Notebook 197 37 Updated Dec 4, 2025

QeRL enables RL for 32B LLMs on a single H100 GPU.

Python 470 46 Updated Nov 27, 2025

If NVINT8 exists, the performance is ...

Jupyter Notebook 1 1 Updated Oct 27, 2025

A Triton implementation of FlashAttention-2 that adds custom masks.

Python 158 15 Updated Aug 14, 2024

The evaluation framework for training-free sparse attention in LLMs

Python 108 8 Updated Oct 13, 2025

PaddleFormers is an easy-to-use library of pre-trained large language models, built on PaddlePaddle.

Python 12,950 2,160 Updated Dec 30, 2025

Unified KV Cache Compression Methods for Auto-Regressive Models

Python 1,295 159 Updated Jan 4, 2025

🚀 Efficient implementations of state-of-the-art linear attention models

Python 4,148 345 Updated Dec 30, 2025

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Python 1,489 125 Updated Dec 30, 2025

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing inference latency by up to 10x for pre-filling.

Python 1,170 73 Updated Sep 30, 2025

This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?

Python 1,406 117 Updated Nov 13, 2025

The official implementation of the EMNLP 2023 paper LLM-FP4

Python 219 22 Updated Dec 15, 2023

A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, with easy export to ONNX/ONNX Runtime.

Python 184 19 Updated Apr 2, 2025

An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs

Python 610 66 Updated Dec 26, 2025

Fused Qwen3 MoE layer for faster training, compatible with HF Transformers, LoRA, 4-bit quant, Unsloth

Python 219 10 Updated Dec 27, 2025

An open-source efficient deep learning framework/compiler, written in python.

Python 737 68 Updated Sep 4, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python 17,916 2,931 Updated Dec 30, 2025

A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels.

C++ 4,338 365 Updated Dec 30, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 66,538 12,294 Updated Dec 30, 2025