Sun Yat-sen University
- Guangzhou (UTC +08:00)
- https://gty111.github.io/info/
- https://orcid.org/0009-0005-2979-4486
Lists (19)
AI
Benchmark
Compiler & DSL
CV & CG
Diffusion
Framework
Hardware
HPC
Instrumentation & Reverse & Assemble
LAB
Math
NLP
Operating Systems
Recommendation
ROCM
Simulators
Template & Theme
Tools
Tutorial & Examples
Stars
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
FlashMLA: Efficient Multi-head Latent Attention Kernels
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Official PyTorch implementation for "Large Language Diffusion Models"
Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes
Achieve state-of-the-art inference performance with modern accelerators on Kubernetes
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
Kimi K2 is the large language model series developed by the Moonshot AI team
verl: Volcano Engine Reinforcement Learning for LLMs
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Supercharge Your LLM with the Fastest KV Cache Layer
Distributed Compiler based on Triton for Parallel Systems
The official code for the paper: LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
MMaDA - Open-Sourced Multimodal Large Diffusion Language Models
A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.
Analyze computation-communication overlap in V3/R1.
DeepEP: an efficient expert-parallel communication library
A Flexible Framework for Experiencing Heterogeneous LLM Inference/Fine-tune Optimizations