
MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning

36 stars · 3 forks · Updated Dec 30, 2025

Dynamic Memory Management for Serving LLMs without PagedAttention

C · 454 stars · 35 forks · Updated May 30, 2025

Virtual whiteboard for sketching hand-drawn like diagrams

TypeScript · 114,241 stars · 12,123 forks · Updated Jan 12, 2026

Extremely fast non-cryptographic hash algorithm

C · 10,723 stars · 878 forks · Updated Dec 17, 2025

MoBA: Mixture of Block Attention for Long-Context LLMs

Python · 2,033 stars · 129 forks · Updated Apr 3, 2025

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python · 7,512 stars · 646 forks · Updated Jan 12, 2026

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention that achieves 2-5x speedups over FlashAttention without losing end-to-end metrics across language, image, and video models.

Cuda · 3,038 stars · 311 forks · Updated Dec 22, 2025

Convert .ninja_log files to chrome's about:tracing format.

Python · 498 stars · 53 forks · Updated Jun 5, 2024

Benchmarks of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.

C++ · 44 stars · 6 forks · Updated Feb 27, 2025

Fast and memory-efficient exact attention

Python · 21,558 stars · 2,276 forks · Updated Jan 12, 2026

3D convolution cuDNN C++ implementation demo

C++ · 7 stars · Updated Feb 5, 2020

Main libjpeg-turbo repository

C · 4,180 stars · 1,116 forks · Updated Jan 12, 2026

CV-CUDA™ is an open-source, GPU-accelerated library for cloud-scale image processing and computer vision.

C++ · 2,634 stars · 248 forks · Updated Nov 15, 2025

Nano vLLM

Python · 10,708 stars · 1,371 forks · Updated Nov 3, 2025

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT

Python · 2,921 stars · 379 forks · Updated Jan 12, 2026

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

C++ · 1,000 stars · 139 forks · Updated Jan 12, 2026

FlashInfer: Kernel Library for LLM Serving

Python · 4,607 stars · 641 forks · Updated Jan 12, 2026

BLAS-like Library Instantiation Software Framework

C · 2,589 stars · 406 forks · Updated Nov 11, 2025

The Compute Library is a set of computer vision and machine learning functions optimized for both Arm CPUs and GPUs using SIMD technologies.

C++ · 3,101 stars · 813 forks · Updated Jan 9, 2026

Tengine GEMM tutorial, step by step

C · 13 stars · 2 forks · Updated Mar 12, 2021

llama.cpp fork with additional SOTA quants and improved performance

C++ · 1,487 stars · 181 forks · Updated Jan 12, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 6,054 stars · 792 forks · Updated Jan 6, 2026

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ · 11,964 stars · 926 forks · Updated Dec 15, 2025

The official implementation of the EMNLP 2023 paper LLM-FP4.

Python · 219 stars · 22 forks · Updated Dec 15, 2023

LLM notes covering model inference, transformer model structure, and LLM framework code analysis.

Python · 858 stars · 88 forks · Updated Dec 10, 2025

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Python · 96,561 stars · 26,488 forks · Updated Jan 12, 2026

Tile primitives for speedy kernels

Cuda · 3,059 stars · 225 forks · Updated Jan 12, 2026

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ · 4,563 stars · 501 forks · Updated Jan 12, 2026