Xiaoyu Zhang BBuf

Working at Skywork.AI and the creator of GiantPandaCV official account.

2.2k followers · 57 following

SkyWork
ChengDu
www.giantpandacv.com

cache-dit Public
Forked from vipshop/cache-dit

🤗A PyTorch-native Inference Engine with Hybrid Cache Acceleration and Parallelism for DiTs: Z-Image, FLUX2, Qwen-Image, etc.

Python Apache License 2.0 Updated Jan 13, 2026
how-to-optim-algorithm-in-cuda Public

How to optimize some algorithms in CUDA.

cuda llm

Cuda 2,753 249 Updated Jan 8, 2026
vllm Public
Forked from vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 1 Apache License 2.0 Updated Nov 29, 2025
sglang Public
Forked from sgl-project/sglang

SGLang is a fast serving framework for large language models and vision language models.

Python 1 Apache License 2.0 Updated Nov 12, 2025
gpu-glossary-zh Public

https://bbuf.github.io/gpu-glossary-zh/

Python 25 Other Updated Nov 7, 2025
tilelang Public
Forked from tile-ai/tilelang

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

C++ 1 Other Updated Oct 9, 2025
llm_benchmark Public
Forked from lvhan028/llm_benchmark

Python MIT License Updated Sep 9, 2025
lm-sys.github.io Public
Forked from lm-sys/lm-sys.github.io

JavaScript Other Updated Sep 1, 2025
flashinfer Public
Forked from flashinfer-ai/flashinfer

FlashInfer: Kernel Library for LLM Serving

Cuda Apache License 2.0 Updated Jul 14, 2025
Awesome-ML-SYS-Tutorial Public
Forked from zhaochenyang20/Awesome-ML-SYS-Tutorial

My learning notes/codes for ML SYS.

Python 5 Apache License 2.0 Updated May 5, 2025
Panzhihua-Mi-Yi-Pipa Public

If you want to purchase Panzhihua Mi Yi Pipa, please contact me.

8 MIT License Updated Mar 24, 2025
tvm_mlir_learn Public

A collection of compiler learning resources.

Python 2,653 363 Updated Mar 19, 2025
PanZhiHua_MiYi_PiPa Public

Updated Mar 1, 2025
DeepGEMM Public
Forked from deepseek-ai/DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda MIT License Updated Feb 27, 2025
ml-engineering Public
Forked from stas00/ml-engineering

Machine Learning Engineering Open Book

Python 1 Creative Commons Attribution Share Alike 4.0 International Updated Feb 19, 2025
tensorrt-llm-moe Public

C++ 33 2 Updated Feb 3, 2025
HunyuanVideo Public
Forked from Tencent-Hunyuan/HunyuanVideo

HunyuanVideo: A Systematic Framework For Large Video Generation Model

Python Other Updated Dec 20, 2024
cfx-article-src Public
Forked from ColfaxResearch/cfx-article-src

C++ 1 Updated Dec 20, 2024
ao Public
Forked from pytorch/ao

PyTorch native quantization and sparsity for training and inference

Python 1 BSD 3-Clause "New" or "Revised" License Updated Oct 31, 2024
flash-attention Public
Forked from Dao-AILab/flash-attention

Fast and memory-efficient exact attention

Python 1 BSD 3-Clause "New" or "Revised" License Updated Oct 8, 2024
TiledCUDA Public
Forked from TiledTensor/TiledCUDA

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

C++ MIT License Updated Sep 6, 2024
ArmNeonOptimization Public

arm-neon

C++ 92 23 Updated Aug 2, 2024
RWKV-World-HF-Tokenizer Public

Python 34 5 Updated Jul 21, 2024
giantpandacv.com Public archive

www.giantpandacv.com

Python 154 31 Other Updated Jun 20, 2024
Image-processing-algorithm Public

Paper implementations.

opencv retinex opencv3 correction-algorithm video-dehazin

C++ 965 284 Updated Jun 11, 2024
deepseekv2-profile Public
Forked from madsys-dev/deepseekv2-profile

Jupyter Notebook Updated May 31, 2024
flash-rwkv Public

Python 32 2 Updated May 26, 2024
accelerate Public
Forked from huggingface/accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support

Python 1 Apache License 2.0 Updated May 13, 2024
nndeploy Public
Forked from nndeploy/nndeploy

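nndeploy is an end-to-end model deployment framework. Built around multi-platform inference and model deployment based on directed acyclic graphs, it aims to give users a cross-platform, easy-to-use, high-performance model deployment experience.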

C++ 2 Apache License 2.0 Updated Apr 16, 2024
kineto Public
Forked from pytorch/kineto

A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.

HTML Other Updated Apr 15, 2024
