Fastest kernels written from scratch

Cuda 382 53 Updated Sep 18, 2025

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

C++ 3,733 277 Updated Oct 25, 2025

SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention

Python 117 6 Updated Oct 22, 2025

Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend

Python 79 7 Updated Sep 27, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,829 720 Updated Oct 15, 2025
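The "fine-grained scaling" in a kernel like this means each small block of the FP8 operands carries its own scale factor, which is folded back in during accumulation. A minimal numpy sketch of the idea (an illustration only, not DeepGEMM's API; the block size, integer rounding as a stand-in for FP8 E4M3 rounding, and shapes are all simplifying assumptions):

```python
import numpy as np

FP8_MAX = 448.0  # max magnitude representable in FP8 E4M3
BLK = 128        # assumed scaling-block size along K

def quantize_blockwise(x, blk):
    """Quantize x along the last axis in blocks of `blk`, returning values and scales."""
    m, k = x.shape
    xb = x.reshape(m, k // blk, blk)
    scales = np.abs(xb).max(axis=-1, keepdims=True) / FP8_MAX
    scales = np.maximum(scales, 1e-12)  # avoid divide-by-zero on all-zero blocks
    q = np.round(xb / scales)           # crude stand-in for FP8 rounding
    return q, scales

def scaled_gemm(a, b):
    """C = A @ B, with per-block dequantization folded into the accumulation."""
    qa, sa = quantize_blockwise(a, BLK)
    qb, sb = quantize_blockwise(b.T, BLK)  # quantize B along K as well
    m, nblk, _ = qa.shape
    c = np.zeros((m, qb.shape[0]))
    for i in range(nblk):  # accumulate one K-block at a time, rescaling each block
        c += (qa[:, i] * sa[:, i]) @ (qb[:, i] * sb[:, i]).T
    return c

np.random.seed(0)
a = np.random.randn(4, 256)
b = np.random.randn(256, 8)
err = np.abs(scaled_gemm(a, b) - a @ b).max()
```

Per-block scales keep quantization error proportional to each block's local magnitude rather than the tensor-wide maximum, which is why fine-grained scaling matters for FP8.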

DeepEP: an efficient expert-parallel communication library

Cuda 8,656 966 Updated Oct 23, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.

Python 2,873 305 Updated Mar 10, 2025

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ 11,830 895 Updated Sep 30, 2025

Repository hosting code for "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (https://arxiv.org/abs/2402.17152).

Python 1,491 296 Updated Oct 16, 2025

Triton kernels for Flux

Python 22 Updated Jul 7, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention achieving a 2-5x speedup over FlashAttention without losing end-to-end accuracy across language, image, and video models.

Cuda 2,581 247 Updated Oct 25, 2025

A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.

Python 580 30 Updated Aug 12, 2025

Fast and memory-efficient exact attention

Python 20,171 2,086 Updated Oct 26, 2025
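For reference, the "exact attention" these kernels accelerate is just softmax(QK^T / sqrt(d)) V; flash-style kernels compute the same result while tiling the softmax so the full N x N score matrix is never materialized. A plain numpy version of the math (a sketch of the computation, not the library's API):

```python
import numpy as np

def attention(q, k, v):
    """Exact (non-approximate) scaled dot-product attention."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (N, N) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # rows are softmax weights summing to 1
    return w @ v                                  # convex combination of value rows

np.random.seed(0)
q = np.random.randn(6, 16)
k = np.random.randn(6, 16)
v = np.random.randn(6, 16)
out = attention(q, k, v)
```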

Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.

Python 266 47 Updated Oct 27, 2025

Development repository for the Triton language and compiler

MLIR 17,363 2,341 Updated Oct 27, 2025

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT

Python 2,874 368 Updated Oct 26, 2025

Free programming e-books and materials for developers; follow the personal WeChat official account: 编程与实战 (Programming and Practice).

4,712 1,186 Updated Apr 4, 2024

tensorrt-onnx build for Windows

C++ 5 1 Updated Aug 4, 2021

micronet, a model compression and deployment library. Compression: 1. quantization: quantization-aware training (QAT), high-bit (>2b) (DoReFa / Quantization and Training of Neural Networks for Efficient Integer-…

Python 2,262 478 Updated May 6, 2025

Detectron2 Faster R-CNN in C++ with Libtorch

C++ 13 2 Updated Aug 6, 2021

Two-stage CenterNet

Python 1,222 188 Updated Nov 20, 2022

RLE (run-length encoding) vs. Halcon vs. OpenCV

C++ 39 16 Updated Jun 25, 2021
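Run-length encoding, the technique being benchmarked above, compresses a sequence into (value, run-length) pairs; encoding binary-image rows this way is the typical vision use case. A minimal self-contained Python version for reference (the repo's own implementations are in C++):

```python
def rle_encode(seq):
    """Encode a sequence as (value, run_length) pairs."""
    runs = []
    for x in seq:
        if runs and runs[-1][0] == x:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([x, 1])       # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]

data = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
codes = rle_encode(data)   # [(0, 3), (1, 2), (0, 1), (1, 4)]
assert rle_decode(codes) == data
```

RLE only wins when runs are long, which is why it suits mostly-uniform binary masks rather than natural images.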

An easy to use PyTorch to TensorRT converter

Python 4,825 694 Updated Aug 17, 2024

Supports Yolov5 (4.0/5.0), YoloR, YoloX, Yolov4, Yolov3, CenterNet, CenterFace, RetinaFace, classification, and Unet; converts darknet/libtorch/pytorch/mxnet models to ONNX, then to TensorRT.

C++ 210 42 Updated Aug 2, 2021

Libtorch Examples

C++ 42 16 Updated Jul 16, 2021

🔥 (yolov3, yolov4, yolov5, unet, ...) A mini PyTorch inference framework inspired by darknet.

C++ 746 148 Updated Apr 23, 2023

Lightweight, portable, flexible distributed/mobile deep learning with a dynamic, mutation-aware dataflow dependency scheduler; for Python, R, Julia, Scala, Go, JavaScript, and more.

C++ 20,825 6,752 Updated Oct 25, 2023