View yinghai's full-sized avatar

A Quirky Assortment of CuTe Kernels

Python 675 61 Updated Nov 21, 2025

Distributed Compiler based on Triton for Parallel Systems

Python 1,251 107 Updated Nov 18, 2025

🚀 Efficient implementations of state-of-the-art linear attention models

Python 3,930 313 Updated Nov 27, 2025

PyTorch Single Controller

Rust 905 109 Updated Nov 27, 2025

Fastest CRC32 for x86, Intel and AMD, + comprehensive derivation and discussion of various approaches

C++ 328 29 Updated Apr 25, 2021

Bindings for RDMA ibverbs through rdma-core

Rust 193 54 Updated Nov 13, 2025

A high-performance distributed file system designed to address the challenges of AI training and inference workloads.

C++ 9,485 967 Updated Oct 24, 2025
Python 7 1 Updated Jul 11, 2024

AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

Python 4,694 382 Updated Oct 27, 2025

Foundation Architecture for (M)LLMs

Python 3,121 222 Updated Apr 11, 2024

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance…

Python 2,957 565 Updated Nov 27, 2025

Hackable and optimized Transformers building blocks, supporting a composable construction.

Python 10,129 742 Updated Nov 19, 2025

Multithreaded Python without the GIL

Python 2,919 104 Updated May 20, 2025

Text describing xv6 on RISC-V

TeX 797 185 Updated Sep 2, 2025

Transformer related optimization, including BERT, GPT

C++ 6,354 923 Updated Mar 27, 2024

A flexible and efficient deep neural network (DNN) compiler that generates a high-performance executable from a DNN model description.

C++ 996 165 Updated Sep 19, 2024

Library providing helpers for the Linux kernel io_uring support

C 3,443 482 Updated Nov 26, 2025

A performant and modular runtime for TensorFlow

C++ 757 123 Updated Sep 4, 2025

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT

Python 2,894 370 Updated Nov 26, 2025

Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training

C++ 1,842 245 Updated Nov 17, 2025

Simple Training and Deployment of Fast End-to-End Binary Networks

Jupyter Notebook 159 21 Updated Feb 1, 2022

Shared courseware for competitive programming

4,323 793 Updated Sep 23, 2025

Tensors and Dynamic neural networks in Python with strong GPU acceleration

C++ 3 Updated Apr 11, 2025

The Tensor Algebra SuperOptimizer for Deep Learning

C++ 730 93 Updated Jan 26, 2023

darktable is an open source photography workflow application and raw developer

C 11,664 1,256 Updated Nov 27, 2025

Productive, portable, and performant GPU programming in Python.

C++ 27,754 2,371 Updated Oct 6, 2025

HugeCTR is a high efficiency GPU framework designed for Click-Through-Rate (CTR) estimating training

C++ 1,037 203 Updated Sep 15, 2025