Stars
Distributed Compiler based on Triton for Parallel Systems
🚀 Efficient implementations of state-of-the-art linear attention models
Fastest CRC32 for x86, Intel and AMD, + comprehensive derivation and discussion of various approaches
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance…
Hackable and optimized Transformers building blocks, supporting a composable construction.
Transformer related optimization, including BERT, GPT
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
Library providing helpers for the Linux kernel io_uring support
A performant and modular runtime for TensorFlow
PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
Simple Training and Deployment of Fast End-to-End Binary Networks
bertmaher / pytorch
Forked from pytorch/pytorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
The Tensor Algebra SuperOptimizer for Deep Learning
darktable is an open source photography workflow application and raw developer
Productive, portable, and performant GPU programming in Python.
HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training
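
As a small aside on the fast-CRC32 entry above: the checksum being optimized is the standard CRC-32 (the zlib/IEEE polynomial), which can be computed from Python's standard library. This sketch is purely illustrative and is not code from any of the listed repositories:

```python
import zlib

# CRC-32 (zlib/IEEE polynomial) of a byte string, the same checksum
# that the x86-optimized implementations above compute much faster.
checksum = zlib.crc32(b"hello")
print(hex(checksum))  # 0x3610a686

# Incremental use: feed chunks and carry the running value forward.
running = zlib.crc32(b"hel")
running = zlib.crc32(b"lo", running)
assert running == checksum
```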