Stars
Distributed Compiler based on Triton for Parallel Systems
🚀 Efficient implementations of state-of-the-art linear attention models
Fastest CRC32 for x86, Intel and AMD, + comprehensive derivation and discussion of various approaches
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance…
Hackable and optimized Transformers building blocks, supporting a composable construction.
Transformer related optimization, including BERT, GPT
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
Library providing helpers for the Linux kernel io_uring support
A performant and modular runtime for TensorFlow
PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
Simple Training and Deployment of Fast End-to-End Binary Networks
bertmaher / pytorch
Forked from pytorch/pytorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
The Tensor Algebra SuperOptimizer for Deep Learning
darktable is an open source photography workflow application and raw developer
Productive, portable, and performant GPU programming in Python.
HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training
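
As a small aside on the fast-CRC32 entry above: the checksum being optimized is the standard CRC-32 (the zlib/IEEE polynomial), which can be computed from Python's standard library. This sketch is purely illustrative and is not code from any of the listed repositories:

```python
import zlib

# CRC-32 (zlib/IEEE polynomial) of a byte string, the same checksum
# that the x86-optimized implementations above compute much faster.
checksum = zlib.crc32(b"hello")
print(hex(checksum))  # 0x3610a686

# Incremental use: feed chunks and carry the running value forward.
running = zlib.crc32(b"hel")
running = zlib.crc32(b"lo", running)
assert running == checksum
```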