Organizations

@huggingface

Quantized LLM training in pure CUDA/C++.

C++ 202 10 Updated Oct 11, 2025

Separate from hardware and used to learn some NCCL mechanisms

C++ 23 6 Updated Apr 19, 2024

An effort to "ssh into my Sony camera"

C++ 216 5 Updated Oct 16, 2025

Reduce kernel based on CUTLASS CuTe and TMA.

Cuda 9 Updated Sep 25, 2025

Code for the 9/6 Hackathon

Jupyter Notebook 41 21 Updated Sep 9, 2025

RAM is all you need

Python 180 14 Updated Oct 15, 2025

AI-powered dialogue generation for Animal Crossing villagers using LLMs

Python 282 11 Updated Sep 17, 2025
Python 893 91 Updated Oct 17, 2025

Efficient Transformers computation kernels lowering through MLIR

C++ 3 Updated Oct 15, 2025

🔀 yet another mixture of experts

Python 21 1 Updated Sep 19, 2025

A minimal tensor processing unit (TPU), inspired by Google's TPU V2 and V1

SystemVerilog 965 75 Updated Aug 21, 2025
Python 88 4 Updated Sep 19, 2024

A Quirky Assortment of CuTe Kernels

Python 627 49 Updated Oct 11, 2025

Simple MPI implementation for prototyping or learning

C 286 11 Updated Aug 6, 2025

Experimental repository for research implementation of NoLoCo.

Python 24 4 Updated Jun 15, 2025

AXI, AXI stream, Ethernet, and PCIe components in System Verilog

SystemVerilog 419 75 Updated Oct 16, 2025
Cuda 10 Updated Oct 2, 2025

Research sandbox for decentralized pipelined inference

Python 8 1 Updated May 13, 2025

Atropos is a Language Model Reinforcement Learning Environments framework for collecting and evaluating LLM trajectories through diverse environments

Python 716 165 Updated Oct 16, 2025

Async RL Training at Scale

Python 707 113 Updated Oct 16, 2025

Scripts and instructions for replicating the original FineWeb experiments on LUMI

Shell 8 Updated Apr 25, 2025

Muon fsdp 2

Python 44 4 Updated Aug 8, 2025

Analyze computation-communication overlap in V3/R1.

1,105 143 Updated Mar 21, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.

Python 2,868 304 Updated Mar 10, 2025

Expert Parallelism Load Balancer

Python 1,277 196 Updated Mar 24, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,797 717 Updated Oct 15, 2025

Where GPUs get cooked 👩‍🍳🔥

Rust 293 15 Updated Sep 17, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,612 956 Updated Oct 17, 2025