Stars
PyTorch implementation of JiT https://arxiv.org/abs/2511.13720
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
Distributed MoE in a Single Kernel [NeurIPS '25]
Minimal PyTorch implementation of TP, SP, FSDP, and sharded EMA
🔥 LLM-powered GPU kernel synthesis: train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation feedback, cross-platform NVIDIA/AMD support, KernelBook + KernelBench
Minimal implementation of EDM (Elucidating the Design Space of Diffusion-Based Generative Models) on CIFAR-10 and MNIST
Code for "What really matters in matrix-whitening optimizers?"
A foundation model to learn multiple physical systems at once
Code for "Transitive RL: Value Learning via Divide and Conquer"
🎨 Native AI image generation for Apple Silicon with Qwen-Image. Lightning LoRA acceleration for fast 4–8 step runs. Zero Docker, just works.
Chimera: State Space Models Beyond Sequences
brain-lab-research / LIB
Forked from Vepricov/LIB. Library for FO Optimization.
torchax is a PyTorch frontend for JAX. It lets you author JAX programs using familiar PyTorch syntax. It also provides JAX-PyTorch interoperability, meaning one can mix JAX & Pytor…
TPU inference for vLLM, with unified JAX and PyTorch support.
The official implementation of "Dual Goal Representations"
StreamingVLM: Real-Time Understanding for Infinite Video Streams
The simplest, fastest repository for training/finetuning small-sized VLMs.
Unofficial PyTorch implementation of the paper "Mean Flows for One-step Generative Modeling" by Geng et al.