Stars
A tiny deep learning training framework implemented from scratch in C++ that follows PyTorch's API.
An open-source cross-platform alternative to AirDrop
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.
High-performance, low-latency market trading application written in C++
Building Low Latency Applications with CPP by Packt Publishing
An easy-to-understand TensorOp Matmul tutorial
Distributed MoE in a Single Kernel [NeurIPS '25]
Fast and memory-efficient C++ flat hash table/map/set
CPU inference for the DeepSeek family of large language models in C++
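An app for learning English vocabulary through real context in movies, American TV series, or documents, so you memorize words in authentic situations and learn more efficiently.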
Local Qwen3 LLM inference. One easy-to-understand file of C source with no dependencies.
Static suckless single batch CUDA-only qwen3-0.6B mini inference engine
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
Open-source whiteboard tool (SaaS): an all-in-one whiteboard with mind maps, flowcharts, freehand drawing, and more.
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA (+ more DSLs)
LLaMA 2 implemented from scratch in PyTorch
Source code of The Standard C Library, by Plauger
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
🚧 An experimental communicating attention kernel based on DeepEP.
DeepEP: an efficient expert-parallel communication library
A simple, lightweight PowerShell script to remove pre-installed apps, disable telemetry, as well as perform various other changes to customize, declutter and improve your Windows experience. Win11D…