deepseek-ai / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
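The "fine-grained scaling" in this description means the FP8 operands carry one scale factor per small block (the DeepSeek-V3 recipe uses 1x128 activation groups and 128x128 weight blocks) rather than a single scale per tensor, so an outlier in one block does not wash out precision everywhere else. The snippet below is a minimal PyTorch emulation of that idea, not DeepGEMM's API; the group size of 128, the e4m3 format, and the helper names are assumptions for illustration.

```python
# Minimal PyTorch emulation of fine-grained (per-block) FP8 scaling.
# NOT DeepGEMM's API: group size 128 and the e4m3 format are assumptions,
# and a real kernel would multiply FP8 tiles directly and apply the
# scales inside the accumulation loop instead of dequantizing up front.
import torch

FP8_MAX = 448.0   # largest finite value of torch.float8_e4m3fn
GROUP = 128       # elements per scaling group along the K dimension

def quantize_groupwise(x: torch.Tensor):
    """Quantize an (M, K) tensor to FP8 with one scale per 1x128 group."""
    m, k = x.shape
    g = x.view(m, k // GROUP, GROUP)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (g / scale).to(torch.float8_e4m3fn), scale

def dequantize_groupwise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scale).reshape(q.shape[0], -1)

a = torch.randn(256, 512)
b = torch.randn(512, 384)
aq, a_scale = quantize_groupwise(a)
bq, b_scale = quantize_groupwise(b.t().contiguous())   # scale B along K too
out = dequantize_groupwise(aq, a_scale) @ dequantize_groupwise(bq, b_scale).t()
print((out - a @ b).abs().max())   # small error relative to the FP32 GEMM
```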
See what the GitHub community is most excited about this week.
LLM training in simple, raw C/CUDA
DeepEP: an efficient expert-parallel communication library
CUDA-accelerated rasterization of Gaussian splatting
Tile primitives for speedy kernels
GPU-accelerated decision optimization
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Instant neural graphics primitives: lightning-fast NeRF and more
Causal depthwise conv1d in CUDA, with a PyTorch interface (a reference sketch of the operation follows this list)
NCCL Tests
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
How to optimize some algorithms in CUDA.
cuVS - a library for vector search and clustering on the GPU
Fast CUDA matrix multiplication from scratch
Published in Nature Machine Intelligence! The first real robot (a quadrotor) trained with differentiable physics.
CUDA Kernel Benchmarking Library
[ICML2025] SpargeAttention: a training-free sparse attention method that accelerates inference for any model.
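For the causal depthwise conv1d entry above, the operation itself fits in a few lines of PyTorch: left-pad the sequence by the kernel width minus one and convolve each channel with its own filter. The sketch below is a plain reference implementation of that math, not the CUDA library's interface; the (batch, channels, length) layout and the function name are assumptions.

```python
# Reference PyTorch sketch of a causal depthwise conv1d -- the operation the
# CUDA library above accelerates. Not that library's interface; the
# (batch, channels, length) layout and function name are assumptions.
import torch
import torch.nn.functional as F

def causal_depthwise_conv1d(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """x: (B, C, L); weight: (C, K), one length-K filter per channel."""
    c, k = weight.shape
    x = F.pad(x, (k - 1, 0))                 # left-pad so the output stays causal
    return F.conv1d(x, weight.unsqueeze(1),  # weight reshaped to (C, 1, K)
                    groups=c)                # groups=C makes it depthwise

x = torch.randn(2, 64, 128)                  # batch 2, 64 channels, length 128
w = torch.randn(64, 4)                       # kernel width 4
print(causal_depthwise_conv1d(x, w).shape)   # torch.Size([2, 64, 128])
```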