Stars
🚀 Efficient implementations of state-of-the-art linear attention models
This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding code links.
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
A flexible framework powered by ComfyUI for generating personalized Nobel Prize images.
Cursor AI machine-ID reset and token-limit bypass (supports 0.49.x): automatically resets the machine ID to unlock Pro features for free, working around errors such as "You've reached your trial request limit." / "Too many free trial accounts used on this machi…"
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression (DAC'25)
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
Chinese translation of the LLMs-from-scratch project
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
REEF is a GPU-accelerated DNN inference serving system that enables instant kernel preemption and biased concurrent execution in GPU scheduling.
NVIDIA Linux open GPU kernel module source
eBPF Developer Tutorial: Learning eBPF Step by Step with Examples
Scalable long-context LLM decoding that exploits sparsity by treating the KV cache as a vector storage system.
[Lumina Embodied AI] A technical guide to embodied intelligence (Embodied-AI-Guide)
NEO is an LLM inference engine that alleviates the GPU memory crisis through CPU offloading
Awesome lists of framework diagrams (architecture figures) from research papers
A unified inference and post-training framework for accelerated video generation.
Flash attention tutorials written in Python, Triton, CUDA, and CUTLASS
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
Latest Advances on System-2 Reasoning
Pocket Flow: 100-line LLM framework. Let Agents build Agents!
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training