Starred repositories
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
Examples of CUDA implementations using CUTLASS CuTe
Examples for Recommenders - easy to train and deploy on accelerated infrastructure.
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
Running large language models on a single GPU for throughput-oriented scenarios.
NEO is an LLM inference engine built to alleviate the GPU memory crisis through CPU offloading
A machine learning compiler for GPUs, CPUs, and ML accelerators
HugeCTR is a high-efficiency GPU framework designed for Click-Through Rate (CTR) estimation training
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
FlashMLA: Efficient Multi-head Latent Attention Kernels
My learning notes and code for ML systems (MLSys).
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Step-by-step optimization of CUDA SGEMM
flash attention tutorial written in python, triton, cuda, cutlass
How to optimize common algorithms in CUDA.
😎 A curated list of Python Asyncio resources, including frameworks, libraries, and software
A high-throughput and memory-efficient inference and serving engine for LLMs
Practice examples for Nsight Compute profiling (command line)
Instructions, Docker images, and examples for Nsight Compute and Nsight Systems
GLake: optimizing GPU memory management and IO transmission.
Applied AI experiments and examples for PyTorch