Starred repositories
Run Slurm on Kubernetes. A Slinky project.
Open Source Continuous Inference Benchmarking - GB200 NVL72 vs MI355X vs B200 vs H200 vs MI325X & soon™ TPUv6e/v7/Trainium2/3/GB300 NVL72 - DeepSeek 670B MoE, GPTOSS
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…
Fast and memory-efficient exact attention
nvloom is a set of tools designed to scalably test MNNVL fabrics.
ATLAHS: An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage
Golang bindings for NVIDIA Data Center GPU Manager (DCGM)
DGXC Benchmarking provides recipes in ready-to-use templates for evaluating performance of specific AI use cases across hardware and software combinations.
Continuous Profiling Platform. Debug performance issues down to a single line of code
Perforator is a cluster-wide continuous profiling tool designed for large data centers
My own prompts for ChatGPT custom instructions
knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes.
Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.
Guides and examples to help achieve optimal performance on an NVIDIA Grace CPU
Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine
Benchmarking guide for the Azure AI Infrastructure.
A distributed storage benchmark for file systems, object stores & block devices with support for GPUs