Stars
Assignments for CS146S: The Modern Software Dev (Stanford University Fall 2025)
Dragon distributed runtime for HPC and AI applications and workflows
Orchestration for Singularity containers (under development)
Portable WDL workflows for CZ ID production pipelines
A distributed storage benchmark for file systems, object stores & block devices with support for GPUs
LD_PRELOAD library to inject O_DIRECT into file I/O
VASTPY is the official Python SDK for the VAST Management System
Persistent remote applications for X11; screen sharing for X11, macOS and MS Windows.
Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!
Automatically split your PyTorch models on multiple GPUs for training & inference
A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
A taxonomy of Kubernetes configuration management tools
Reference implementations of MLPerf® training benchmarks
Reference implementations of MLPerf® inference benchmarks
RDMA client/server for transferring files using RDMA over IB
Scaling Data-Constrained Language Models
Scripts and documentation on scaling large language model training on the LUMI supercomputer
Cluster Toolkit is open-source software offered by Google Cloud that makes it easy for customers to deploy AI/ML and HPC environments on Google Cloud.
An open-source toolkit for deploying and managing high performance clusters for HPC, AI, and data analytics workloads.
Tool for running/managing ad hoc Spark clusters on a Slurm cluster
vCluster - Create fully functional virtual Kubernetes clusters - Each vcluster runs inside a namespace of the underlying k8s cluster. It's cheaper than creating separate full-blown clusters and it …
Monitoring and visualization of InfiniBand Fabrics
High Performance Linpack for GPUs (Using OpenCL, CUDA, CAL)
Optimized primitives for collective multi-GPU communication