# zwdx

*distributed. accelerated. done.*
zwdx is a distributed deep learning platform that enables GPU sharing and collaborative training across multiple machines. Instead of being limited to your local hardware, zwdx allows you to:
- Distribute training across multiple GPUs from different contributors
- Share your idle GPUs with others who need compute power
- Submit training jobs without managing infrastructure
- Scale seamlessly from single GPU to multi-node distributed training
Unlike traditional distributed training solutions, zwdx separates compute providers (GPU clients) from compute consumers (job submitters), creating a flexible environment for sharing GPU resources.
## Table of Contents

- Features
- Architecture
- Prerequisites
- Installation
- Setup
- Quick Start
- Configuration
- Monitoring
- Security Considerations
- Limitations
- Legal Notice
- License
## Features

- ✅ Distributed Training: Built-in support for DDP / FSDP
- ✅ GPU Pooling: Aggregate GPU resources from multiple machines
- ✅ Room-based Access: Secure, token-based room system for private GPU sharing
- ✅ Framework Support: PyTorch native (additional frameworks coming soon)
- ✅ Simple API: Submit jobs with just a few lines of code
- ✅ Real-time Monitoring: Track training progress and metrics
- ✅ Job Management: Query past jobs
- ✅ Docker-based: Consistent environment across all GPU clients
## Architecture

```
         ┌─────────────────┐
         │    zwdx User    │
         │   (Your Code)   │
         └────────┬────────┘
                  │
                  │ Submit Job
                  ▼
   ┌─────────────────────────────┐
   │           Server            │
   │  ┌───────────────────────┐  │
   │  │       Job Pool        │  │
   │  ├───────────────────────┤  │
   │  │       Room Pool       │  │
   │  ├───────────────────────┤  │
   │  │       Database        │  │
   │  └───────────────────────┘  │
   └──────────────┬──────────────┘
                  │
                  │ Distribute
                  │
     ┌────────────┼────────────┐
     ▼            ▼            ▼
┌─────────┐  ┌─────────┐  ┌─────────┐
│   GPU   │  │   GPU   │  │   GPU   │
│ Client  │  │ Client  │  │ Client  │
│ (Rank 0)│  │ (Rank 1)│  │ (Rank 2)│
└─────────┘  └─────────┘  └─────────┘
```
How it works:
- Server manages job pool, GPU client pool, room pool and communication
- GPU Clients join rooms using auth tokens and wait for work
- Job Submitters send training jobs to specific rooms
- Results are collected and returned to the job submitter
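Under a strategy like DDP, each GPU client effectively runs a standard PyTorch distributed worker for its assigned rank. The sketch below shows what that per-rank wiring conceptually looks like; zwdx handles this internally, and none of these names are part of its public API:

```python
# Conceptual sketch of per-rank DDP setup (zwdx performs this wiring itself).
# Assumes MASTER_ADDR/MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK are set,
# as in any standard torch.distributed launch.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run_worker(model: torch.nn.Module) -> None:
    dist.init_process_group(
        backend="nccl",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    ddp_model = DDP(model.cuda(), device_ids=[local_rank])
    # ... training loop: each rank processes its shard of the data,
    # and gradients are all-reduced across ranks on backward().
    dist.destroy_process_group()
```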
## Prerequisites

- Python: 3.12.4
- Docker: 28.4.0
- CUDA: 12.8
- Operating System: Linux, Windows (WSL2)
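A quick, informal way to check what your machine actually has (plain standard-library Python; the versions above are simply the ones the project lists):

```python
# Print local versions of the listed prerequisites.
import shutil
import subprocess
import sys

print("Python:", sys.version.split()[0])
for tool, args in [
    ("docker", ["docker", "--version"]),
    ("nvidia-smi", ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"]),
]:
    if shutil.which(tool):
        print(subprocess.run(args, capture_output=True, text=True).stdout.strip())
    else:
        print(f"{tool}: not found")
```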
## Installation

- Clone the repository:

```bash
git clone https://github.com/zenwor/zwdx.git
cd zwdx
```

- Install Python dependencies:

```bash
uv pip install --system -r requirements.txt
```

## Setup

Load environment variables before proceeding:
```bash
cd zwdx/zwdx
source ./setup.sh
```

### Server

The server manages GPU clients, job queues, the database, and UI communication. No GPUs are required on the server machine.
Start the server:

```bash
./run_all.sh  # inside zwdx/zwdx/
```

What runs:
- Flask server (port 4461)
- Database service (MongoDB, port 5561)
- Web UI (React.js, port 3000)
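To sanity-check that all three services came up, you can probe their ports with a generic TCP check (nothing zwdx-specific; adjust the ports if you changed the defaults):

```python
# Probe the default service ports on the server machine.
import socket

for name, port in [("Flask server", 4461), ("MongoDB", 5561), ("Web UI", 3000)]:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        status = "up" if s.connect_ex(("localhost", port)) == 0 else "down"
    print(f"{name} (port {port}): {status}")
```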
### GPU Client

GPU clients provide computing power to the server.
1. Pull the NVIDIA base image and build the container:

```bash
docker pull pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
docker build -t zwdx_gpu .
```

2. Launch the GPU client:

```bash
cd zwdx/gpu_client/
./run_gpu_client.sh -rt {ROOM_TOKEN}
```

Arguments:

- `-rt, --room_token`: Room authentication token (required)

Example:

```bash
./run_gpu_client.sh -rt my_room_token
```

## Quick Start

The ZWDX interface is designed to be intuitive and Pythonic:
```python
import torch

from zwdx import ZWDX

zwdx = ZWDX(server_url="http://localhost:4461")  # Flask port from Configuration below

result = zwdx.submit_job(
    model=YourModel(),                    # PyTorch model instance
    data_loader_func=create_data_loader,  # Function that returns a DataLoader
    train_func=train,                     # Training function
    eval_func=eval,                       # Evaluation function
    optimizer=torch.optim.AdamW(...),     # Optimizer instance
    parallelism="DDP",                    # Parallelism strategy
    memory_required=12_000_000_000,       # Minimum GPU memory in bytes
    room_token="your_room_token",         # Room authentication token
    epochs=10,                            # Number of training epochs
)

# Access results
print(result["job_id"])
print(result["results"]["final_loss"])

# Retrieve trained model
trained_model = zwdx.get_trained_model(result["job_id"], YourModel())
```

For a complete MNIST example, see test/mnist.py.
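The callables passed to submit_job are ordinary PyTorch code. Below is a hypothetical minimal set; the exact signatures zwdx invokes them with are not documented here, so treat these shapes as assumptions and defer to test/mnist.py:

```python
# Hypothetical implementations of the objects referenced in the Quick Start.
# Signatures are assumptions, not zwdx's documented contract.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class YourModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def create_data_loader() -> DataLoader:
    # Stand-in random data; real jobs would load an actual dataset here.
    x = torch.randn(1024, 1, 28, 28)
    y = torch.randint(0, 10, (1024,))
    return DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

def train(model, loader, optimizer):
    model.train()
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()

def eval(model, loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total
```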
## Configuration

Edit the .env file or set these environment variables:

```bash
# Server
## Flask
FLASK_HOST="0.0.0.0"
FLASK_PORT=4461
MASTER_PORT=29500
LT_SUBDOMAIN="zwdx"
LT_PORT=4461

## MongoDB
MONGODB_PORT=5561
MONGODB_DBPATH="./data/"

# Client
SERVER_URL="http://172.17.0.1:4461"
```
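If you drive zwdx from your own scripts, you can pick up the same variables rather than hardcoding URLs. A small sketch using plain os.environ (nothing zwdx-specific beyond the ZWDX constructor shown earlier):

```python
# Resolve the server URL from the environment, falling back to the default above.
import os

from zwdx import ZWDX

server_url = os.environ.get("SERVER_URL", "http://172.17.0.1:4461")
zwdx = ZWDX(server_url=server_url)
```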
## Monitoring

Server logs:

```bash
tail -f /var/log/zwdx/server.log
```

GPU client logs:

```bash
docker logs -f zwdx_gpu_client_container
```

Access real-time metrics via the web UI at http://localhost:3000 or programmatically:
```python
metrics = zwdx.get_job_metrics(job_id)
print(metrics["gpu_utilization"])
print(metrics["throughput"])  # samples/second
```
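For continuous monitoring, a hypothetical polling loop over the documented get_job_metrics call might look like this (watch_job and the 10-second interval are illustrative choices, not part of zwdx):

```python
# Poll job metrics until interrupted (Ctrl-C to stop).
import time

def watch_job(zwdx, job_id, interval_s=10):
    while True:
        metrics = zwdx.get_job_metrics(job_id)
        print(f"GPU util: {metrics['gpu_utilization']}, "
              f"throughput: {metrics['throughput']} samples/s")
        time.sleep(interval_s)
```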
The web UI provides:

- Job queue and current jobs
- Training metrics and loss curves
- Historical job performance
## Security Considerations

- Room tokens provide isolation between different groups
- Tokens should be treated as sensitive credentials
- Generate strong, random tokens for production use (see the sketch after this list)
- Rotate tokens periodically
- Training data is sent to GPU client machines
- Do not use zwdx for sensitive or confidential data unless you trust all GPU providers
- Model weights are transmitted between clients and the server
- Consider using encrypted communication channels for production
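One straightforward way to generate a strong token is Python's standard secrets module; zwdx does not prescribe a token format, so this is only a suggestion:

```python
# Generate a high-entropy, URL-safe room token (32 random bytes ≈ 43 characters).
import secrets

print(secrets.token_urlsafe(32))
```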
## Limitations

- Framework Support: PyTorch only (TensorFlow/JAX planned)
- Parallelism: DDP / FSDP only (more coming soon)
- Data Transfer: Large datasets must be pre-distributed to clients
## Legal Notice

⚠️ NVIDIA Software Usage

This project uses NVIDIA software. The base container is proprietary and must be pulled by each user separately:

```bash
docker pull pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
```

Do NOT redistribute the NVIDIA container.

See the NVIDIA Deep Learning Container License for complete terms.
## License

MIT License - see the LICENSE file for details.
Copyright (c) 2025 zenwor
Built with ❤️ by zenwor