distributed. accelerated. done.
zwdx is a distributed deep learning platform that enables GPU sharing and collaborative training across multiple machines. Instead of being limited to your local hardware, zwdx allows you to:
- Distribute training across multiple GPUs from different contributors
- Share your idle GPUs with others who need compute power
- Submit training jobs without managing infrastructure
- Scale seamlessly from single GPU to multi-node distributed training
Unlike traditional distributed training solutions, zwdx separates compute providers (GPU clients) from compute consumers (job submitters), creating a flexible environment for sharing GPU resources.
## Table of Contents

- Features
- Architecture
- Prerequisites
- Installation
- Setup
- Quick Start
- Configuration
- Monitoring
- Security Considerations
- Limitations
- Legal Notice
- License
## Features

- ✅ Distributed Training: Built-in support for DDP / FSDP
- ✅ GPU Pooling: Aggregate GPU resources from multiple machines
- ✅ Room-based Access: Secure, token-based room system for private GPU sharing
- ✅ Framework Support: PyTorch native (additional frameworks coming soon)
- ✅ Simple API: Submit jobs with just a few lines of code
- ✅ Real-time Monitoring: Track training progress and metrics
- ✅ Job Management: Query past jobs
- ✅ Docker-based: Consistent environment across all GPU clients
## Architecture

```
┌─────────────────┐
│    zwdx User    │
│   (Your Code)   │
└────────┬────────┘
         │
         │ Submit Job
         ▼
┌─────────────────────────────┐
│           Server            │
│   ┌─────────────────────┐   │
│   │      Job Pool       │   │
│   ├─────────────────────┤   │
│   │      Room Pool      │   │
│   ├─────────────────────┤   │
│   │      Database       │   │
│   └─────────────────────┘   │
└─────────────────┬───────────┘
                  │
                  │ Distribute
                  │
     ┌────────────┼────────────┐
     ▼            ▼            ▼
┌─────────┐  ┌─────────┐  ┌─────────┐
│   GPU   │  │   GPU   │  │   GPU   │
│ Client  │  │ Client  │  │ Client  │
│ (Rank 0)│  │ (Rank 1)│  │ (Rank 2)│
└─────────┘  └─────────┘  └─────────┘
```
How it works:
- Server manages job pool, GPU client pool, room pool and communication
- GPU Clients join rooms using auth tokens and wait for work
- Job Submitters send training jobs to specific rooms
- Results are collected and returned to the job submitter
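To make this flow concrete, here is a minimal sketch of the client side of the protocol. The endpoint paths and payloads (`/join`, `/poll`) are hypothetical and chosen purely for illustration; the real wire protocol lives inside zwdx's GPU client container.

```python
import time

import requests

SERVER_URL = "http://localhost:4461"
ROOM_TOKEN = "my_room_token"

# Hypothetical endpoints -- illustrative only, not zwdx's actual API.
# 1. Join a room with an auth token.
resp = requests.post(f"{SERVER_URL}/join", json={"room_token": ROOM_TOKEN})
client_id = resp.json()["client_id"]

# 2. Wait for work; run assigned jobs and report results back.
while True:
    job = requests.get(f"{SERVER_URL}/poll", params={"client_id": client_id}).json()
    if job:
        pass  # run the assigned training shard, then POST results back
    time.sleep(5)
```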
## Prerequisites

- Python: 3.12.4
- Docker: 28.4.0
- CUDA: 12.8
- Operating System: Linux, Windows (WSL2)
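A quick way to confirm the toolchain is visible before installing (a sketch; it assumes `python`, `docker`, and `nvidia-smi` are on your PATH):

```python
import subprocess

# Print each tool's version; a missing executable means it is not
# installed or not on PATH.
for cmd in (
    ["python", "--version"],
    ["docker", "--version"],
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
):
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        print(f"{cmd[0]}: {out.stdout.strip()}")
    except (FileNotFoundError, subprocess.CalledProcessError):
        print(f"{cmd[0]}: not available")
```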
## Installation

- Clone the repository:

```bash
git clone https://github.com/zenwor/zwdx.git
cd zwdx
```

- Install Python dependencies:

```bash
uv pip install --system -r requirements.txt
```

## Setup

Load environment variables before proceeding:
```bash
cd zwdx/zwdx
source ./setup.sh
```

### Server

The server manages GPU clients, job queues, the database, and UI communication. No GPUs are required on the server machine.
Start the server:

```bash
./run_all.sh  # inside zwdx/zwdx/
```

What runs:
- Flask server (port 4461)
- Database service (MongoDB, port 5561)
- Web UI (React.js, port 3000)
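To sanity-check that the HTTP services came up, you can probe their ports; the snippet below is a sketch that treats any HTTP response (even a 404) as "listening", since no health endpoint is documented.

```python
import requests

# Probe each HTTP service; a ConnectionError means nothing is listening.
# The "/" paths are guesses -- no health endpoint is documented.
for name, url in [
    ("Flask server", "http://localhost:4461/"),
    ("Web UI", "http://localhost:3000/"),
]:
    try:
        r = requests.get(url, timeout=3)
        print(f"{name}: up (HTTP {r.status_code})")
    except requests.ConnectionError:
        print(f"{name}: not reachable")
```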
### GPU Client

GPU clients provide computing power to the server.
1. Pull the NVIDIA base image and build the container:

```bash
docker pull pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
docker build -t zwdx_gpu .
```

2. Launch the GPU client:

```bash
cd zwdx/gpu_client/
./run_gpu_client.sh -rt {ROOM_TOKEN}
```

Arguments:

- `-rt, --room_token`: Room authentication token (required)
Example:

```bash
./run_gpu_client.sh -rt my_room_token
```

## Quick Start

The ZWDX interface is designed to be intuitive and Pythonic:
```python
from zwdx import ZWDX

zwdx = ZWDX(server_url="http://localhost:4461")

result = zwdx.submit_job(
    model=YourModel(),                    # PyTorch model instance
    data_loader_func=create_data_loader,  # Function that returns a DataLoader
    train_func=train,                     # Training function
    eval_func=evaluate,                   # Evaluation function
    optimizer=torch.optim.AdamW(...),     # Optimizer instance
    parallelism="DDP",                    # Parallelism strategy
    memory_required=12_000_000_000,       # Minimum GPU memory in bytes
    room_token="your_room_token",         # Room authentication token
    epochs=10,                            # Number of training epochs
)

# Access results
print(result["job_id"])
print(result["results"]["final_loss"])

# Retrieve the trained model
trained_model = zwdx.get_trained_model(result["job_id"], YourModel())
```

For a complete MNIST example, see test/mnist.py.
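The shapes zwdx expects for `data_loader_func`, `train_func`, and `eval_func` are not spelled out above, so the following is only a rough sketch under assumed signatures (loader, model, optimizer, and device supplied by the framework); treat test/mnist.py as authoritative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical signatures -- verify against test/mnist.py before relying on them.

def create_data_loader(batch_size=64):
    # Toy stand-in dataset; replace with your real data pipeline.
    xs = torch.randn(1024, 28 * 28)
    ys = torch.randint(0, 10, (1024,))
    return DataLoader(TensorDataset(xs, ys), batch_size=batch_size, shuffle=True)

def train(model, data_loader, optimizer, device):
    model.train()
    total_loss = 0.0
    for xs, ys in data_loader:
        xs, ys = xs.to(device), ys.to(device)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(xs), ys)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(data_loader)  # mean training loss

def evaluate(model, data_loader, device):
    model.eval()
    correct = 0
    with torch.no_grad():
        for xs, ys in data_loader:
            xs, ys = xs.to(device), ys.to(device)
            correct += (model(xs).argmax(dim=1) == ys).sum().item()
    return correct / len(data_loader.dataset)  # accuracy
```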
## Configuration

Edit the .env file or set these environment variables:

```bash
# Server
## Flask
FLASK_HOST="0.0.0.0"
FLASK_PORT=4461
MASTER_PORT=29500
LT_SUBDOMAIN="zwdx"
LT_PORT=4461

## MongoDB
MONGODB_PORT=5561
MONGODB_DBPATH="./data/"

# Client
SERVER_URL="http://172.17.0.1:4461"
```

## Monitoring

Server logs:

```bash
tail -f /var/log/zwdx/server.log
```

GPU client logs:

```bash
docker logs -f zwdx_gpu_client_container
```

Access real-time metrics via the web UI at http://localhost:3000 or programmatically:

```python
metrics = zwdx.get_job_metrics(job_id)
print(metrics["gpu_utilization"])
print(metrics["throughput"]) # samples/secondThe web UI provides:
- Job queue and current jobs
- Training metrics and loss curves
- Historical job performance
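Building on `get_job_metrics` above, a simple way to watch a running job is to poll it in a loop. This is a sketch that assumes the call is cheap and safe to repeat while the job runs (the docs do not confirm this).

```python
import time

job_id = result["job_id"]

# Poll metrics every 10 seconds for ~5 minutes.
for _ in range(30):
    metrics = zwdx.get_job_metrics(job_id)
    print(f"util={metrics['gpu_utilization']}%, "
          f"throughput={metrics['throughput']} samples/s")
    time.sleep(10)
```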
## Security Considerations

- Room tokens provide isolation between different groups
- Tokens should be treated as sensitive credentials
- Generate strong, random tokens for production use (see the sketch after this list)
- Rotate tokens periodically
- Training data is sent to GPU client machines
- Do not use zwdx for sensitive or confidential data unless you trust all GPU providers
- Model weights are transmitted between clients and the server
- Consider using encrypted communication channels in production
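zwdx does not appear to prescribe a token format, so here is a minimal sketch of generating a strong random room token with Python's standard library:

```python
import secrets

# 32 bytes of randomness, URL-safe -- suitable as an opaque room token.
room_token = secrets.token_urlsafe(32)
print(room_token)
```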
## Limitations

- Framework Support: PyTorch only (TensorFlow/JAX planned)
- Parallelism: DDP / FSDP only (more coming soon)
- Data Transfer: Large datasets must be pre-distributed to clients
## Legal Notice

⚠️ NVIDIA Software Usage

This project uses NVIDIA software. The base container is proprietary and must be pulled by each user separately:

```bash
docker pull pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
```

Do NOT redistribute the NVIDIA container. See the NVIDIA Deep Learning Container License for complete terms.
## License

MIT License - see LICENSE file for details.
Copyright (c) 2025 zenwor
Built with ❤️ by zenwor