Differentiable ML with GPU multiplexing across users and nodes
GT is a client-server system for sharing GPU compute across multiple users. It features client-side autograd, server-side scheduling, and distributed workers.
GT uses a clean 3-tier architecture with multi-client support:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Client 1 │ │ Client 2 │ │ Client N │ (Multiple users)
│ (Simple) │ │ (Simple) │ │ (Simple) │
│ │ │ │ │ │
│ • Autograd │ │ • Autograd │ │ • Autograd │
│ tape │ │ tape │ │ tape │
│ • backward()│ │ • backward()│ │ • backward()│
│ • Send cmds │ │ • Send cmds │ │ • Send cmds │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└─────────────────┼─────────────────┘
▼
┌───────────────────────────────────────┐
│ SERVER (Smart, Single) │
│ │
│ • Receives commands from all clients │
│ • Builds DAG per client session │
│ • Schedules execution across workers │
│ • Multiplexes GPUs across users │
│ • Future: fusion, multi-GPU split │
└───────────────┬───────────────────────┘
│
┌───────────────┼───────────────────────┐
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ...
│Worker 0 │ │Worker 1 │ │Worker 2 │ (1 per GPU)
│(Dumb) │ │(Dumb) │ │(Dumb) │
│ │ │ │ │ │
│GPU 0 │ │GPU 1 │ │GPU 2 │
│cuda:0 │ │cuda:1 │ │cuda:2 │
└─────────┘ └─────────┘ └─────────┘
- Multiple clients, single server: Many users share one GT server daemon
- One worker per GPU: Each GPU gets a dedicated worker process
- Client is dirt simple: Just creates command descriptors and tracks autograd tape
- Autograd is client-side: Graph construction happens locally, execution on server
- Server is smart: Compilation, optimization, scheduling, and multi-user multiplexing
- Workers are dumb: Just execute kernels when told
import gt
# Set backend (numpy or torch)
gt.set_backend('torch', device='cpu')
# Build computation graph
x = gt.randn(128, 256, requires_grad=True)
w = gt.randn(256, 512, requires_grad=True)
y = x @ w
# Forward pass - commands sent to server
result = y.data # shape: (128, 512)
# Backward pass - client generates gradient operations
y.backward()
# Gradients computed on server, available on client
print(f"x.grad shape: {x.grad.data.shape}") # (128, 256)
print(f"w.grad shape: {w.grad.data.shape}") # (256, 512)What happens under the hood:
- Auto-connection: GT tries to connect to daemon at
localhost:29501 - Auto-spawn: If no daemon found, spawns local server automatically
- Command streaming: Operations are sent as commands to server
- Lazy execution: Server schedules and executes when
.datais accessed - Gradient generation:
backward()walks graph and creates gradient ops
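Roughly, the auto-connect fallback works like the sketch below (illustrative only, not GT's actual code; the socket probing and the connect_or_spawn name are assumptions):

import socket
import subprocess
import time

def connect_or_spawn(host="localhost", port=29501, timeout=1.0):
    """Return once a gt-server daemon is reachable, spawning one locally if needed."""
    def daemon_reachable():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if not daemon_reachable():
        # No daemon found: start a local server in the background
        subprocess.Popen(["gt-server"])
        # Give the new daemon a moment to bind its port
        for _ in range(20):
            if daemon_reachable():
                break
            time.sleep(0.25)
    return host, port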
# Basic installation (NumPy backend)
pip install -e .
# With PyTorch backend (GPU support)
pip install -e ".[torch]"
# Development tools
pip install -e ".[dev]"
# For running examples
pip install -e ".[examples]"# Start daemon with default settings (8 workers)
gt-server &
# Custom worker count
gt-server --workers 4 &
# CPU-only mode (no GPU workers)
gt-server --no-workers &
# Remote access (bind to all interfaces)
gt-server --host 0.0.0.0 --port 29501 &

import gt
# No explicit connection needed!
# GT automatically:
# 1. Tries to connect to localhost:29501
# 2. If not found, spawns local server
# 3. Starts executing operations
x = gt.randn(1024, 1024)
y = x @ x
print(y.data.shape)  # (1024, 1024)

GT is designed for multiple users sharing GPUs:
# Start server once (with N GPUs = N workers)
gt-server &
# Multiple users connect automatically
python examples/multi_user.py user1 & # Client 1
python examples/multi_user.py user2 & # Client 2
python examples/multi_user.py user3 &  # Client 3

Architecture in action:
- 1 Server: Receives commands from all clients, schedules work
- N Workers: One worker per GPU (Worker 0 → GPU 0, Worker 1 → GPU 1, etc.)
- M Clients: Multiple users share the GPU pool fairly
Each client gets fair scheduling across available GPUs.
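One simple way to picture that fairness is a round-robin pass over per-client queues (an illustrative sketch; the queue layout and names below are assumptions, not GT's actual scheduler):

from collections import deque

def fair_schedule_tick(client_queues, idle_gpus):
    """One scheduling tick: dequeue at most one pending op per client and
    assign it to an idle GPU, so no single client can monopolize the pool."""
    assignments = []
    gpus = deque(idle_gpus)
    for client_id, queue in client_queues.items():
        if not queue or not gpus:
            continue
        assignments.append((client_id, queue.popleft(), gpus.popleft()))
    return assignments

# Example: two clients, two idle GPUs -> each client gets one op scheduled
queues = {"user1": deque(["matmul"]), "user2": deque(["add", "relu"])}
print(fair_schedule_tick(queues, idle_gpus=[0, 1]))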
import gt
# Explicitly connect to remote server
gt.connect(host='gpu-server.example.com', port=29501)
# Use normally
x = gt.randn(100, 100)
y = x @ x
print(y.data)
# Disconnect when done
gt.disconnect()

GT supports distributed workers across multiple nodes:
# On head node (has access to clients)
gt-server --host 0.0.0.0 --port 29501 &

# On gpu-node-1
gt-worker --server head-node:29501 --gpu 0 &
# On gpu-node-2
gt-worker --server head-node:29501 --gpu 0 &
gt-worker --server head-node:29501 --gpu 1 &
# On gpu-node-3 (with custom worker ID)
gt-worker --server head-node:29501 --gpu 0 --worker-id node3-gpu0 &

Workers connect to the server via TCP and register themselves as available for work.
# Random tensors
x = gt.randn(m, n, requires_grad=True)
# Ones
y = gt.ones(m, n)
# From data (numpy array or list)
z = gt.tensor([[1, 2], [3, 4]], requires_grad=True)

# Matrix operations
y = x @ w # Matrix multiplication
z = x + y # Element-wise addition
z = x - y # Element-wise subtraction
z = x * y # Element-wise multiplication
# Activations
z = gt.relu(x) # ReLU activation
# Reductions
loss = gt.mean(x) # Mean reduction
# Loss functions
loss = gt.mse_loss(pred, target) # MSE loss
# Utilities
y = gt.transpose(x)  # Matrix transpose

# Enable gradients
x = gt.randn(m, n, requires_grad=True)
w = gt.randn(n, k, requires_grad=True)
# Forward pass
y = x @ w
loss = gt.mean(y)
# Backward pass (client-side graph walk, generates gradient ops)
loss.backward()
# Access gradients
print(x.grad.data) # Gradient w.r.t. x
print(w.grad.data)  # Gradient w.r.t. w

# Set compute backend
gt.set_backend('torch', device='cpu') # PyTorch CPU
gt.set_backend('torch', device='cuda') # PyTorch GPU
gt.set_backend('numpy') # NumPy
# Set logging verbosity (0-3)
gt.set_verbosity(2)
# Connection management
gt.connect(host='remote-host', port=29501)
gt.disconnect()

import gt
gt.set_backend('torch', device='cpu')
# Matrix multiplication
x = gt.randn(4, 8)
w = gt.randn(8, 16)
y = x @ w
print(y.data.shape)  # (4, 16)

import gt
gt.set_backend('torch', device='cpu')
# Simple MLP
class MLP:
    def __init__(self):
        self.W1 = gt.randn(784, 128, requires_grad=True)
        self.W2 = gt.randn(128, 10, requires_grad=True)

    def forward(self, x):
        h = gt.relu(x @ self.W1)
        return h @ self.W2

model = MLP()

# Training loop
for epoch in range(10):
    # Forward
    x = gt.randn(32, 784)        # Batch of 32
    target = gt.randn(32, 10)
    pred = model.forward(x)
    loss = gt.mse_loss(pred, target)

    # Backward
    loss.backward()

    # (parameter update step omitted here for brevity; see the full example)
    print(f"Epoch {epoch}, Loss: {loss.data.item()}")

Full example: examples/train_mlp.py
# examples/multi_user.py
import gt
import sys
user_id = sys.argv[1] if len(sys.argv) > 1 else "user1"
gt.set_backend('torch', device='cpu')
print(f"[{user_id}] Building graph...")
x = gt.randn(4, 4)
w = gt.randn(4, 4)
y = x @ w
print(f"[{user_id}] Result: {y.data.shape}")Run: python examples/multi_user.py user1 &
The client maintains a local computation graph:
class Tensor:
    def __init__(self, node_id, input_tensors, grad_fn):
        self.node_id = node_id              # Unique ID for this node
        self.input_tensors = input_tensors  # Parent nodes (for backprop)
        self.grad_fn = grad_fn              # Function to compute gradients
        self._grad = None                   # Accumulated gradient

    def backward(self):
        """Walk graph in reverse topological order"""
        # 1. Build topological order
        # 2. Initialize output gradient
        # 3. Backpropagate: compute input grads from output grads
        # 4. Each gradient is a NEW operation sent to server

Key insight: Gradients are computed by creating new tensor operations that are sent to the server. The client doesn't do any actual computation - it just tracks the graph structure.
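A minimal, runnable version of that walk might look like this (a sketch under the assumptions above; in particular it assumes grad_fn(output_grad) returns one gradient tensor per input, each of which is itself a new GT operation):

import gt

def backward(root):
    """Reverse-topological walk that emits gradient ops rather than computing numbers."""
    # 1. Build a topological order with a depth-first traversal
    order, visited = [], set()
    def visit(t):
        if t.node_id in visited:
            return
        visited.add(t.node_id)
        for parent in t.input_tensors:
            visit(parent)
        order.append(t)
    visit(root)

    # 2. Seed the output gradient (itself just another op sent to the server)
    grads = {root.node_id: gt.ones(*root.data.shape)}

    # 3. Walk in reverse, creating gradient ops for each node's inputs
    for t in reversed(order):
        out_grad = grads.get(t.node_id)
        if out_grad is None or t.grad_fn is None:
            continue
        for parent, g in zip(t.input_tensors, t.grad_fn(out_grad)):
            # Gradient accumulation is just an 'add' command streamed to the server
            grads[parent.node_id] = g if parent.node_id not in grads else grads[parent.node_id] + g
            parent._grad = grads[parent.node_id]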
The server maintains a dependency graph:
class ExecutorServer:
    def __init__(self):
        self.nodes = {}  # node_id -> {inputs, op_type, data, output}

    def _register_node(self, node_id, inputs, op_type, data):
        """Client registers a new operation"""
        self.nodes[node_id] = {
            'inputs': inputs,
            'op_type': op_type,
            'data': data,
            'executed': False
        }

    def _schedule_nodes(self):
        """Execute nodes when inputs are ready"""
        for node_id, node in self.nodes.items():
            if node['executed']:
                continue
            # Check if all inputs are ready
            inputs_ready = all(
                self.nodes[inp_id]['executed']
                for inp_id in node['inputs']
            )
            if inputs_ready:
                self._execute_node(node_id, node)

GT uses a tape system that makes scheduling completely observable:
Client Operations → Input Tape → Scheduler → Output Queue → Workers
Three layers:
- Input Tape: Raw operations as received from clients (append-only log)
- Output Queue: Scheduled operations with GPU placement + MOVE ops injected
- Handles: Where each tensor currently lives (GPU tracking)
Scheduling algorithm:
- Leaf nodes (randn, ones): Round-robin across GPUs
- Compute nodes (matmul, add): Run on GPU where most inputs already live
- MOVE operations automatically injected when data on wrong GPU
Why tapes?
- Human-debuggable: Can print/inspect at any time
- Testable: Verify constraints on output queue
- Observable: See exactly what scheduler decided
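Under those rules, placing a single operation can be sketched like this (illustrative only; the handles dict, op layout, and place_op name are assumptions, not GT's real data structures):

from collections import Counter
from itertools import count

_round_robin = count()  # leaf placement counter

def place_op(op, handles, num_gpus, output_queue):
    """Choose a GPU for one op and inject MOVE ops for inputs on the wrong GPU.
    handles maps tensor id -> GPU where that tensor currently lives."""
    if not op["inputs"]:
        # Leaf node (randn, ones): round-robin across GPUs
        gpu = next(_round_robin) % num_gpus
    else:
        # Compute node (matmul, add): run where most inputs already live
        gpu = Counter(handles[i] for i in op["inputs"]).most_common(1)[0][0]
        for i in op["inputs"]:
            if handles[i] != gpu:
                output_queue.append({"op_type": "move", "tensor": i,
                                     "src": handles[i], "dst": gpu})
                handles[i] = gpu
    handles[op["id"]] = gpu
    output_queue.append({**op, "gpu_id": gpu})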
Workers connect to the server via TCP and execute tasks asynchronously:
# Worker registration
{
    'type': 'worker_register',
    'worker_id': 'node1-gpu0',
    'gpu_id': 0,
    'device': 'cuda:0',
    'hostname': 'gpu-node-1'
}

# Task dispatch
{
    'type': 'task',
    'task_id': 123,
    'op_type': 'matmul',
    'inputs': [tensor1, tensor2],
    'gpu_id': 0
}

# Result
{
    'type': 'result',
    'task_id': 123,
    'worker_id': 'node1-gpu0',
    'output': result_tensor,
    'status': 'success'
}

Features:
- Async execution: Workers execute independently
- Round-robin selection: Distributes work across idle workers on target GPU
- Fallback to local: Server executes locally if no workers available
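In terms of those messages, a remote worker's main loop might look roughly like the sketch below (assuming a line-delimited JSON framing for readability; run_kernel and the exact wire format are placeholders, not GT's real transport):

import json
import socket

def worker_loop(server_host, server_port, worker_id, gpu_id, run_kernel):
    """Register with the server, then execute dispatched tasks until the connection closes."""
    sock = socket.create_connection((server_host, server_port))
    stream = sock.makefile("rw")

    # Announce this worker as available for work
    stream.write(json.dumps({"type": "worker_register", "worker_id": worker_id,
                             "gpu_id": gpu_id, "device": f"cuda:{gpu_id}",
                             "hostname": socket.gethostname()}) + "\n")
    stream.flush()

    for line in stream:
        msg = json.loads(line)
        if msg.get("type") != "task":
            continue
        try:
            output, status = run_kernel(msg["op_type"], msg["inputs"], gpu_id), "success"
        except Exception:
            output, status = None, "error"
        stream.write(json.dumps({"type": "result", "task_id": msg["task_id"],
                                 "worker_id": worker_id, "output": output,
                                 "status": status}) + "\n")
        stream.flush()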
gt-project/
├── gt/
│ ├── __init__.py # Client API (Tensor, operations, autograd)
│ ├── server/
│ │ ├── server.py # Server daemon (tape-based scheduler)
│ │ ├── client.py # Client connection handler
│ │ ├── transport.py # Transport abstraction (TCP/UCX/SHM)
│ │ └── protocol.py # Wire protocol definitions
│ └── worker/
│ ├── engines/
│ │ ├── numpy.py # NumPy backend
│ │ └── torch.py # PyTorch backend
│ └── standalone.py # Remote worker process
├── examples/
│ ├── basic.py # Simple operations
│ ├── multi_user.py # Multi-user sharing
│ ├── train_mlp.py # MLP training with autograd
│ ├── test_tape_demo.py # Tape system demo
│ ├── test_tape_multi_gpu.py # Multi-GPU with MOVE ops
│ └── test_scheduler_debug.py # Debug mode with slow ticks
├── scripts/
│ └── visualize_trace.py # Debug trace visualizer
├── tests/
│ ├── test_basic.py # Basic operation tests
│ ├── test_execution.py # Async execution tests
│ ├── test_worker_dispatch.py # Worker dispatch tests
│ └── test_tape_scheduling.py # Scheduler constraint tests
├── docs/
│ ├── TAPE_SYSTEM.md # Tape-based scheduler architecture
│ ├── DEBUG_MODE.md # Debug mode documentation
│ ├── TESTING_TAPE_SYSTEM.md # Testing strategies
│ └── TRANSPORT.md # Transport layer design
└── pyproject.toml # Package configuration
- Client simplicity: No computation on client - just graph tracking
- Server intelligence: All optimization, compilation, scheduling happens here
- Worker dumbness: Workers just execute kernels - no decisions
- Separation of concerns:
- Autograd = client (just math/graph state)
- Compilation = server (smart stuff)
- Execution = workers (dumb kernel calls)
GT includes comprehensive debug mode for observing scheduler behavior:
from gt.server.server import ExecutorServer
server = ExecutorServer(
    debug=True,
    debug_dir='/tmp/gt_debug'
)
server.tick_rate_ms = 100  # Slow ticks for visibility
server.start()

# Print tick-by-tick trace
server.print_tick_trace()
# Print tape system state
server.print_input_tape() # What clients sent
server.print_output_queue() # How server scheduled it
server.print_handles() # Where data lives
# Save complete trace to JSON
trace_file = server.save_debug_trace()
# /tmp/gt_debug/trace_1234567890.json

# Run debug test with slow ticks
python examples/test_scheduler_debug.py
# Visualize saved trace
python scripts/visualize_trace.py /tmp/gt_debug/trace_*.json

Output shows:
- Tick-by-tick execution (when each node executed)
- Input tape (raw client operations)
- Output queue (scheduled operations with GPU placement)
- MOVE operations injected for cross-GPU data transfer
- Worker busy/idle status at each tick
Use cases:
- Understand scheduling decisions
- Debug performance issues
- Verify correctness
- Test scheduler changes
- Educational: see how distributed schedulers work
See docs/DEBUG_MODE.md for complete documentation.
- Tape-based scheduler with GPU affinity
- Worker dispatch with async execution
- Debug mode with tick-by-tick tracing
- Server-side graph optimization (fusion, reordering)
- Multi-GPU operation splitting (tensor parallelism)
- Smart worker selection and load balancing
- Persistent computation caching
- Gradient checkpointing
- Mixed precision training
- UCX transport with RDMA support
MIT
Contributions welcome! Key areas:
- Server optimization: Graph fusion, operator reordering
- Worker efficiency: Better kernel implementations
- Scheduling: Smarter work distribution
- Operations: More ops (conv2d, pooling, etc.)