GT - GPU Time Multiplexer

Differentiable ML with GPU multiplexing across users and nodes

GT is a client-server system for sharing GPU compute across multiple users. It features client-side autograd, server-side scheduling, and distributed workers.

Architecture

GT uses a clean 3-tier architecture with multi-client support:

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Client 1    │  │  Client 2    │  │  Client N    │  (Multiple users)
│  (Simple)    │  │  (Simple)    │  │  (Simple)    │
│              │  │              │  │              │
│  • Autograd  │  │  • Autograd  │  │  • Autograd  │
│    tape      │  │    tape      │  │    tape      │
│  • backward()│  │  • backward()│  │  • backward()│
│  • Send cmds │  │  • Send cmds │  │  • Send cmds │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
         ┌───────────────────────────────────────┐
         │         SERVER (Smart, Single)        │
         │                                       │
         │  • Receives commands from all clients │
         │  • Builds DAG per client session      │
         │  • Schedules execution across workers │
         │  • Multiplexes GPUs across users      │
         │  • Future: fusion, multi-GPU split    │
         └───────────────┬───────────────────────┘
                         │
         ┌───────────────┼───────────────────────┐
         ▼               ▼               ▼       ▼
    ┌─────────┐    ┌─────────┐    ┌─────────┐  ...
    │Worker 0 │    │Worker 1 │    │Worker 2 │  (1 per GPU)
    │(Dumb)   │    │(Dumb)   │    │(Dumb)   │
    │         │    │         │    │         │
    │GPU 0    │    │GPU 1    │    │GPU 2    │
    │cuda:0   │    │cuda:1   │    │cuda:2   │
    └─────────┘    └─────────┘    └─────────┘

Key Design Principles

  1. Multiple clients, single server: Many users share one GT server daemon
  2. One worker per GPU: Each GPU gets a dedicated worker process
  3. Client is dirt simple: Just creates command descriptors and tracks autograd tape
  4. Autograd is client-side: Graph construction happens locally, execution on server
  5. Server is smart: Compilation, optimization, scheduling, and multi-user multiplexing
  6. Workers are dumb: Just execute kernels when told
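
To make the split concrete, the sketch below shows the rough shape of a command descriptor a client might send for y = x @ w. The field names mirror the node registration shown later in Server-Side DAG Scheduling; the actual wire format lives in gt/server/protocol.py, so treat this as an illustration rather than the real protocol.

# Illustrative client -> server command for `y = x @ w`.
# Field names follow the server-side node registration shown below;
# the real wire format is defined in gt/server/protocol.py.
{
    'type': 'op',
    'node_id': 'y_2',          # unique ID assigned by the client
    'op_type': 'matmul',
    'inputs': ['x_0', 'w_1'],  # parent node IDs
}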

Quick Start

import gt

# Set backend (numpy or torch)
gt.set_backend('torch', device='cpu')

# Build computation graph
x = gt.randn(128, 256, requires_grad=True)
w = gt.randn(256, 512, requires_grad=True)
y = x @ w

# Forward pass - commands sent to server
result = y.data  # shape: (128, 512)

# Backward pass - client generates gradient operations
y.backward()

# Gradients computed on server, available on client
print(f"x.grad shape: {x.grad.data.shape}")  # (128, 256)
print(f"w.grad shape: {w.grad.data.shape}")  # (256, 512)

What happens under the hood:

  1. Auto-connection: GT tries to connect to daemon at localhost:29501
  2. Auto-spawn: If no daemon found, spawns local server automatically
  3. Command streaming: Operations are sent as commands to server
  4. Lazy execution: Server schedules and executes when .data is accessed
  5. Gradient generation: backward() walks graph and creates gradient ops
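
The snippet below illustrates the lazy-execution behavior using only the API from the Quick Start; the comments describe the intended behavior, not internal details.

import gt

a = gt.randn(64, 64)
b = gt.randn(64, 64)

# These lines only record and send commands; nothing is forced to run yet
c = a @ b
d = gt.relu(c)

# Accessing .data makes the server schedule and execute the pending
# operations, then returns the materialized result to the client
print(d.data.shape)  # (64, 64)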

Installation

# Basic installation (NumPy backend)
pip install -e .

# With PyTorch backend (GPU support)
pip install -e ".[torch]"

# Development tools
pip install -e ".[dev]"

# For running examples
pip install -e ".[examples]"

Running as Daemon (Primary Use Case)

Start GT Server

# Start daemon with default settings (8 workers)
gt-server &

# Custom worker count
gt-server --workers 4 &

# CPU-only mode (no GPU workers)
gt-server --no-workers &

# Remote access (bind to all interfaces)
gt-server --host 0.0.0.0 --port 29501 &

Clients Auto-Connect

import gt

# No explicit connection needed!
# GT automatically:
# 1. Tries to connect to localhost:29501
# 2. If not found, spawns local server
# 3. Starts executing operations

x = gt.randn(1024, 1024)
y = x @ x
print(y.data.shape)  # (1024, 1024)

Multi-User Sharing

GT is designed for multiple users sharing GPUs:

# Start server once (with N GPUs = N workers)
gt-server &

# Multiple users connect automatically
python examples/multi_user.py user1 &   # Client 1
python examples/multi_user.py user2 &   # Client 2
python examples/multi_user.py user3 &   # Client 3

Architecture in action:

  • 1 Server: Receives commands from all clients, schedules work
  • N Workers: One worker per GPU (Worker 0 → GPU 0, Worker 1 → GPU 1, etc.)
  • M Clients: Multiple users share the GPU pool fairly

Each client gets fair scheduling across available GPUs.

Remote Connection

import gt

# Explicitly connect to remote server
gt.connect(host='gpu-server.example.com', port=29501)

# Use normally
x = gt.randn(100, 100)
y = x @ x
print(y.data)

# Disconnect when done
gt.disconnect()

Multi-Node Workers

GT supports distributed workers across multiple nodes:

Start Server on Head Node

# On head node (has access to clients)
gt-server --host 0.0.0.0 --port 29501 &

Start Workers on GPU Nodes

# On gpu-node-1
gt-worker --server head-node:29501 --gpu 0 &

# On gpu-node-2
gt-worker --server head-node:29501 --gpu 0 &
gt-worker --server head-node:29501 --gpu 1 &

# On gpu-node-3 (with custom worker ID)
gt-worker --server head-node:29501 --gpu 0 --worker-id node3-gpu0 &

Workers connect to the server via TCP and register themselves as available for work.

API Reference

Tensor Creation

# Random tensors
x = gt.randn(m, n, requires_grad=True)

# Ones
y = gt.ones(m, n)

# From data (numpy array or list)
z = gt.tensor([[1, 2], [3, 4]], requires_grad=True)

Operations

# Matrix operations
y = x @ w              # Matrix multiplication
z = x + y              # Element-wise addition
z = x - y              # Element-wise subtraction
z = x * y              # Element-wise multiplication

# Activations
z = gt.relu(x)         # ReLU activation

# Reductions
loss = gt.mean(x)      # Mean reduction

# Loss functions
loss = gt.mse_loss(pred, target)  # MSE loss

# Utilities
y = gt.transpose(x)    # Matrix transpose

Autograd

# Enable gradients
x = gt.randn(m, n, requires_grad=True)
w = gt.randn(n, k, requires_grad=True)

# Forward pass
y = x @ w
loss = gt.mean(y)

# Backward pass (client-side graph walk, generates gradient ops)
loss.backward()

# Access gradients
print(x.grad.data)  # Gradient w.r.t. x
print(w.grad.data)  # Gradient w.r.t. w

Configuration

# Set compute backend
gt.set_backend('torch', device='cpu')    # PyTorch CPU
gt.set_backend('torch', device='cuda')   # PyTorch GPU
gt.set_backend('numpy')                  # NumPy

# Set logging verbosity (0-3)
gt.set_verbosity(2)

# Connection management
gt.connect(host='remote-host', port=29501)
gt.disconnect()

Examples

Basic Operations

import gt

gt.set_backend('torch', device='cpu')

# Matrix multiplication
x = gt.randn(4, 8)
w = gt.randn(8, 16)
y = x @ w
print(y.data.shape)  # (4, 16)

MLP Training

import gt

gt.set_backend('torch', device='cpu')

# Simple MLP
class MLP:
    def __init__(self):
        self.W1 = gt.randn(784, 128, requires_grad=True)
        self.W2 = gt.randn(128, 10, requires_grad=True)

    def forward(self, x):
        h = gt.relu(x @ self.W1)
        return h @ self.W2

model = MLP()

# Training loop
for epoch in range(10):
    # Forward
    x = gt.randn(32, 784)  # Batch of 32
    target = gt.randn(32, 10)
    pred = model.forward(x)
    loss = gt.mse_loss(pred, target)

    # Backward
    loss.backward()

    print(f"Epoch {epoch}, Loss: {loss.data.item()}")

Full example: examples/train_mlp.py

Multi-User Sharing

# examples/multi_user.py
import gt
import sys

user_id = sys.argv[1] if len(sys.argv) > 1 else "user1"
gt.set_backend('torch', device='cpu')

print(f"[{user_id}] Building graph...")
x = gt.randn(4, 4)
w = gt.randn(4, 4)
y = x @ w

print(f"[{user_id}] Result: {y.data.shape}")

Run: python examples/multi_user.py user1 &

Architecture Details

Client-Side Autograd

The client maintains a local computation graph:

class Tensor:
    def __init__(self, node_id, input_tensors, grad_fn):
        self.node_id = node_id              # Unique ID for this node
        self.input_tensors = input_tensors  # Parent nodes (for backprop)
        self.grad_fn = grad_fn              # Function to compute gradients
        self._grad = None                   # Accumulated gradient

    def backward(self):
        """Walk graph in reverse topological order"""
        # 1. Build topological order
        # 2. Initialize output gradient
        # 3. Backpropagate: compute input grads from output grads
        # 4. Each gradient is a NEW operation sent to server

Key insight: Gradients are computed by creating new tensor operations that are sent to the server. The client doesn't do any actual computation - it just tracks the graph structure.
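
As a concrete sketch, here is what a grad_fn for y = x @ w could look like under this design: the gradients are expressed entirely with GT operations (matmul and transpose from the API Reference), so calling it streams new commands to the server just like the forward pass. The function name and signature are illustrative, not GT's internal API.

# Illustrative grad_fn for y = x @ w. Each line creates new GT tensors,
# i.e. new commands for the server, rather than computing numbers locally.
def matmul_grad(grad_output, x, w):
    grad_x = grad_output @ gt.transpose(w)   # dL/dx = dL/dy @ w^T
    grad_w = gt.transpose(x) @ grad_output   # dL/dw = x^T @ dL/dy
    return grad_x, grad_w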

Server-Side DAG Scheduling

The server maintains a dependency graph:

class ExecutorServer:
    def __init__(self):
        self.nodes = {}  # node_id -> {inputs, op_type, data, output}

    def _register_node(self, node_id, inputs, op_type, data):
        """Client registers a new operation"""
        self.nodes[node_id] = {
            'inputs': inputs,
            'op_type': op_type,
            'data': data,
            'executed': False
        }

    def _schedule_nodes(self):
        """Execute nodes when inputs are ready"""
        for node_id, node in self.nodes.items():
            if node['executed']:
                continue

            # Check if all inputs are ready
            inputs_ready = all(
                self.nodes[inp_id]['executed']
                for inp_id in node['inputs']
            )

            if inputs_ready:
                self._execute_node(node_id, node)

Tape-Based Scheduler

GT uses a tape system that makes scheduling completely observable:

Client Operations → Input Tape → Scheduler → Output Queue → Workers

Three layers:

  1. Input Tape: Raw operations as received from clients (append-only log)
  2. Output Queue: Scheduled operations with GPU placement + MOVE ops injected
  3. Handles: Where each tensor currently lives (GPU tracking)

Scheduling algorithm:

  • Leaf nodes (randn, ones): Round-robin across GPUs
  • Compute nodes (matmul, add): Run on GPU where most inputs already live
  • MOVE operations automatically injected when data on wrong GPU
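
A compact sketch of that placement rule, including MOVE injection (the helper and its data structures are illustrative; the real scheduler lives in gt/server/server.py):

from collections import Counter

# Illustrative placement for one operation from the input tape.
# `handles` maps tensor_id -> gpu_id where that tensor currently lives.
def place_op(op, handles, num_gpus, rr_counter):
    if not op['inputs']:
        # Leaf node (randn, ones): round-robin across GPUs
        return rr_counter % num_gpus, []

    # Compute node (matmul, add): run where most inputs already live
    counts = Counter(handles[i] for i in op['inputs'])
    target_gpu = counts.most_common(1)[0][0]

    # Inject MOVE ops for any input sitting on a different GPU
    moves = [('MOVE', i, handles[i], target_gpu)
             for i in op['inputs'] if handles[i] != target_gpu]
    return target_gpu, moves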

Why tapes?

  • Human-debuggable: Can print/inspect at any time
  • Testable: Verify constraints on output queue
  • Observable: See exactly what scheduler decided

Worker Protocol

Workers connect to the server via TCP and execute tasks asynchronously:

# Worker registration
{
    'type': 'worker_register',
    'worker_id': 'node1-gpu0',
    'gpu_id': 0,
    'device': 'cuda:0',
    'hostname': 'gpu-node-1'
}

# Task dispatch
{
    'type': 'task',
    'task_id': 123,
    'op_type': 'matmul',
    'inputs': [tensor1, tensor2],
    'gpu_id': 0
}

# Result
{
    'type': 'result',
    'task_id': 123,
    'worker_id': 'node1-gpu0',
    'output': result_tensor,
    'status': 'success'
}

Features:

  • Async execution: Workers execute independently
  • Round-robin selection: Distributes work across idle workers on target GPU
  • Fallback to local: Server executes locally if no workers available
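
For orientation, here is a minimal sketch of the worker side of this exchange. The length-prefixed pickle framing is an assumption made for the sketch; the real transport and message formats are defined in gt/server/transport.py and gt/server/protocol.py.

import pickle
import socket
import struct

def send_msg(sock, obj):
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack('!I', len(payload)) + payload)

def recv_msg(sock):
    header = _recv_exact(sock, 4)
    return pickle.loads(_recv_exact(sock, struct.unpack('!I', header)[0]))

def _recv_exact(sock, n):
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('server closed connection')
        buf += chunk
    return buf

def worker_loop(server_host, server_port, worker_id, gpu_id):
    sock = socket.create_connection((server_host, server_port))
    send_msg(sock, {'type': 'worker_register', 'worker_id': worker_id,
                    'gpu_id': gpu_id, 'device': f'cuda:{gpu_id}'})
    while True:
        msg = recv_msg(sock)
        if msg.get('type') != 'task':
            continue
        a, b = msg['inputs']
        # "Dumb" worker: just run the requested kernel on its inputs
        out = a @ b if msg['op_type'] == 'matmul' else a + b
        send_msg(sock, {'type': 'result', 'task_id': msg['task_id'],
                        'worker_id': worker_id, 'output': out,
                        'status': 'success'})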

Project Structure

gt-project/
├── gt/
│   ├── __init__.py           # Client API (Tensor, operations, autograd)
│   ├── server/
│   │   ├── server.py         # Server daemon (tape-based scheduler)
│   │   ├── client.py         # Client connection handler
│   │   ├── transport.py      # Transport abstraction (TCP/UCX/SHM)
│   │   └── protocol.py       # Wire protocol definitions
│   └── worker/
│       ├── engines/
│       │   ├── numpy.py      # NumPy backend
│       │   └── torch.py      # PyTorch backend
│       └── standalone.py     # Remote worker process
├── examples/
│   ├── basic.py              # Simple operations
│   ├── multi_user.py         # Multi-user sharing
│   ├── train_mlp.py          # MLP training with autograd
│   ├── test_tape_demo.py     # Tape system demo
│   ├── test_tape_multi_gpu.py  # Multi-GPU with MOVE ops
│   └── test_scheduler_debug.py # Debug mode with slow ticks
├── scripts/
│   └── visualize_trace.py    # Debug trace visualizer
├── tests/
│   ├── test_basic.py         # Basic operation tests
│   ├── test_execution.py     # Async execution tests
│   ├── test_worker_dispatch.py  # Worker dispatch tests
│   └── test_tape_scheduling.py  # Scheduler constraint tests
├── docs/
│   ├── TAPE_SYSTEM.md        # Tape-based scheduler architecture
│   ├── DEBUG_MODE.md         # Debug mode documentation
│   ├── TESTING_TAPE_SYSTEM.md  # Testing strategies
│   └── TRANSPORT.md          # Transport layer design
└── pyproject.toml            # Package configuration

Design Philosophy

  1. Client simplicity: No computation on client - just graph tracking
  2. Server intelligence: All optimization, compilation, scheduling happens here
  3. Worker dumbness: Workers just execute kernels - no decisions
  4. Separation of concerns:
    • Autograd = client (just math/graph state)
    • Compilation = server (smart stuff)
    • Execution = workers (dumb kernel calls)

Debugging & Observability

GT includes a comprehensive debug mode for observing scheduler behavior:

Enable Debug Mode

from gt.server.server import ExecutorServer

server = ExecutorServer(
    debug=True,
    debug_dir='/tmp/gt_debug'
)
server.tick_rate_ms = 100  # Slow ticks for visibility
server.start()

Inspect Execution

# Print tick-by-tick trace
server.print_tick_trace()

# Print tape system state
server.print_input_tape()     # What clients sent
server.print_output_queue()   # How server scheduled it
server.print_handles()        # Where data lives

# Save complete trace to JSON
trace_file = server.save_debug_trace()
# /tmp/gt_debug/trace_1234567890.json

Visualize Traces

# Run debug test with slow ticks
python examples/test_scheduler_debug.py

# Visualize saved trace
python scripts/visualize_trace.py /tmp/gt_debug/trace_*.json

Output shows:

  • Tick-by-tick execution (when each node executed)
  • Input tape (raw client operations)
  • Output queue (scheduled operations with GPU placement)
  • MOVE operations injected for cross-GPU data transfer
  • Worker busy/idle status at each tick

Use cases:

  • Understand scheduling decisions
  • Debug performance issues
  • Verify correctness
  • Test scheduler changes
  • Educational: see how distributed schedulers work

See docs/DEBUG_MODE.md for complete documentation.

Future Work

Already implemented (described above):

  • Tape-based scheduler with GPU affinity
  • Worker dispatch with async execution
  • Debug mode with tick-by-tick tracing

Planned:

  • Server-side graph optimization (fusion, reordering)
  • Multi-GPU operation splitting (tensor parallelism)
  • Smart worker selection and load balancing
  • Persistent computation caching
  • Gradient checkpointing
  • Mixed precision training
  • UCX transport with RDMA support

License

MIT

Contributing

Contributions welcome! Key areas:

  1. Server optimization: Graph fusion, operator reordering
  2. Worker efficiency: Better kernel implementations
  3. Scheduling: Smarter work distribution
  4. Operations: More ops (conv2d, pooling, etc.)
