Differentiable ML with GPU multiplexing across users and nodes
GT is a client-server system for sharing GPU compute across multiple users. It features client-side autograd, server-side scheduling, and distributed workers.
GT uses a clean 3-tier architecture with multi-client support:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Client 1 │ │ Client 2 │ │ Client N │ (Multiple users)
│ (Simple) │ │ (Simple) │ │ (Simple) │
│ │ │ │ │ │
│ • Autograd │ │ • Autograd │ │ • Autograd │
│ tape │ │ tape │ │ tape │
│ • backward()│ │ • backward()│ │ • backward()│
│ • Send cmds │ │ • Send cmds │ │ • Send cmds │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└─────────────────┼─────────────────┘
▼
┌───────────────────────────────────────┐
│ SERVER (Smart, Single) │
│ │
│ • Receives commands from all clients │
│ • Builds DAG per client session │
│ • Schedules execution across workers │
│ • Multiplexes GPUs across users │
│ • Future: fusion, multi-GPU split │
└───────────────┬───────────────────────┘
│
┌───────────────┼───────────────────────┐
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ...
│Worker 0 │ │Worker 1 │ │Worker 2 │ (1 per GPU)
│(Dumb) │ │(Dumb) │ │(Dumb) │
│ │ │ │ │ │
│GPU 0 │ │GPU 1 │ │GPU 2 │
│cuda:0 │ │cuda:1 │ │cuda:2 │
└─────────┘ └─────────┘ └─────────┘
- Multiple clients, single server: Many users share one GT server daemon
- One worker per GPU: Each GPU gets a dedicated worker process
- Client is dirt simple: Just creates command descriptors and tracks autograd tape
- Autograd is client-side: Graph construction happens locally, execution on server
- Server is smart: Compilation, optimization, scheduling, and multi-user multiplexing
- Workers are dumb: Just execute kernels when told
import gt
# Set backend (numpy or torch)
gt.set_backend('torch', device='cpu')
# Build computation graph
x = gt.randn(128, 256, requires_grad=True)
w = gt.randn(256, 512, requires_grad=True)
y = x @ w
# Forward pass - commands sent to server
result = y.data # shape: (128, 512)
# Backward pass - client generates gradient operations
y.backward()
# Gradients computed on server, available on client
print(f"x.grad shape: {x.grad.data.shape}") # (128, 256)
print(f"w.grad shape: {w.grad.data.shape}") # (256, 512)What happens under the hood:
- Auto-connection: GT tries to connect to daemon at
localhost:29501 - Auto-spawn: If no daemon found, spawns local server automatically
- Command streaming: Operations are sent as commands to server
- Lazy execution: Server schedules and executes when
.datais accessed - Gradient generation:
backward()walks graph and creates gradient ops
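Roughly, the auto-connect fallback works like the sketch below (illustrative only, not GT's actual code; the socket probing and the connect_or_spawn name are assumptions):

import socket
import subprocess
import time

def connect_or_spawn(host="localhost", port=29501, timeout=1.0):
    """Return once a gt-server daemon is reachable, spawning one locally if needed."""
    def daemon_reachable():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if not daemon_reachable():
        # No daemon found: start a local server in the background
        subprocess.Popen(["gt-server"])
        # Give the new daemon a moment to bind its port
        for _ in range(20):
            if daemon_reachable():
                break
            time.sleep(0.25)
    return host, port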
# Basic installation (NumPy backend)
pip install -e .
# With PyTorch backend (GPU support)
pip install -e ".[torch]"
# Development tools
pip install -e ".[dev]"
# For running examples
pip install -e ".[examples]"# Start daemon with default settings (8 workers)
gt-server &
# Custom worker count
gt-server --workers 4 &
# CPU-only mode (no GPU workers)
gt-server --no-workers &
# Remote access (bind to all interfaces)
gt-server --host 0.0.0.0 --port 29501 &

import gt
# No explicit connection needed!
# GT automatically:
# 1. Tries to connect to localhost:29501
# 2. If not found, spawns local server
# 3. Starts executing operations
x = gt.randn(1024, 1024)
y = x @ x
print(y.data.shape)  # (1024, 1024)

GT is designed for multiple users sharing GPUs:
# Start server once (with N GPUs = N workers)
gt-server &
# Multiple users connect automatically
python examples/multi_user.py user1 & # Client 1
python examples/multi_user.py user2 & # Client 2
python examples/multi_user.py user3 &  # Client 3

Architecture in action:
- 1 Server: Receives commands from all clients, schedules work
- N Workers: One worker per GPU (Worker 0 → GPU 0, Worker 1 → GPU 1, etc.)
- M Clients: Multiple users share the GPU pool fairly
Each client gets fair scheduling across available GPUs.
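One simple way to picture that fairness is a round-robin pass over per-client queues (an illustrative sketch; the queue layout and names below are assumptions, not GT's actual scheduler):

from collections import deque

def fair_schedule_tick(client_queues, idle_gpus):
    """One scheduling tick: dequeue at most one pending op per client and
    assign it to an idle GPU, so no single client can monopolize the pool."""
    assignments = []
    gpus = deque(idle_gpus)
    for client_id, queue in client_queues.items():
        if not queue or not gpus:
            continue
        assignments.append((client_id, queue.popleft(), gpus.popleft()))
    return assignments

# Example: two clients, two idle GPUs -> each client gets one op scheduled
queues = {"user1": deque(["matmul"]), "user2": deque(["add", "relu"])}
print(fair_schedule_tick(queues, idle_gpus=[0, 1]))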
import gt
# Explicitly connect to remote server
gt.connect(host='gpu-server.example.com', port=29501)
# Use normally
x = gt.randn(100, 100)
y = x @ x
print(y.data)
# Disconnect when done
gt.disconnect()

GT supports distributed workers across multiple nodes:
# On head node (has access to clients)
gt-server --host 0.0.0.0 --port 29501 &

# On gpu-node-1
gt-worker --server head-node:29501 --gpu 0 &
# On gpu-node-2
gt-worker --server head-node:29501 --gpu 0 &
gt-worker --server head-node:29501 --gpu 1 &
# On gpu-node-3 (with custom worker ID)
gt-worker --server head-node:29501 --gpu 0 --worker-id node3-gpu0 &

Workers connect to the server via TCP and register themselves as available for work.
# Random tensors
x = gt.randn(m, n, requires_grad=True)
# Ones
y = gt.ones(m, n)
# From data (numpy array or list)
z = gt.tensor([[1, 2], [3, 4]], requires_grad=True)

# Matrix operations
y = x @ w # Matrix multiplication
z = x + y # Element-wise addition
z = x - y # Element-wise subtraction
z = x * y # Element-wise multiplication
# Activations
z = gt.relu(x) # ReLU activation
# Reductions
loss = gt.mean(x) # Mean reduction
# Loss functions
loss = gt.mse_loss(pred, target) # MSE loss
# Utilities
y = gt.transpose(x)  # Matrix transpose

# Enable gradients
x = gt.randn(m, n, requires_grad=True)
w = gt.randn(n, k, requires_grad=True)
# Forward pass
y = x @ w
loss = gt.mean(y)
# Backward pass (client-side graph walk, generates gradient ops)
loss.backward()
# Access gradients
print(x.grad.data) # Gradient w.r.t. x
print(w.grad.data)  # Gradient w.r.t. w

# Set compute backend
gt.set_backend('torch', device='cpu') # PyTorch CPU
gt.set_backend('torch', device='cuda') # PyTorch GPU
gt.set_backend('numpy') # NumPy
# Set logging verbosity (0-3)
gt.set_verbosity(2)
# Connection management
gt.connect(host='remote-host', port=29501)
gt.disconnect()

import gt
gt.set_backend('torch', device='cpu')
# Matrix multiplication
x = gt.randn(4, 8)
w = gt.randn(8, 16)
y = x @ w
print(y.data.shape)  # (4, 16)

import gt
gt.set_backend('torch', device='cpu')
# Simple MLP
class MLP:
    def __init__(self):
        self.W1 = gt.randn(784, 128, requires_grad=True)
        self.W2 = gt.randn(128, 10, requires_grad=True)

    def forward(self, x):
        h = gt.relu(x @ self.W1)
        return h @ self.W2

model = MLP()

# Training loop
for epoch in range(10):
    # Forward
    x = gt.randn(32, 784)        # Batch of 32
    target = gt.randn(32, 10)
    pred = model.forward(x)
    loss = gt.mse_loss(pred, target)

    # Backward
    loss.backward()

    # (parameter update step omitted here for brevity; see the full example)
    print(f"Epoch {epoch}, Loss: {loss.data.item()}")

Full example: examples/train_mlp.py
# examples/multi_user.py
import gt
import sys
user_id = sys.argv[1] if len(sys.argv) > 1 else "user1"
gt.set_backend('torch', device='cpu')
print(f"[{user_id}] Building graph...")
x = gt.randn(4, 4)
w = gt.randn(4, 4)
y = x @ w
print(f"[{user_id}] Result: {y.data.shape}")Run: python examples/multi_user.py user1 &
The client maintains a local computation graph:
class Tensor:
    def __init__(self, node_id, input_tensors, grad_fn):
        self.node_id = node_id              # Unique ID for this node
        self.input_tensors = input_tensors  # Parent nodes (for backprop)
        self.grad_fn = grad_fn              # Function to compute gradients
        self._grad = None                   # Accumulated gradient

    def backward(self):
        """Walk graph in reverse topological order"""
        # 1. Build topological order
        # 2. Initialize output gradient
        # 3. Backpropagate: compute input grads from output grads
        # 4. Each gradient is a NEW operation sent to server

Key insight: Gradients are computed by creating new tensor operations that are sent to the server. The client doesn't do any actual computation - it just tracks the graph structure.
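A minimal, runnable version of that walk might look like this (a sketch under the assumptions above; in particular it assumes grad_fn(output_grad) returns one gradient tensor per input, each of which is itself a new GT operation):

import gt

def backward(root):
    """Reverse-topological walk that emits gradient ops rather than computing numbers."""
    # 1. Build a topological order with a depth-first traversal
    order, visited = [], set()
    def visit(t):
        if t.node_id in visited:
            return
        visited.add(t.node_id)
        for parent in t.input_tensors:
            visit(parent)
        order.append(t)
    visit(root)

    # 2. Seed the output gradient (itself just another op sent to the server)
    grads = {root.node_id: gt.ones(*root.data.shape)}

    # 3. Walk in reverse, creating gradient ops for each node's inputs
    for t in reversed(order):
        out_grad = grads.get(t.node_id)
        if out_grad is None or t.grad_fn is None:
            continue
        for parent, g in zip(t.input_tensors, t.grad_fn(out_grad)):
            # Gradient accumulation is just an 'add' command streamed to the server
            grads[parent.node_id] = g if parent.node_id not in grads else grads[parent.node_id] + g
            parent._grad = grads[parent.node_id]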
The server maintains a dependency graph:
class ExecutorServer:
    def __init__(self):
        self.nodes = {}  # node_id -> {inputs, op_type, data, output}

    def _register_node(self, node_id, inputs, op_type, data):
        """Client registers a new operation"""
        self.nodes[node_id] = {
            'inputs': inputs,
            'op_type': op_type,
            'data': data,
            'executed': False
        }

    def _schedule_nodes(self):
        """Execute nodes when inputs are ready"""
        for node_id, node in self.nodes.items():
            if node['executed']:
                continue
            # Check if all inputs are ready
            inputs_ready = all(
                self.nodes[inp_id]['executed']
                for inp_id in node['inputs']
            )
            if inputs_ready:
                self._execute_node(node_id, node)

GT uses a tape system that makes scheduling completely observable:
Client Operations → Input Tape → Scheduler → Output Queue → Workers
Three layers:
- Input Tape: Raw operations as received from clients (append-only log)
- Output Queue: Scheduled operations with GPU placement + MOVE ops injected
- Handles: Where each tensor currently lives (GPU tracking)
Scheduling algorithm:
- Leaf nodes (randn, ones): Round-robin across GPUs
- Compute nodes (matmul, add): Run on GPU where most inputs already live
- MOVE operations automatically injected when data on wrong GPU
Why tapes?
- Human-debuggable: Can print/inspect at any time
- Testable: Verify constraints on output queue
- Observable: See exactly what scheduler decided
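Under those rules, placing a single operation can be sketched like this (illustrative only; the handles dict, op layout, and place_op name are assumptions, not GT's real data structures):

from collections import Counter
from itertools import count

_round_robin = count()  # leaf placement counter

def place_op(op, handles, num_gpus, output_queue):
    """Choose a GPU for one op and inject MOVE ops for inputs on the wrong GPU.
    handles maps tensor id -> GPU where that tensor currently lives."""
    if not op["inputs"]:
        # Leaf node (randn, ones): round-robin across GPUs
        gpu = next(_round_robin) % num_gpus
    else:
        # Compute node (matmul, add): run where most inputs already live
        gpu = Counter(handles[i] for i in op["inputs"]).most_common(1)[0][0]
        for i in op["inputs"]:
            if handles[i] != gpu:
                output_queue.append({"op_type": "move", "tensor": i,
                                     "src": handles[i], "dst": gpu})
                handles[i] = gpu
    handles[op["id"]] = gpu
    output_queue.append({**op, "gpu_id": gpu})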
Workers connect to the server via TCP and execute tasks asynchronously:
# Worker registration
{
    'type': 'worker_register',
    'worker_id': 'node1-gpu0',
    'gpu_id': 0,
    'device': 'cuda:0',
    'hostname': 'gpu-node-1'
}

# Task dispatch
{
    'type': 'task',
    'task_id': 123,
    'op_type': 'matmul',
    'inputs': [tensor1, tensor2],
    'gpu_id': 0
}

# Result
{
    'type': 'result',
    'task_id': 123,
    'worker_id': 'node1-gpu0',
    'output': result_tensor,
    'status': 'success'
}

Features:
- Async execution: Workers execute independently
- Round-robin selection: Distributes work across idle workers on target GPU
- Fallback to local: Server executes locally if no workers available
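In terms of those messages, a remote worker's main loop might look roughly like the sketch below (assuming a line-delimited JSON framing for readability; run_kernel and the exact wire format are placeholders, not GT's real transport):

import json
import socket

def worker_loop(server_host, server_port, worker_id, gpu_id, run_kernel):
    """Register with the server, then execute dispatched tasks until the connection closes."""
    sock = socket.create_connection((server_host, server_port))
    stream = sock.makefile("rw")

    # Announce this worker as available for work
    stream.write(json.dumps({"type": "worker_register", "worker_id": worker_id,
                             "gpu_id": gpu_id, "device": f"cuda:{gpu_id}",
                             "hostname": socket.gethostname()}) + "\n")
    stream.flush()

    for line in stream:
        msg = json.loads(line)
        if msg.get("type") != "task":
            continue
        try:
            output, status = run_kernel(msg["op_type"], msg["inputs"], gpu_id), "success"
        except Exception:
            output, status = None, "error"
        stream.write(json.dumps({"type": "result", "task_id": msg["task_id"],
                                 "worker_id": worker_id, "output": output,
                                 "status": status}) + "\n")
        stream.flush()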
gt-project/
├── gt/
│ ├── __init__.py # Client API (Tensor, operations, autograd)
│ ├── server/
│ │ ├── server.py # Server daemon (tape-based scheduler)
│ │ ├── client.py # Client connection handler
│ │ ├── transport.py # Transport abstraction (TCP/UCX/SHM)
│ │ └── protocol.py # Wire protocol definitions
│ └── worker/
│ ├── engines/
│ │ ├── numpy.py # NumPy backend
│ │ └── torch.py # PyTorch backend
│ └── standalone.py # Remote worker process
├── examples/
│ ├── basic.py # Simple operations
│ ├── multi_user.py # Multi-user sharing
│ ├── train_mlp.py # MLP training with autograd
│ ├── test_tape_demo.py # Tape system demo
│ ├── test_tape_multi_gpu.py # Multi-GPU with MOVE ops
│ └── test_scheduler_debug.py # Debug mode with slow ticks
├── scripts/
│ └── visualize_trace.py # Debug trace visualizer
├── tests/
│ ├── test_basic.py # Basic operation tests
│ ├── test_execution.py # Async execution tests
│ ├── test_worker_dispatch.py # Worker dispatch tests
│ └── test_tape_scheduling.py # Scheduler constraint tests
├── docs/
│ ├── TAPE_SYSTEM.md # Tape-based scheduler architecture
│ ├── DEBUG_MODE.md # Debug mode documentation
│ ├── TESTING_TAPE_SYSTEM.md # Testing strategies
│ └── TRANSPORT.md # Transport layer design
└── pyproject.toml # Package configuration
- Client simplicity: No computation on client - just graph tracking
- Server intelligence: All optimization, compilation, scheduling happens here
- Worker dumbness: Workers just execute kernels - no decisions
- Separation of concerns:
- Autograd = client (just math/graph state)
- Compilation = server (smart stuff)
- Execution = workers (dumb kernel calls)
GT includes comprehensive debug mode for observing scheduler behavior:
from gt.server.server import ExecutorServer
server = ExecutorServer(
    debug=True,
    debug_dir='/tmp/gt_debug'
)
server.tick_rate_ms = 100  # Slow ticks for visibility
server.start()

# Print tick-by-tick trace
server.print_tick_trace()
# Print tape system state
server.print_input_tape() # What clients sent
server.print_output_queue() # How server scheduled it
server.print_handles() # Where data lives
# Save complete trace to JSON
trace_file = server.save_debug_trace()
# /tmp/gt_debug/trace_1234567890.json

# Run debug test with slow ticks
python examples/test_scheduler_debug.py
# Visualize saved trace
python scripts/visualize_trace.py /tmp/gt_debug/trace_*.json

Output shows:
- Tick-by-tick execution (when each node executed)
- Input tape (raw client operations)
- Output queue (scheduled operations with GPU placement)
- MOVE operations injected for cross-GPU data transfer
- Worker busy/idle status at each tick
Use cases:
- Understand scheduling decisions
- Debug performance issues
- Verify correctness
- Test scheduler changes
- Educational: see how distributed schedulers work
See docs/DEBUG_MODE.md for complete documentation.
- Tape-based scheduler with GPU affinity
- Worker dispatch with async execution
- Debug mode with tick-by-tick tracing
- Server-side graph optimization (fusion, reordering)
- Multi-GPU operation splitting (tensor parallelism)
- Smart worker selection and load balancing
- Persistent computation caching
- Gradient checkpointing
- Mixed precision training
- UCX transport with RDMA support
MIT
Contributions welcome! Key areas:
- Server optimization: Graph fusion, operator reordering
- Worker efficiency: Better kernel implementations
- Scheduling: Smarter work distribution
- Operations: More ops (conv2d, pooling, etc.)