GT

An experimental multiplexing tensor framework for distributed GPU computing.

pip install git+https://github.com/bwasti/gt.git
python -c 'import gt; print(gt.randn(2,2))'

General Idea

This project is motivated by a rejection of the clunky lock-step paradigm ML researchers tend to use. GT pulls in ideas from decades of development on multi-core operating systems: it fully embraces dynamic scheduling and heavily asynchronous execution while presenting a familiar eager frontend.

  • Three components
    • N Γ— clients (as many users as you want!)
    • 1 Γ— dispatcher (for coordinating)
    • N Γ— workers (1 per GPU)
  • Everything communicates with a stream of instructions
    • Clients deal with math. They emit (GPU-unaware) pure functional instructions
    • The dispatcher rewrites these instructions on the fly to be GPU-aware and sends them to the workers
    • Workers asynchronously process these instructions, optionally JIT compiling
  • Instruction streams are annotated
    • Clients can send "signals" which allow the dispatcher to more appropriately shard the tensors
    • Dispatchers annotate "hot" paths to give hints to workers about JIT compiling
    • Annotations are supplemented with YAML configs that specify sharding and compilation information
    • Every annotation can be safely ignored, so the same code can run anywhere (just remove the YAML)

Philosophy

  • Small is beautiful.
  • Make each program do one thing well.
  • Build a prototype as soon as possible.
  • Choose portability over efficiency.
  • Store data in flat text files.
  • Use software leverage to your advantage.
  • Use shell scripts to increase leverage and portability.
  • Avoid captive user interfaces.
  • Make every program a filter.

Quick Start

import gt

a = gt.randn(1000, 1000)
b = gt.randn(1000, 1000)
c = a @ b
result = c[:4, :4]
print(result)

It may not look like it, but in the background GT automatically spins up an asynchronous dispatching server and a GPU worker.

Features

  • High-performance transport - ZeroMQ (ZMQ) with automatic message batching and efficient DEALER/ROUTER pattern
  • Autograd support - Tape-based automatic differentiation exclusively at the client layer
  • PyTorch-compatible API - Familiar syntax for tensor operations
  • Signal-based sharding - Declarative YAML configuration for distributed training
  • Real-time monitoring - htop-style visualization of worker activity
  • Instruction logging - Debug distributed execution with timeline visualizations
  • AI-assisted development - Optimized for collaboration with AI coding assistants

Documentation

📚 Read the full documentation

  • Getting Started
  • Client API
  • Distributed Training
  • Workers
  • Contributing

Examples

See the examples/ directory for demonstrations:

  • demo.py - Basic tensor operations
  • signal_demo.py - Signal-based sharding
  • compile_demo.py - Compilation directives
  • debug_demo.py - Debug utilities
  • visualize_demo.py - Instruction tape visualization

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                          User Code                              │
│  import gt                                                      │
│  with gt.signal.context('layer1'):                              │
│      x = gt.randn(100, 64)                                      │
│      loss = model(x)                                            │
│      loss.backward()                                            │
└───────────────────────┬─────────────────────────────────────────┘
                        │ PyTorch-like API + Signal Metadata
                        │
┌───────────────────────▼─────────────────────────────────────────┐
│                      gt/client/                                 │
│  ┌──────────────┐  ┌─────────────┐  ┌──────────────┐            │
│  │   Tensor     │  │  Autograd   │  │  nn.Module   │            │
│  │ (Remote Data)│  │   (Tape)    │  │  (Layers)    │            │
│  └──────────────┘  └─────────────┘  └──────────────┘            │
└───────────────────────┬─────────────────────────────────────────┘
                        │ ZMQ (DEALER → ROUTER)
                        │
┌───────────────────────▼─────────────────────────────────────────┐
│                    gt/dispatcher/                               │
│  • ZMQ ROUTER socket handles all connections                    │
│  • Reads signal configs from YAML                               │
│  • Routes operations based on sharding strategy                 │
│  • Logs instruction stream to file                              │
│  • Handles multiple clients concurrently                        │
└───────┬──────────────┬──────────────┬───────────────────────────┘
        │              │              │ ZMQ (DEALER ← ROUTER)
        │              │              │
    ┌───▼────┐    ┌───▼────┐    ┌───▼────┐
    │Worker 0│    │Worker 1│    │Worker N│ (1 per GPU)
    │PyTorch │    │PyTorch │    │PyTorch │
    │  GPU   │    │  GPU   │    │  GPU   │
    └────────┘    └────────┘    └────────┘
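
The DEALER/ROUTER hop is plain ZeroMQ. Below is a minimal pyzmq sketch of the pattern itself, not GT's actual wire protocol; the address, identity, and payload are illustrative.

import zmq

ctx = zmq.Context()

# Dispatcher side: one ROUTER socket multiplexes all clients and workers.
router = ctx.socket(zmq.ROUTER)
router.bind("tcp://127.0.0.1:5555")

# Client side: a DEALER socket with an explicit identity.
dealer = ctx.socket(zmq.DEALER)
dealer.setsockopt(zmq.IDENTITY, b"client-0")
dealer.connect("tcp://127.0.0.1:5555")

dealer.send(b"randn 2 2")              # schematic instruction
ident, msg = router.recv_multipart()   # ROUTER prepends the sender identity
router.send_multipart([ident, b"ok"])  # reply is routed back by identity
print(dealer.recv())                   # b"ok"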

Optimized for AI Development

GT is designed to be understood, modified, and debugged with AI coding assistants:

  • CLAUDE.md - Detailed architecture documentation for AI assistants
  • Declarative YAML configs - Easy for AI to parse and generate
  • Tape-based debugging - Inspect computation graphs with gt.debug.print_tape()
  • Instruction logging - Track every operation with timestamps
  • Comprehensive test suite - 50+ tests serving as executable specifications

Contributing

Contributions welcome! This is a research prototype focused on simplicity and readability.

See the Contributing Guide for the development workflow, testing, code style, and PR guidelines.

License

MIT

See License for details.