GT
An experimental multiplexing tensor framework for distributed GPU computing.
```bash
pip install git+https://github.com/bwasti/gt.git
python -c 'import gt; print(gt.randn(2,2))'
```
General Idea
This project is motivated by a rejection of the clunky lock-step paradigm common in ML research. GT borrows ideas from decades of development on multi-core operating systems: it fully embraces dynamic scheduling and heavily asynchronous execution while presenting a familiar eager frontend.
- Three components
  - N × clients (as many users as you want!)
  - 1 × dispatcher (for coordinating)
  - N × workers (1 per GPU)
- Everything communicates with a stream of instructions
  - Clients deal with math. They emit (GPU-unaware) purely functional instructions
  - The dispatcher rewrites these instructions on the fly to be GPU-aware and sends them to the workers
  - Workers asynchronously process these instructions, optionally JIT compiling
- Instruction streams are annotated
  - Clients can send "signals" which allow the dispatcher to shard tensors more appropriately
  - The dispatcher annotates "hot" paths to give workers hints about JIT compiling
  - Annotations are supplemented with YAML configs that specify sharding and compilation information (see the sketch after this list)
  - Every annotation can be safely ignored, so the same code can run anywhere (just remove the YAML)
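To make the annotation flow concrete, here is a minimal sketch from the client's point of view, using the gt.signal.context call shown in the architecture diagram below. The YAML keys in the comment are illustrative assumptions, not a documented schema.

```python
import gt

# The dispatcher matches signal names against a YAML config to choose a
# sharding strategy. The schema sketched below is hypothetical:
#
#   signals:
#     layer1:
#       shard: row      # hypothetical key: split this tensor across workers
#       compile: true   # hypothetical key: hint workers to JIT this region

with gt.signal.context('layer1'):
    x = gt.randn(100, 64)  # the dispatcher may shard this across workers

# With no YAML present, the annotation is ignored and the same code runs
# unchanged on a single worker.
```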
Philosophy
- Small is beautiful.
- Make each program do one thing well.
- Build a prototype as soon as possible.
- Choose portability over efficiency.
- Store data in flat text files.
- Use software leverage to your advantage.
- Use shell scripts to increase leverage and portability.
- Avoid captive user interfaces.
- Make every program a filter.
Quick Start
```python
import gt

a = gt.randn(1000, 1000)
b = gt.randn(1000, 1000)
c = a @ b

result = c[:4, :4]
print(result)
```
It may not look like it, but in the background GT automatically spins up an asynchronous dispatching server and GPU worker.
Features
- High-performance transport - ZeroMQ (ZMQ) with automatic message batching and an efficient DEALER/ROUTER pattern
- Autograd support - Tape-based automatic differentiation, handled entirely at the client layer (see the sketch after this list)
- PyTorch-compatible API - Familiar syntax for tensor operations
- Signal-based sharding - Declarative YAML configuration for distributed training
- Real-time monitoring - htop-style visualization of worker activity
- Instruction logging - Debug distributed execution with timeline visualizations
- AI-assisted development - Optimized for collaboration with AI coding assistants
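A minimal sketch of what client-side autograd looks like in practice. gt.randn, backward(), and gt.debug.print_tape() all appear elsewhere in this README; the sum() reduction is an assumption based on the PyTorch-compatible API.

```python
import gt

# The autograd tape is recorded on the client; workers only ever see
# the resulting instruction stream.
x = gt.randn(32, 64)
w = gt.randn(64, 8)

loss = (x @ w).sum()  # assumption: a PyTorch-style sum() reduction exists
loss.backward()       # gradients computed from the client-side tape

gt.debug.print_tape()  # inspect the recorded computation graph
```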
Documentation
Read the full documentation
Getting Started
- Installation - Install GT and verify setup
- Usage Guide - Auto-server mode and distributed setup
Client API
- Tensor Operations - Complete operation reference
- Autograd - Automatic differentiation
Distributed Training
- Signal-Based Sharding - Configure parallelism strategies
- Tuning & Performance - Optimize performance
- Monitoring Tools - Real-time monitoring and debugging
Workers
- Backends - PyTorch and NumPy backends
- Compilation - JIT compilation with torch.compile
Contributing
- Contributing Guide - Development workflow, testing, and PR guidelines
Examples
See the examples/ directory for demonstrations:
- demo.py - Basic tensor operations
- signal_demo.py - Signal-based sharding
- compile_demo.py - Compilation directives
- debug_demo.py - Debug utilities
- visualize_demo.py - Instruction tape visualization
Architecture
```
┌───────────────────────────────────────────────────────────┐
│                         User Code                         │
│   import gt                                               │
│   with gt.signal.context('layer1'):                       │
│       x = gt.randn(100, 64)                               │
│   loss = model(x)                                         │
│   loss.backward()                                         │
└─────────────────────┬─────────────────────────────────────┘
                      │ PyTorch-like API + Signal Metadata
                      │
┌─────────────────────▼─────────────────────────────────────┐
│                        gt/client/                         │
│   ┌──────────────┐  ┌─────────────┐  ┌──────────────┐     │
│   │    Tensor    │  │  Autograd   │  │  nn.Module   │     │
│   │ (Remote Data)│  │   (Tape)    │  │   (Layers)   │     │
│   └──────────────┘  └─────────────┘  └──────────────┘     │
└─────────────────────┬─────────────────────────────────────┘
                      │ ZMQ (DEALER ↔ ROUTER)
                      │
┌─────────────────────▼─────────────────────────────────────┐
│                      gt/dispatcher/                       │
│   • ZMQ ROUTER socket handles all connections             │
│   • Reads signal configs from YAML                        │
│   • Routes operations based on sharding strategy          │
│   • Logs instruction stream to file                       │
│   • Handles multiple clients concurrently                 │
└───────┬──────────────┬──────────────┬─────────────────────┘
        │              │              │  ZMQ (DEALER ↔ ROUTER)
        │              │              │
   ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
   │Worker 0 │    │Worker 1 │    │Worker N │   (1 per GPU)
   │PyTorch  │    │PyTorch  │    │PyTorch  │
   │  GPU    │    │  GPU    │    │  GPU    │
   └─────────┘    └─────────┘    └─────────┘
```
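To ground the dispatcher's role, here is a hedged sketch of the rewrite step described above: a GPU-unaware instruction fanned out to per-GPU workers. The instruction format and field names are invented for illustration and are not GT's actual wire format.

```python
# Hypothetical illustration of the dispatcher's rewrite step. Field names
# and structure are invented for this sketch; they are not GT's wire format.

client_instr = {"op": "matmul", "inputs": ["t1", "t2"], "out": "t3"}

def rewrite_for_workers(instr, shard_config):
    """Fan one GPU-unaware instruction out to per-GPU workers."""
    workers = shard_config.get(instr["out"], [0])  # default: worker 0 only
    return [
        {**instr, "worker": w, "shard": (i, len(workers))}
        for i, w in enumerate(workers)
    ]

# Shard the output across two workers (illustrative config).
for instr in rewrite_for_workers(client_instr, {"t3": [0, 1]}):
    print(instr)
```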
Optimized for AI Development
GT is designed to be understood, modified, and debugged with AI coding assistants:
- CLAUDE.md - Detailed architecture documentation for AI assistants
- Declarative YAML configs - Easy for AI to parse and generate
- Tape-based debugging - Inspect computation graphs with gt.debug.print_tape()
- Instruction logging - Track every operation with timestamps
- Comprehensive test suite - 50+ tests serving as executable specifications
Contributing
Contributions welcome! This is a research prototype focused on simplicity and readability.
See Contributing Guide for development workflow, testing, code style, and PR guidelines.
License
MIT
See License for details.
