TinyTinyTPU
A minimal 2×2 systolic-array TPU-style matrix-multiply unit, implemented in SystemVerilog and deployed on FPGA.
This project implements a complete TPU architecture including:
- 2×2 systolic array (4 processing elements)
- Full post-MAC pipeline (accumulator, activation, normalization, quantization)
- UART-based host interface
- Multi-layer MLP inference capability
- FPGA deployment on Basys3 (Xilinx Artix-7)
Resource Usage (Basys3 XC7A35T):

- LUTs: ~1,000 (5% utilization)
- Flip-Flops: ~1,000 (3% utilization)
- DSP48E1: 8 slices
- BRAM: ~10-15 blocks
- Estimated Gate Count: ~25,000 gates
Table of Contents
- Project Overview
- Quick Start
- Simulation & Testing
- FPGA Build & Deployment
- Running Inference
- Project Structure
- Architecture Details
- Open Source Tooling (Yosys/nextpnr)
Project Overview
TinyTinyTPU is an educational implementation of Google's TPU architecture, scaled down to a 2×2 systolic array. It demonstrates:
- Systolic Array Architecture: Data flows horizontally (activations) and vertically (partial sums)
- Diagonal Wavefront Weight Loading: Staggered weight capture for proper systolic timing
- Full MLP Pipeline: Weight FIFO → MMU → Accumulator → Activation → Normalization → Quantization
- Multi-Layer Inference: Supports sequential layer processing with double-buffered activations
Design Philosophy
This is a minimal, educational-scale TPU designed for:
- Learning TPU architecture principles
- Understanding systolic array dataflow
- FPGA prototyping and experimentation
- Small-scale ML inference (2×2 matrices)
For production workloads, scale up the array size (e.g., 256×256 like Google TPU v1).
Quick Start
Prerequisites
For Simulation:
- Verilator 5.022 or later
- Python 3.8+
- cocotb
- GTKWave or Surfer (for waveform viewing)
For FPGA Build:
- Xilinx Vivado 2020.1 or later (for Basys3)
- OR Yosys + nextpnr (open source alternative, see Open Source Tooling)
For Running Inference:
- Basys3 FPGA board
- USB cable for programming
- Python 3.8+ with pyserial
Installation
# Clone the repository
git clone <repository-url>
cd tinytinyTPU-co

# Set up simulation environment
cd sim
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Simulation & Testing
Running Tests
All simulation commands must be run from the sim/ directory:
cd sim

# Run all tests
make test

# Run all tests with waveform generation
make test WAVES=1

# Run specific module tests
make test_pe
make test_mmu
make test_mlp
make test_uart
make test_tpu_system

# Run with waveforms
make test_pe WAVES=1
Test Coverage
| Test File | Module | Coverage |
|---|---|---|
| test_pe.py | Processing Element | Reset, MAC operations, weight capture |
| test_mmu.py | 2×2 Systolic Array | Weight loading, matrix multiply |
| test_weight_fifo.py | Weight FIFO | Push/pop, wraparound |
| test_dual_weight_fifo.py | Dual Weight FIFO | Column independence, skew timing |
| test_accumulator.py | Accumulator | Alignment, buffering, accumulate/overwrite modes |
| test_activation_func.py | Activation Function | ReLU positive/negative/zero cases |
| test_normalizer.py | Normalizer | Gain, bias, shift operations |
| test_activation_pipeline.py | Activation Pipeline | Full pipeline, saturation handling |
| test_mlp_integration.py | MLP Top | Multi-layer MLP inference |
| test_uart_controller.py | UART Controller | Command parsing, response generation |
| test_tpu_system.py | TPU Top | End-to-end system integration |
Viewing Waveforms
# List available waveforms
make waves

# Open specific waveform
make waves MODULE=pe
make waves MODULE=mmu
make waves MODULE=mlp_top
FPGA Build & Deployment
Hardware Connections
Basys3 Pinout:
- UART RX (B18): Receives commands from PC
- UART TX (A18): Sends responses to PC
- Clock: 100 MHz (onboard oscillator)
- Reset: Center button (BTNC, U18)
- LEDs: Status display (see fpga/README.md for LED modes)
UART Settings:
- Baud Rate: 115200
- Data Bits: 8
- Parity: None
- Stop Bits: 1
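As a quick sanity check on these settings, the sketch below works out how many 100 MHz clock cycles one UART bit lasts at 115200 baud and how far an integer divider drifts from the nominal rate. The function name `uart_divider` is hypothetical, and the divider actually used in uart_rx.sv and uart_tx.sv is not shown in this README, so treat the numbers as an illustration of the arithmetic only.

```python
from typing import Tuple

CLOCK_HZ = 100_000_000  # Basys3 onboard oscillator (from the pinout above)
BAUD_RATE = 115_200     # UART setting listed above


def uart_divider(clock_hz: int, baud: int) -> Tuple[int, float]:
    """Return (clocks_per_bit, relative_error) for an integer clock divider.

    Hypothetical helper: the RTL may derive its divider differently.
    """
    clocks_per_bit = round(clock_hz / baud)
    actual_baud = clock_hz / clocks_per_bit
    relative_error = (actual_baud - baud) / baud
    return clocks_per_bit, relative_error


clocks_per_bit, error = uart_divider(CLOCK_HZ, BAUD_RATE)
print(f"clocks per bit: {clocks_per_bit}")
print(f"baud error: {error:.4%}")
```

With these inputs the divider comes out to 868 clocks per bit, and the resulting baud error is far below the few percent that an 8N1 link typically tolerates.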
Running Inference
Python Host Interface
The project includes a Python driver for communicating with the FPGA:
cd host

# Basic inference demo
python3 inference_demo.py

# Gesture recognition demo (requires trained model)
python3 gesture_demo.py

# Interactive test
python3 test_tpu_driver.py
Inference Demo
The inference_demo.py script demonstrates:
- Loading weights into the TPU
- Loading input activations
- Executing inference
- Reading results
Example Usage:
from tpu_driver import TPUDriver

# Connect to FPGA (adjust port as needed)
tpu = TPUDriver('/dev/ttyUSB0')  # Linux
# tpu = TPUDriver('COM3')  # Windows

# Load 2×2 weight matrix
weights = [[1, 2], [3, 4]]
tpu.write_weights(weights)

# Load 2×2 activation matrix
activations = [[5, 6], [7, 8]]
tpu.write_activations(activations)

# Execute inference
tpu.execute()

# Read results
result = tpu.read_result()
print(f"Result: {result}")
Gesture Recognition Demo
The gesture_demo.py script implements a simple gesture classifier:
- Trains a 2-layer MLP on mouse movement data
- Classifies gestures as "Horizontal" or "Vertical"
- Real-time inference on FPGA
Running the Demo:
cd host
python3 gesture_demo.py
Model Training:
cd model
python3 train.py  # Generates: gesture_model.json
UART Protocol
The TPU uses a simple byte-based UART protocol:
Commands:
- 0x01: Write Weight (4 bytes: W00, W01, W10, W11)
- 0x02: Write Activation (4 bytes: A00, A01, A10, A11)
- 0x03: Execute (start inference)
- 0x04: Read Result (returns 4 bytes: acc0[31:0])
- 0x05: Read Result Column 1 (returns 4 bytes: acc1[31:0])
- 0x06: Read Status (returns 1 byte: state[3:0] | cycle_cnt[3:0])
See host/tpu_driver.py for full protocol implementation.
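To make the framing concrete, here is a minimal sketch of how a host might encode and decode these commands as raw bytes. It is not the code in host/tpu_driver.py. The helper names are hypothetical, and three details are assumptions that the command list does not state: matrix elements are signed 8-bit values, the 32-bit accumulator arrives little-endian, and the status byte carries the state in its upper nibble.

```python
import struct

# Opcodes taken from the command list above.
CMD_WRITE_WEIGHT = 0x01
CMD_WRITE_ACT = 0x02
CMD_EXECUTE = 0x03
CMD_READ_RESULT = 0x04
CMD_READ_RESULT_COL1 = 0x05
CMD_READ_STATUS = 0x06


def encode_matrix_command(opcode, matrix):
    """Build opcode + 4 payload bytes in row-major order (M00, M01, M10, M11).

    ASSUMPTION: each element is a signed int8; struct.pack raises
    struct.error if a value falls outside -128..127.
    """
    flat = [value for row in matrix for value in row]
    if len(flat) != 4:
        raise ValueError("expected a 2x2 matrix")
    return struct.pack("<B4b", opcode, *flat)


def decode_result(payload):
    """Decode a 4-byte accumulator reply.

    ASSUMPTION: little-endian, signed 32-bit.
    """
    if len(payload) != 4:
        raise ValueError("expected 4 bytes")
    return struct.unpack("<i", payload)[0]


def decode_status(byte):
    """Split the status byte into (state, cycle_cnt).

    ASSUMPTION: state occupies bits 7:4 and cycle_cnt bits 3:0.
    """
    return (byte >> 4) & 0xF, byte & 0xF


frame = encode_matrix_command(CMD_WRITE_WEIGHT, [[1, 2], [3, 4]])
print(frame.hex())
```

If the real driver uses a different byte order or nibble layout, only the format strings and shifts above need to change.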
Project Structure
tinytinyTPU-co/
├── rtl/ # SystemVerilog RTL source files
│ ├── pe.sv # Processing Element (MAC unit)
│ ├── mmu.sv # 2×2 Matrix Multiply Unit (systolic array)
│ ├── weight_fifo.sv # Single-column weight FIFO
│ ├── dual_weight_fifo.sv # Dual-column weight FIFO with skew
│ ├── accumulator.sv # Top-level accumulator
│ ├── accumulator_align.sv # Column alignment logic
│ ├── accumulator_mem.sv # Double-buffered accumulator memory
│ ├── activation_func.sv # ReLU/ReLU6 activation
│ ├── normalizer.sv # Gain/bias/shift normalization
│ ├── loss_block.sv # L1 loss computation
│ ├── activation_pipeline.sv # Full post-accumulator pipeline
│ ├── unified_buffer.sv # Ready/valid output FIFO
│ ├── mlp_top.sv # Top-level MLP integration
│ ├── tpu_bridge.sv # UART-to-MLP bridge
│ ├── uart_controller.sv # UART command processor
│ ├── uart_rx.sv # UART receiver
│ ├── uart_tx.sv # UART transmitter
│ └── tpu_top.sv # Complete TPU system
│
├── sim/ # Simulation environment
│ ├── Makefile # Build and test automation
│ ├── requirements.txt # Python dependencies
│ ├── tests/ # cocotb Python testbenches
│ │ ├── test_pe.py
│ │ ├── test_mmu.py
│ │ ├── test_weight_fifo.py
│ │ ├── test_dual_weight_fifo.py
│ │ ├── test_accumulator.py
│ │ ├── test_activation_func.py
│ │ ├── test_normalizer.py
│ │ ├── test_activation_pipeline.py
│ │ ├── test_mlp_integration.py
│ │ ├── test_uart_controller.py
│ │ └── test_tpu_system.py
│ └── waves/ # Generated VCD waveforms
│
├── fpga/ # FPGA deployment files
│ ├── basys3_top.sv # Top-level FPGA wrapper
│ ├── basys3.xdc # Pin constraints
│ ├── build_vivado.tcl # Automated build script
│ ├── basys3_top.bit # Generated bitstream
│ └── README.md # FPGA-specific documentation
│
├── host/ # Python host interface
│ ├── tpu_driver.py # TPU communication driver
│ ├── tpu_compiler.py # Model compilation utilities
│ ├── inference_demo.py # Basic inference demo
│ ├── gesture_demo.py # Gesture recognition demo
│ └── test_tpu_driver.py # Driver unit tests
│
├── model/ # ML model training
│ ├── train.py # Model training script
│ └── gesture_model.json # Trained model (JSON format)
│
└── README.md # This file
Architecture Details
Systolic Array Dataflow
PE00 -> PE01    Activations flow horizontally (right)
 |       |
PE10 -> PE11    Partial sums flow vertically (down)
 |       |
acc0    acc1    Outputs to accumulator
Weight Loading (Diagonal Wavefront):
- Cycle 0: W10 → col0, no capture
- Cycle 1: W00 → col0 (capture), W11 → col1 (no capture)
- Cycle 2: W01 → col1 (capture)
Activation Flow:
- Row 0: A00 → PE00 → PE01
- Row 1: A10 → PE10 → PE11 (with 1-cycle skew)
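The dataflow above can be explored with a small cycle-by-cycle behavioral model. This is a sketch, not a translation of mmu.sv: it assumes a weight-stationary array in which PE[r][c] already holds W[r][c], each PE registers both its activation and its partial sum every cycle, and row r of the activation stream is delayed by r cycles. The function name `systolic_matmul` is hypothetical, and the weight-loading wavefront is not modeled.

```python
def systolic_matmul(weights, activations):
    """Simulate an n x n weight-stationary systolic array cycle by cycle.

    weights[r][c] is the weight held by PE[r][c].
    activations[k] is the k-th input vector; element r enters row r.
    Returns out[k][c], the value leaving the bottom of column c for vector k.
    """
    n = len(weights)
    num_vecs = len(activations)
    act_reg = [[0] * n for _ in range(n)]   # activation each PE passes right
    psum_reg = [[0] * n for _ in range(n)]  # partial sum each PE passes down
    out = [[0] * n for _ in range(num_vecs)]

    # The last vector's last column emerges after cycle (num_vecs - 1) + 2 * (n - 1).
    for t in range(num_vecs + 2 * n - 2):
        new_act = [[0] * n for _ in range(n)]
        new_psum = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if c == 0:
                    k = t - r  # row r is skewed by r cycles
                    a_in = activations[k][r] if 0 <= k < num_vecs else 0
                else:
                    a_in = act_reg[r][c - 1]  # from the PE to the left
                p_in = psum_reg[r - 1][c] if r > 0 else 0  # from the PE above
                new_psum[r][c] = p_in + a_in * weights[r][c]
                new_act[r][c] = a_in
        act_reg, psum_reg = new_act, new_psum

        # Column c finishes vector k at the end of cycle k + (n - 1) + c.
        for c in range(n):
            k = t - (n - 1) - c
            if 0 <= k < num_vecs:
                out[k][c] = psum_reg[n - 1][c]
    return out


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

Under these assumptions each output row equals the input vector multiplied by the weight matrix, and column 1 trails column 0 by one cycle, which is the skew the accumulator's alignment logic has to remove.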
Pipeline Stages
- Weight FIFO: Stores weights, outputs with column skew
- MMU (Systolic Array): Matrix multiply-accumulate
- Accumulator: Aligns columns, double-buffered storage
- Activation Pipeline:
- Activation function (ReLU/ReLU6)
- Normalization (gain, bias, and shift operations)
- Quantization (int8 with saturation)
- Unified Buffer: Output FIFO with ready/valid handshaking
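The post-accumulator stages can be sketched as three small integer functions. The ordering inside the normalizer (multiply by gain, add bias, then arithmetic right shift) is an assumption, since this README only names the three operations, and all function names and default parameters here are hypothetical rather than taken from the RTL.

```python
INT8_MIN = -128
INT8_MAX = 127


def relu(x):
    """Clamp negative accumulator values to zero."""
    return max(0, x)


def normalize(x, gain, bias, shift):
    """ASSUMED ordering: scale by gain, add bias, then arithmetic right shift."""
    return (x * gain + bias) >> shift


def quantize_int8(x):
    """Saturate to the signed 8-bit range instead of wrapping around."""
    return max(INT8_MIN, min(INT8_MAX, x))


def activation_pipeline(acc, gain=1, bias=0, shift=0):
    """Accumulator value -> ReLU -> normalization -> int8 quantization."""
    return quantize_int8(normalize(relu(acc), gain, bias, shift))


for acc in (-5, 100, 300):
    print(acc, "->", activation_pipeline(acc))
```

Saturation matters here because a 32-bit accumulator value that simply wrapped into 8 bits could flip sign, whereas clamping keeps large activations at the maximum representable value.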
Multi-Layer MLP
The MLP controller manages sequential layer processing:
State Machine:
IDLE → LOAD_WEIGHT → LOAD_ACT → COMPUTE → DRAIN → TRANSFER → NEXT_LAYER → WAIT_WEIGHTS → ...
- Double Buffering: Activations ping-pong between buffers for layer-to-layer transfer
- Weight Loading: Weights loaded per layer via UART
- Pipeline Overlap: While layer N drains, layer N+1 weights can be loaded
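The ping-pong behavior described above can be illustrated in software. The sketch below is a simplified model under stated assumptions: each layer is a square weight matrix applied to one activation vector, ReLU is the only post-processing step, and the names `matvec` and `run_mlp` are hypothetical. It ignores the state machine's timing and the weight-loading overlap entirely.

```python
def matvec(weights, vec):
    """Compute vec times weights, one output per weight column."""
    cols = len(weights[0])
    return [sum(vec[r] * weights[r][c] for r in range(len(vec))) for c in range(cols)]


def run_mlp(layers, inputs):
    """Run layers in sequence, ping-ponging activations between two buffers."""
    buffers = [list(inputs), [0] * len(inputs)]
    read_sel = 0  # index of the buffer the current layer reads from
    for weights in layers:
        src = buffers[read_sel]
        # The layer output lands in the buffer that is not being read.
        buffers[1 - read_sel] = [max(0, v) for v in matvec(weights, src)]
        # Swap roles: this layer's output becomes the next layer's input.
        read_sel = 1 - read_sel
    return buffers[read_sel]


layers = [
    [[1, 2], [3, 4]],    # layer 0 weights
    [[1, -1], [-1, 1]],  # layer 1 weights
]
output = run_mlp(layers, [5, 6])
print(output)
```

Because a layer never writes the buffer it is reading, its output can be produced while its input is still being consumed, which is the property the hardware's double buffering relies on.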
Open Source Tooling (Yosys/nextpnr)
Overview
While Vivado is the standard toolchain for Xilinx FPGAs, open-source alternatives exist:
- Yosys: Synthesis (RTL → netlist)
- nextpnr: Place & Route (netlist → routed design, written out as FASM for Xilinx targets; a separate step converts FASM to a bitstream)
Setup
Installation (Ubuntu/Debian):
# Install Yosys
sudo apt-get install yosys

# Install nextpnr (for Xilinx 7-series)
# Requires building from source - see nextpnr documentation
git clone https://github.com/YosysHQ/nextpnr.git
cd nextpnr
cmake . -DARCH=xilinx
make -j$(nproc)
sudo make install
Installation (macOS):
brew install yosys
# nextpnr requires manual build
Building with Yosys/nextpnr
Step 1: Synthesis (Yosys)
cd fpga

# Create synthesis script
cat > synth.ys << 'EOF'
# Read RTL files
read_verilog -sv ../rtl/pe.sv
read_verilog -sv ../rtl/mmu.sv
read_verilog -sv ../rtl/weight_fifo.sv
read_verilog -sv ../rtl/dual_weight_fifo.sv
read_verilog -sv ../rtl/accumulator_align.sv
read_verilog -sv ../rtl/accumulator_mem.sv
read_verilog -sv ../rtl/accumulator.sv
read_verilog -sv ../rtl/activation_func.sv
read_verilog -sv ../rtl/normalizer.sv
read_verilog -sv ../rtl/loss_block.sv
read_verilog -sv ../rtl/activation_pipeline.sv
read_verilog -sv ../rtl/unified_buffer.sv
read_verilog -sv ../rtl/mlp_top.sv
read_verilog -sv ../rtl/uart_rx.sv
read_verilog -sv ../rtl/uart_tx.sv
read_verilog -sv ../rtl/uart_controller.sv
read_verilog -sv ../rtl/tpu_bridge.sv
read_verilog -sv ../rtl/tpu_top.sv
read_verilog -sv basys3_top.sv

# Set top module
hierarchy -top basys3_top

# Synthesize
synth_xilinx -top basys3_top -family xc7

# Write netlist
write_verilog basys3_top_synth.v
write_json basys3_top.json
EOF

# Run synthesis
yosys synth.ys
Step 2: Place & Route (nextpnr)
# Place and route, writing the routed design and a FASM file
nextpnr-xilinx \
  --xdc basys3.xdc \
  --json basys3_top.json \
  --write basys3_top_routed.json \
  --fasm basys3_top.fasm

# Generate bitstream from the FASM file
# Note: fasm2bit conversion may require Xilinx tools or open-source alternatives
Building with Vivado
The project includes a TCL script for automated Vivado builds:
cd fpga

# Build bitstream (synthesis + implementation + bitgen)
vivado -mode batch -source build_vivado.tcl

# Expected build time: 5-10 minutes
# Output: basys3_top.bit
Build Script Details:
- Creates Vivado project: vivado_project/tinytinyTPU_basys3
- Synthesizes all RTL files from ../rtl/
- Implements design with timing constraints
- Generates bitstream: basys3_top.bit
- Creates reports: utilization, timing, DRC
Resource Utilization (Post-Implementation):
- Check vivado_project/tinytinyTPU_basys3.runs/impl_1/utilization_post_impl.rpt
- Check vivado_project/tinytinyTPU_basys3.runs/impl_1/timing_summary_post_impl.rpt
Programming the FPGA
Via Vivado Hardware Manager (GUI):
- Connect Basys3 board via USB
- Open Vivado
- Open Hardware Manager
- Auto-connect to target
- Program with basys3_top.bit
Via Command Line:
vivado -mode tcl
open_hw_manager
connect_hw_server
open_hw_target
set_property PROGRAM.FILE {basys3_top.bit} [get_hw_devices xc7a35t_0]
program_hw_devices [get_hw_devices xc7a35t_0]
Via OpenOCD (Alternative):
# If using OpenOCD with Digilent cable
openocd -f interface/ftdi/digilent_jtag_hs3.cfg -f target/xc7a35t.cfg
# Then load basys3_top.bit from the OpenOCD session (the exact commands depend on your OpenOCD setup)
Limitations & Considerations
Current Status:
- Yosys synthesis works well for most SystemVerilog constructs
- nextpnr supports Xilinx 7-series but may have timing/routing challenges
- Bitstream generation (fasm2bit) may require Xilinx tools or open-source alternatives
Recommendations:
- For development: Use Vivado for reliable builds
- For open-source exploration: Use Yosys for synthesis, verify with Vivado
- For production: Stick with Vivado until open-source toolchain matures
Future Work:
- Create automated Yosys/nextpnr build script
- Document fasm2bit conversion process
- Benchmark open-source vs. Vivado results
Troubleshooting
Simulation Issues
Verilator Errors:
- Ensure Verilator 5.022+ is installed
- Check SystemVerilog syntax (use make lint)
Test Failures:
- Run with WAVES=1 to generate waveforms for debugging
- Check sim/test_output.log for detailed error messages
FPGA Build Issues
Synthesis Errors:
- Check that the RTL files are in the rtl/ directory
- Verify SystemVerilog syntax (Vivado may be stricter than Verilator)
Timing Violations:
- Check timing_summary_post_impl.rpt
- May need to add pipeline stages or reduce clock frequency
Place & Route Failures:
- Check utilization reports
- Verify constraints in basys3.xdc
Hardware Issues
UART Not Working:
- Verify COM port: ls /dev/ttyUSB* (Linux) or Device Manager (Windows)
- Check baud rate: 115200
- Verify TX/RX pins in constraints file
LEDs Not Responding:
- Check bitstream programmed correctly
- Verify reset button (center button)
- Check switch settings for LED modes (see fpga/README.md)
Contributing
Contributions welcome! Areas for improvement:
- Additional test coverage
- Performance optimizations
- Documentation improvements
- Open-source toolchain support
- Larger array sizes
License
MIT License
Copyright (c) 2026 Alan Ma, Abiral Shakya
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Acknowledgments
- Inspired by Google's TPU architecture (thank you Cliff and Richard for your time!)
- The boys from the TinyTPU team!!
- Edmund and the Yosys / Symbiotic EDA crew
- Stanford FAF for the support, funding, and community!
- Princeton ECE Dept for the Basys 3 to play around with :)