Cerebras CS3 Simulator
A high-fidelity architectural simulator for the Cerebras CS3 Wafer-Scale Engine (WSE). This project models the hardware architecture, interconnects, and execution environment to enable performance analysis and software development for massively parallel 2D mesh architectures.
Project Overview
Cerebras-Sim models the CS3 WSE as an array of 720,000 processing elements (PEs). It provides a full-stack simulation environment, from a high-level Python DSL down to a custom 32-bit ISA binary.
Key Goals
- Performance Analysis: Estimate total runtime and identify bottlenecks using a hybrid performance/functional model.
- Software Development: Verify kernel correctness via a CUDA-like programming model before deploying to hardware.
- Architectural Exploration: Model the impact of mesh bisection bandwidth, latency, and SRAM constraints.
Architecture
Hardware Model
- Processing Element (PE): Each core implements an 8-wide SIMD unit, vector registers, and a private 48KB local SRAM.
- Interconnect: A 2D mesh (800x900) in which communication occurs via `SEND`/`RECV` primitives and global address space abstractions.
- Memory Hierarchy:
  - Local SRAM: Private high-speed memory per PE (analogous to CUDA Shared Memory).
  - Weight Server: External DRAM accessed via a global address space for large-scale model weights and data.
- Host-Device Interface: A driver model implementing a command queue (`CS3Queue`) and memory movement (`cs3_memcpy`).
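The figures in this hardware model can be cross-checked with a few lines of Python. The sketch below is illustrative only: `MeshConfig` and its field names are hypothetical and not part of the simulator, and treating 800 as the width and 900 as the height is an assumption about orientation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshConfig:
    """Hypothetical container for the hardware parameters listed above."""
    width: int = 800               # mesh columns (assumed orientation)
    height: int = 900              # mesh rows (assumed orientation)
    simd_width: int = 8            # SIMD lanes per PE
    sram_bytes: int = 48 * 1024    # private local SRAM per PE

    @property
    def num_pes(self) -> int:
        return self.width * self.height

    @property
    def total_sram_bytes(self) -> int:
        return self.num_pes * self.sram_bytes

    def in_bounds(self, x: int, y: int) -> bool:
        """True if (x, y) names a PE that exists on the mesh."""
        return 0 <= x < self.width and 0 <= y < self.height


cfg = MeshConfig()
print(cfg.num_pes)  # 720000
print(f"{cfg.total_sram_bytes / 2**30:.2f} GiB of aggregate local SRAM")
```

The 800x900 mesh yields the 720,000 PEs quoted in the overview, so the two numbers in this document are consistent with each other.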
Execution Model
The simulator employs a Bulk Synchronous Parallel (BSP) model, dividing execution into discrete "supersteps":
- Compute: PEs perform local SIMD operations.
- Communicate: PEs exchange data across the mesh or with the Weight Server.
- Synchronize: A global barrier (`SYNC`) aligns the execution state.
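The three phases of a superstep can be sketched on a toy grid. This is a minimal illustration of the BSP idea, not the simulator's scheduler: `run_superstep`, `send_right`, and the dictionary-based state are all hypothetical, and each PE holds a single float instead of SRAM contents.

```python
from typing import Callable, Dict, List, Tuple

Coord = Tuple[int, int]
Message = Tuple[Coord, float]  # (destination PE, payload)


def run_superstep(
    state: Dict[Coord, float],
    compute: Callable[[Coord, float], List[Message]],
) -> Dict[Coord, float]:
    """Run one compute / communicate / synchronize round."""
    # Compute: every PE works on its local value and queues messages.
    outbox: List[Message] = []
    for coord, value in state.items():
        outbox.extend(compute(coord, value))

    # Communicate: deliver queued messages into per-PE inboxes.
    inbox: Dict[Coord, List[float]] = {coord: [] for coord in state}
    for dest, payload in outbox:
        if dest in inbox:  # messages leaving the grid are dropped
            inbox[dest].append(payload)

    # Synchronize: the new state is built only after all deliveries,
    # so no PE observes a partially updated neighbour.
    return {coord: state[coord] + sum(inbox[coord]) for coord in state}


def send_right(coord: Coord, value: float) -> List[Message]:
    """Toy kernel: forward the local value to the PE on the right."""
    x, y = coord
    return [((x + 1, y), value)]


state = {(0, 0): 1.0, (1, 0): 2.0, (2, 0): 3.0}
state = run_superstep(state, send_right)
print(state)  # {(0, 0): 1.0, (1, 0): 3.0, (2, 0): 5.0}
```

Building the next state in one step at the end plays the role of the global barrier: every PE reads values from the previous superstep only.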
To balance accuracy and speed, the simulator uses a hybrid execution track:
- Performance Track (Global): All PEs are tracked for cycle counts and timing.
- Functional Track (Sampled): A stochastic sampling strategy selects a subset of blocks for full functional simulation, which is used to verify correctness.
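One way to picture the two tracks is a loop that charges cycles to every block but only marks a random subset for functional execution. The sketch below assumes uniform sampling with a fixed seed and a constant per-block cycle cost; the function names, the 10% default fraction, and the cost model are hypothetical and not taken from the simulator.

```python
import random
from typing import Dict, List, Set, Tuple


def pick_functional_blocks(num_blocks: int, fraction: float, seed: int = 0) -> Set[int]:
    """Choose which blocks receive full functional simulation."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    k = min(num_blocks, max(1, round(num_blocks * fraction)))
    return set(rng.sample(range(num_blocks), k))


def simulate_superstep(
    num_blocks: int,
    cycles_per_block: int,
    fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[Dict[int, int], List[int]]:
    sampled = pick_functional_blocks(num_blocks, fraction, seed)
    cycles: Dict[int, int] = {}
    functionally_checked: List[int] = []
    for block in range(num_blocks):
        # Performance track: every block is charged its cycle count.
        cycles[block] = cycles_per_block
        # Functional track: only sampled blocks would execute real data.
        if block in sampled:
            functionally_checked.append(block)
    return cycles, functionally_checked


cycles, checked = simulate_superstep(num_blocks=100, cycles_per_block=250)
print(len(cycles), len(checked))  # 100 10
```

Timing coverage stays complete while the cost of functional execution scales with the sampled fraction, which is the trade-off the hybrid track is meant to exploit.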
Software Stack
The project implements a complete toolchain:
Python DSL $\rightarrow$ Tungsten-IR $\rightarrow$ ISA Binary $\rightarrow$ Simulator
Programming Example
Kernels are written in a CUDA-like Python DSL. For example, a simple SAXPY kernel computing z = 2.0 * x + y:

    @cs3_kernel(block_w=16, block_h=16)
    def saxpy_kernel(ctx):
        # Load inputs from global memory (Weight Server)
        x = ctx.load_global(None, 0)
        y = ctx.load_global(None, 4)
        # Compute: z = 2.0 * x + y
        z = 2.0 * x + y
        # Store result back to global memory
        ctx.store_global(None, 8, z)

- Frontend: A CUDA-like DSL embedded in Python using `@cs3_kernel` decorators.
- Intermediate Representation (Tungsten-IR): A dataflow-centric IR mapping compute nodes and synchronization points.
- Compiler Backend:
  - Mapping & Scheduling: Assigns IR nodes to the physical 2D mesh and manages the SRAM budget.
  - Assembler: Emits the final 32-bit binary stream.
- Simulator Engine: A Python-based engine that decodes the ISA and drives the hardware model.
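To make the assembler and decoder stages concrete, the sketch below packs and unpacks fixed-width 32-bit instruction words. This README does not specify the bit layout of the ISA, so everything here is an assumption for illustration: the field layout (8-bit opcode plus three 8-bit operand fields), the opcode numbers, the little-endian byte order, and the helper names `encode`, `decode`, and `assemble`.

```python
import struct
from typing import Iterable, Tuple

# Hypothetical layout: [31:24] opcode | [23:16] dst | [15:8] src0 | [7:0] src1
Instr = Tuple[int, int, int, int]


def encode(opcode: int, dst: int, src0: int, src1: int) -> int:
    """Pack four 8-bit fields into one 32-bit instruction word."""
    for field in (opcode, dst, src0, src1):
        if not 0 <= field <= 0xFF:
            raise ValueError(f"field {field} does not fit in 8 bits")
    return (opcode << 24) | (dst << 16) | (src0 << 8) | src1


def decode(word: int) -> Instr:
    """Inverse of encode: split a 32-bit word back into its fields."""
    return ((word >> 24) & 0xFF, (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF)


def assemble(instrs: Iterable[Instr]) -> bytes:
    """Emit a little-endian stream of 32-bit words (byte order is assumed)."""
    return b"".join(struct.pack("<I", encode(*instr)) for instr in instrs)


program = [(0x01, 2, 0, 1), (0x7F, 0, 0, 0)]  # made-up opcodes
binary = assemble(program)
words = struct.unpack(f"<{len(binary) // 4}I", binary)
print([decode(w) for w in words])  # [(1, 2, 0, 1), (127, 0, 0, 0)]
```

A round trip through `assemble` and `decode` of this kind is also a convenient unit test for any real encoder/decoder pair.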
Performance Modeling
Instead of exhaustive packet-level simulation, the system uses a latency-and-bandwidth-aware abstract model:
- Latency: Calculated from the physical Manhattan distance between the communicating PEs:
  $$\text{Latency}_{\text{op}} = \text{Base Latency} + (\text{Manhattan Distance} \times \text{Hop Latency})$$
- Bandwidth & Congestion: The simulator enforces a Bisection Bandwidth Constraint. If the total bytes transferred per superstep exceed network capacity, a congestion multiplier is applied to "stretch" the superstep duration.
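Both parts of this model fit in a few lines. The latency function follows the formula above directly. The congestion function is an assumption: the README says only that a multiplier stretches the superstep, so the sketch uses the ratio of transferred bytes to capacity, clamped at 1. All function names and numeric values are hypothetical.

```python
from typing import Tuple

Coord = Tuple[int, int]


def manhattan(src: Coord, dst: Coord) -> int:
    """Hop count between two PEs on a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def op_latency(src: Coord, dst: Coord, base_latency: int, hop_latency: int) -> int:
    """Latency_op = Base Latency + (Manhattan Distance x Hop Latency)."""
    return base_latency + manhattan(src, dst) * hop_latency


def superstep_cycles(nominal_cycles: float, bytes_transferred: float,
                     capacity_bytes: float) -> float:
    """Stretch a superstep when traffic exceeds network capacity.

    Assumed multiplier: max(1, bytes_transferred / capacity_bytes).
    """
    multiplier = max(1.0, bytes_transferred / capacity_bytes)
    return nominal_cycles * multiplier


# Example with made-up parameters: 10-cycle base cost, 2 cycles per hop.
print(op_latency((0, 0), (3, 4), base_latency=10, hop_latency=2))  # 24
print(superstep_cycles(100, bytes_transferred=50, capacity_bytes=100))   # 100.0
print(superstep_cycles(100, bytes_transferred=200, capacity_bytes=100))  # 200.0
```

Under this assumed form, a superstep that moves twice the capacity takes twice as long, while one that stays under capacity is not penalised.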
Current Status
| Component | Status | Details |
|---|---|---|
| ISA Decoder | Complete | Full implementation of Compute, Mesh, Control, System, DSD, and Global memory opcodes. |
| Hardware Model | Functional | Core logic, SRAM, 2D Mesh, and Host-Device IO are implemented. |
| Compiler | Functional | AST parsing, IR generation, register allocation, and assembly are operational. |
| Simulation Engine | Functional | Hybrid Performance/Functional tracks and BSP scheduling are implemented. |
| Advanced Mapping | In Progress | Optimizing spatial mapping for complex kernels. |
| Weight Server | In Progress | Integration with external weight servers for real-world model weights. |
Testing & Validation
The project uses a Dual-Execution strategy to verify the compiler:
- Python Path: Executes the kernel as a Python function via `KernelContext` (Golden Reference).
- Binary Path: Compiles the kernel $\rightarrow$ executes the resulting binary on the `BSPScheduler`.
- Comparison: Bit-exact comparison of final memory states.
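The comparison step can be illustrated on raw bytes. The sketch below assumes the final memory state of each path is available as a `bytes` object; `first_mismatch` is a hypothetical helper, not part of the test suite. It also shows why a bit-exact check is stricter than numeric equality: `0.0` and `-0.0` compare equal as floats but differ in their encoding.

```python
import struct
from typing import Optional


def first_mismatch(golden: bytes, actual: bytes) -> Optional[int]:
    """Return the byte offset of the first difference, or None if bit-exact."""
    for offset, (g, a) in enumerate(zip(golden, actual)):
        if g != a:
            return offset
    if len(golden) != len(actual):
        # Common prefix matches, but one buffer is longer than the other.
        return min(len(golden), len(actual))
    return None


# Identical float32 buffers compare as bit-exact.
golden = struct.pack("<3f", 1.0, 2.0, 4.0)
actual = struct.pack("<3f", 1.0, 2.0, 4.0)
print(first_mismatch(golden, actual))  # None

# 0.0 == -0.0 numerically, yet the sign bit differs in memory.
pos_zero = struct.pack("<f", 0.0)
neg_zero = struct.pack("<f", -0.0)
print(0.0 == -0.0, first_mismatch(pos_zero, neg_zero))  # True 3
```

Reporting the offset rather than a bare boolean makes it easier to trace a mismatch back to the store instruction that produced it.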
To run integration tests:
python3 -m unittest discover tests/integration