EP4CE6 Bitstream Reverse Engineering
What Is This Project?
The goal of this project is to fully reverse-engineer the bitstream format of the Altera (now Intel) Cyclone IV FPGA chip EP4CE6F17C8.
What Is an FPGA?
For readers new to FPGAs: a bitstream is the file that configures the
chip's programmable logic — for Altera parts, a .rbf (Raw Binary File).
Background: Cyclone IV device handbook.
Why Reverse-Engineer the Bitstream?
Commercial FPGA vendors (Intel/Altera, Xilinx/AMD) keep their bitstream formats proprietary. You must use their own tools (e.g., Quartus) to generate a bitstream. This means:
- No open-source toolchain: You cannot use open-source synthesizers (like Yosys) or place-and-route tools (like NextPNR) to go all the way from Verilog to a bitstream.
- No insight into chip internals: You don't know which part of the chip each bit in the bitstream controls.
- Dependency on closed-source software: Quartus is free but not open-source, and only supports certain operating systems.
Once we reverse-engineer the bitstream format we can:
- Build a completely open-source FPGA toolchain for the EP4CE6
- Understand how the chip's CRAM (Configuration RAM) is organized internally
- Directly read and write logic configuration and routing information in bitstream files
Pioneer Projects
| Project | Target chip | Core contribution | Relation to this project |
|---|---|---|---|
| Project IceStorm | Lattice iCE40 | End-to-end open toolchain: fuzzer → icebox chip database → icepack/icetime bitstream tools | Original methodology template (black-box pair-diff fuzzing) |
| Project X-Ray | Xilinx 7-series | Defined the FASM intermediate format and the specimen-fuzzer harness pattern | FASM format adopted here |
| Project Mistral | Altera Cyclone V | Derived the Routing Bit Mask (RBM) model from quartus_cdb + custom Tcl passes | Same chip family; our LI MUX model descends from Mistral's RBM work |
| Project Trellis | Lattice ECP5 | Diamond-driven fuzzing with routing-bit decomposition; integrates with nextpnr-ecp5 | Routing decomposition strategy reference |
Hardware and Software Environment
Hardware
- Development board: Heijin AX301
- FPGA chip: EP4CE6F17C8 (Cyclone IV E series, 6,272 logic elements)
- Programmer: USB-Blaster JTAG
Software
- Quartus Prime 21.1 Lite Edition: Intel's free FPGA development tool
  - Installation path: ~/intelFPGA_lite/21.1/quartus/bin/
  - Command-line tools used: quartus_map (synthesis), quartus_fit (place & route), quartus_asm (generate .sof), quartus_cpf (convert to .rbf), quartus_sta (static timing analysis)
- openFPGALoader: open-source FPGA programming tool (flashes bitstream to board); use $HOME/see_neorv32_run_linux/tools/openFPGALoader/build/openFPGALoader, because the system version does not recognize the EP4CE6 IDCODE
- Python 3: all fuzzing scripts are written in Python
- SQLite: database for storing experiment results
EP4CE6 Chip Geometry
EP4CE6F17C8 internal layout (simplified):
X=3 4 6 7 8 10 11 12 13 16 17 18 19 21 22 23 24 25 26 28 29 31
Y=21 [LAB][LAB][LAB][LAB][LAB][LAB]... [LAB]
Y=19 [LAB][LAB][LAB][LAB][LAB][LAB]... [LAB]
... | | | | | | |
Y=2 [LAB][LAB][LAB][LAB][LAB][LAB]... [LAB]
^ ^ ^ ^
X=5 X=9 X=14-15 X=20,27
M9K M9K DSP M9K
RAM RAM Multiplier RAM
- 392 LABs (Logic Array Blocks), each containing 16 LEs (Logic Elements)
- LAB X coordinates: 22 values [3, 4, 6, 7, 8, 10, 11, 12, 13, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26, 28, 29, 31]
  - Note that X is not contiguous: X=5, 9, 14, 15, 20, 27, 30 are occupied by M9K memory, DSP multipliers, or PLLs.
- LAB Y coordinates: 18 values [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 21]
  - Y=15 and Y=20 do not exist ("ghost rows" in CRAM)
- LE N index: 16 even values [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
- Total: 392 × 16 = 6,272 LEs
Each LE contains:
- A 4-input lookup table (LUT4): can implement any 4-variable Boolean function
- A D flip-flop (DFF): optionally used
- Carry-chain logic (for arithmetic operations like addition)
Core Methodology: "Pair-Diff"
This is the core method of the entire reverse-engineering effort — simple yet powerful.
Basic Idea
If you want to know which bits in the bitstream control a specific feature, compile two designs that differ only in that feature, then compare their bitstreams. The differing bits are the ones that control that feature.
Concrete Steps (LUT Truth Table Example)
Step 1: Compile a LUT with truth table set to all zeros (mask = 0x0000)
→ produces zero.rbf
Step 2: Compile the same LUT at the same location with all ones (mask = 0xFFFF)
→ produces ones.rbf
Step 3: Compare zero.rbf and ones.rbf bit by bit
→ the differences are the CRAM bits for this LE's truth table
Why Does This Work?
Because the two designs are identical in every way except the LUT's truth table — same routing, same I/O buffers, same global config. The differing bits can only be the truth-table encoding.
Comparison Method Hierarchy
| Method | What is compared | Effect | Noise |
|---|---|---|---|
| Design vs. empty | functional vs. non-functional | Finds all related bits | High (includes routing, etc.) |
| Pair-diff | mask=0x0000 vs. mask=0xFFFF | Only finds LUT TT bits | Zero noise |
| Multi-mask cross | multiple designs with different masks | Validates XOR-linear model | Zero |
How Is It Implemented?
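# rbf_diff.py: compare two RBF files
from typing import NamedTuple

RBF_SIZE = 368_011  # every EP4CE6 RBF has this length

# Field layout follows the description below; the real definition may differ
class BitDiff(NamedTuple):
    byte_offset: int
    bit_position: int
    direction: int

def diff_rbf(rbf_a: bytes, rbf_b: bytes) -> list[BitDiff]:
    diffs = []
    for i in range(RBF_SIZE):          # iterate over 368,011 bytes
        xor = rbf_a[i] ^ rbf_b[i]      # XOR to find differing bytes
        if xor:
            for bit in range(8):       # check each bit
                if xor & (1 << bit):
                    direction = 1 if (rbf_b[i] >> bit) & 1 else -1
                    diffs.append(BitDiff(i, bit, direction))
    return diffs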
Each BitDiff records three values:
- byte_offset: byte offset in the RBF file (0 to 368,010)
- bit_position: bit position within that byte (0=LSB, 7=MSB)
- direction: change direction (+1 means 0→1, -1 means 1→0)
RBF File Format
An EP4CE6 RBF file is always exactly 368,011 bytes, regardless of design complexity:
┌──────────────────────────┐
│ Preamble │ 32 bytes, all 0xFF
├──────────────────────────┤
│ │
│ Config Data │ 367,920 bytes
│ Contains CRAM content │ All logic and routing encoded here
│ │
├──────────────────────────┤
│ Postamble │ 59 bytes, all 0xFF
└──────────────────────────┘
Key regions:
- 0x0020 – 0x0028: Device header (constant: 6A F7 F7 F7 F7 F7 F7 F3 FB)
- 0x0029 – 0x0034: Design-dependent data (12 bytes, possibly resource-usage encoding)
- 0x0049 – 0x004A: CRC/checksum (changes with every modification)
- 0x004B – 0x59BBB: CRAM configuration data body
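The layout above can be turned into a quick sanity check before an RBF enters the diff pipeline. The following is a minimal sketch, assuming the preamble, postamble, and device-header values listed here hold for every EP4CE6 RBF; the names `check_rbf_layout` and `make_synthetic_rbf` are illustrative and not part of the project's code.

```python
RBF_SIZE = 368_011
PREAMBLE_LEN = 32      # all 0xFF
POSTAMBLE_LEN = 59     # all 0xFF
DEVICE_HEADER = bytes.fromhex("6AF7F7F7F7F7F7F3FB")  # bytes 0x0020-0x0028


def check_rbf_layout(data: bytes) -> dict[str, bool]:
    """Return one pass/fail flag per layout property described above."""
    return {
        "size": len(data) == RBF_SIZE,
        "preamble": data[:PREAMBLE_LEN] == b"\xff" * PREAMBLE_LEN,
        "device_header": data[0x20:0x29] == DEVICE_HEADER,
        "postamble": data[-POSTAMBLE_LEN:] == b"\xff" * POSTAMBLE_LEN,
    }


def make_synthetic_rbf() -> bytes:
    """Build a fake image with the right framing (hypothetical test data)."""
    body = bytearray(RBF_SIZE - PREAMBLE_LEN - POSTAMBLE_LEN)  # 367,920 zero bytes
    body[0:len(DEVICE_HEADER)] = DEVICE_HEADER
    return b"\xff" * PREAMBLE_LEN + bytes(body) + b"\xff" * POSTAMBLE_LEN


report = check_rbf_layout(make_synthetic_rbf())
print(report)
```

A file that fails any of these checks was probably produced with compression enabled or by a different conversion path, so it should not be diffed byte-for-byte against other RBFs.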
Project Directory Structure
EP4CE6/
├── README.md ← This file (English)
├── README_zh.md ← Chinese version
├── CLAUDE.md ← AI assistant context/memory file
├── fuzz/ ← Fuzzing pipeline (Python source, 96 modules)
│ ├── config.py ← EP4CE6 constants, coordinates, pin definitions
│ ├── verilog_gen.py ← Verilog code generator
│ ├── qsf_gen.py ← Quartus project config file generator
│ ├── compile.py ← Quartus headless compilation driver
│ ├── rbf_diff.py ← Bit-level binary diff engine
│ ├── database.py ← SQLite database interface
│ ├── runner.py ← Fuzzing experiment orchestrator (main entry)
│ ├── analyze.py ← Result analysis and visualization
│ ├── bitstream.py ← Bitstream codec (LUT + RouteCodec + CRC patcher)
│ ├── route_synth.py ← Green-island route synthesis engine
│ ├── fasm2rbf.py / rbf2fasm.py ← Phase 4 FASM writer + reverse tool
│ └── route_signatures.py / route_decompose.py ← sig backend + set-cover
├── synth/ ← Open-source toolchain (Yosys + nextpnr-generic)
│ ├── ep4ce6_map.v ← Cyclone IV techmap (LUT4/DFF primitives)
│ ├── prims.v ← nextpnr-generic primitive library
│ ├── m9k.lib ← M9K BRAM library stub
│ ├── synth_ep4ce6.ys ← Yosys synthesis script (NEORV32 source paths use $HOME)
│ ├── synth_ep4ce6.sh ← wrapper — run this instead of the .ys; envsubst's $HOME / $NEORV32_ROOT
│ └── np2fasm.py ← nextpnr routed JSON → FASM converter
├── jailbreak/ ← CE10 fitter probes (X=32/33, Y=15 dead-cell scans)
├── results/
│ ├── rbf/ ← Collected .rbf files (~2,500 files, 368 KB each)
│ ├── fingerprint_*.json ← 15 green-zone island corpora
│ ├── route_cells_full.json ← 13,487 sig-cache (7-tuple, Plan D' + legacy)
│ ├── route_cells_consolidated.json ← port-MUX consolidated loader
│ ├── nv_fingerprints/ ← NEORV32 per-source fingerprints
│ ├── r4_iindex_table.json ← 942-entry R4 I-index hint table
│ ├── ep4ce6_bitdb.sqlite ← Bit-mapping database
│ └── FINDINGS.md ← Detailed findings report
└── work/ ← Quartus temporary build directory (can be cleaned)
Source Code Statistics (core modules)
| File | Lines | Function |
|---|---|---|
| config.py | 184 | Chip constants, CRAM address formulas, pin definitions |
| verilog_gen.py | 363 | Verilog generation (LUT/FF/LI/route/jailbreak templates) |
| qsf_gen.py | 81 | QSF project configuration generation |
| compile.py | 270 | Quartus compilation driver + STA routing extraction |
| rbf_diff.py | 111 | Binary comparison engine |
| database.py | 193 | SQLite database operations |
| runner.py | 1,305 | Experiment orchestrator (largest file) |
| analyze.py | 570 | Analysis, visualization, and codec commands |
| bitstream.py | 1,264 | LutCodec + RouteCodec + CRC patcher |
| route_synth.py | 398 | Green-island route synthesis |
| fasm2rbf.py | 239 | Phase 4 FASM → RBF bitgen |
| rbf2fasm.py | 175 | Phase 4 RBF → FASM reverse tool |
| Core total | ~5,150 | (+ 84 mining/analysis/test modules) |
Code Architecture
1. config.py — Chip Constants
This file defines all physical parameters of the EP4CE6:
# Chip geometry
LAB_X = [3, 4, 6, 7, 8, 10, 11, 12, 13, 16, 17, 18, 19, 21, ...]  # 22 LAB columns
LAB_Y = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, ...]     # 18 LAB rows
LE_N = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]  # 16 LEs

# CRAM column base addresses (22 columns, each at a different start)
COLUMN_BASE = {
    3: 0x076E0,  # first LAB column
    4: 0x09396,  # = 0x076E0 + 7350 (standard step)
    6: 0x0CD02,  # = 0x09396 + 14700 (skip M9K column)
    ...
}

# Pin assignments (matching GPIO on AX301 dev board)
FUZZ_PINS = {
    "A": "PIN_E16",  # Button KEY2 → LUT input A
    "B": "PIN_M16",  # Button KEY3 → LUT input B
    "C": "PIN_M15",  # Button KEY4 → LUT input C
    "D": "PIN_E15",  # Reset key → LUT input D
    "Q": "PIN_G15",  # LED[0] → LUT output
}
The most important part is the CRAM address model functions:
def cram_ctrl_addr(x, y, pair, n=0):
    """Compute CRAM control-byte address for pair 'pair' at location (X, Y, N)"""
    cram_row = y - 2       # Y coordinate mapped to CRAM row number
    slot = cram_row % 3    # Groups of 3 rows; slot = 0, 1, or 2
    group = cram_row // 3  # Group index (0–6)
    # ... compute offset
2. verilog_gen.py — Verilog Generator
Generates minimal Verilog designs, each containing just one or two LUTs:
def gen_lut4_primitive(mask: int) -> str:
    """Instantiate a LUT using the Cyclone IV primitive, controlling its 16-bit truth table."""
    return f"""
    cycloneive_lcell_comb #(
        .lut_mask(16'h{mask:04X}),  // truth table, e.g. 0x8888 = A & B
        .dont_touch("on")           // tell Quartus not to optimize it away
    ) lut_inst (
        .dataa(A), .datab(B), .datac(C), .datad(D),
        .combout(lut_out)
    );"""
Why use a primitive rather than a behavioral description?
A behavioral description (assign Q = A & B;) lets Quartus decide how to implement the logic — the synthesizer may optimize, merge, or reorder LUTs. By directly instantiating the cycloneive_lcell_comb primitive, you can precisely control every bit of the 16-bit truth table, which is critical for reverse engineering.
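To pick a mask for the primitive, it helps to derive it from the Boolean function you want. The sketch below assumes truth-table bit index = A + 2·B + 4·C + 8·D, an ordering chosen because it reproduces the two masks quoted in this README (0x8888 for A & B, 0x6996 for the 4-input XOR); `lut_mask` is a hypothetical helper, not a function from verilog_gen.py.

```python
from typing import Callable


def lut_mask(fn: Callable[[int, int, int, int], int]) -> int:
    """Build a 16-bit LUT mask, assuming bit index = A + 2*B + 4*C + 8*D."""
    mask = 0
    for index in range(16):
        a = index & 1
        b = (index >> 1) & 1
        c = (index >> 2) & 1
        d = (index >> 3) & 1
        if fn(a, b, c, d) & 1:
            mask |= 1 << index  # this input combination produces a 1
    return mask


and_ab = lut_mask(lambda a, b, c, d: a & b)
xor4 = lut_mask(lambda a, b, c, d: a ^ b ^ c ^ d)
print(f"{and_ab:#06x} {xor4:#06x}")  # 0x8888 0x6996
```

Single-minterm masks such as 0x0001 or 0x0002, the kind used for the minterm sweep, fall out of the same helper by passing a function that is true for exactly one input combination.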
Available generator functions:
| Function | Purpose | Notes |
|---|---|---|
| gen_lut4(expr) | Behavioral LUT | Boolean expression description |
| gen_lut4_primitive(mask) | Primitive-level LUT | Precise truth-table control |
| gen_two_luts_primitive(m1, m2) | Two connected LUTs | For routing fuzzing |
| gen_single_lut_primitive_extra_inputs(m) | Single LUT + 7 input ports | Routing fuzzing baseline |
| gen_lut4_ff(expr) | LUT + flip-flop | For DFF fuzzing |
| gen_empty() | Empty design | Global baseline |
3. qsf_gen.py — QSF Generator
A QSF (Quartus Settings File) is Quartus's project configuration file. This module generates QSF files with critical settings:
# Disable all optimization: this is key to fuzzing success!
QSF_OPTIMIZATIONS_OFF = [
    ('AUTO_RAM_RECOGNITION', 'OFF'),                  # Don't infer RAM
    ('AUTO_DSP_RECOGNITION', 'OFF'),                  # Don't infer DSP
    ('AUTO_SHIFT_REGISTER_RECOGNITION', 'OFF'),       # Don't infer shift registers
    ('SYNTH_TIMING_DRIVEN_SYNTHESIS', 'OFF'),         # No timing-driven synthesis
    ('ROUTER_TIMING_OPTIMIZATION_LEVEL', 'MINIMUM'),  # Minimize routing optimization
    ...
]

# Force LUT placement to specified location
placement = {"lut_inst": "LCCOMB_X10_Y10_N0"}
# Generates in QSF:
# set_location_assignment LCCOMB_X10_Y10_N0 -to "lut_inst"
Why disable optimization? Quartus's optimizer changes routing paths. If two compilations of the same design choose different paths, our diff will contain routing noise. After setting ROUTER_TIMING_OPTIMIZATION_LEVEL MINIMUM, routing becomes fully deterministic — the same design always produces an identical bitstream.
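As a rough illustration of how those settings become QSF text, here is a minimal sketch. It only covers the five settings named above (the list is elided in this README and is kept elided here), and `render_qsf` is a hypothetical name; a real QSF also needs device, family, and top-level entity assignments, which are omitted.

```python
# Only the settings spelled out in this README; the full list is longer.
QSF_OPTIMIZATIONS_OFF = [
    ("AUTO_RAM_RECOGNITION", "OFF"),
    ("AUTO_DSP_RECOGNITION", "OFF"),
    ("AUTO_SHIFT_REGISTER_RECOGNITION", "OFF"),
    ("SYNTH_TIMING_DRIVEN_SYNTHESIS", "OFF"),
    ("ROUTER_TIMING_OPTIMIZATION_LEVEL", "MINIMUM"),
]


def render_qsf(settings: list[tuple[str, str]], placement: dict[str, str]) -> str:
    """Render global assignments followed by location constraints."""
    lines = [f"set_global_assignment -name {name} {value}" for name, value in settings]
    lines += [
        f'set_location_assignment {location} -to "{node}"'
        for node, location in placement.items()
    ]
    return "\n".join(lines) + "\n"


qsf_text = render_qsf(QSF_OPTIMIZATIONS_OFF, {"lut_inst": "LCCOMB_X10_Y10_N0"})
print(qsf_text)
```

Keeping the settings as data makes it easy to confirm that both halves of a pair-diff were compiled with identical options.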
4. compile.py — Quartus Compilation Driver
Wraps the Quartus command-line toolchain:
Quartus compilation flow:
Verilog quartus_map quartus_fit quartus_asm quartus_cpf
source ─────────────► ─────────────► ─────────────► ─────────────► .rbf
(synthesis) (place&route) (gen .sof) (conv to .rbf)
Key implementation details:
def compile_and_export(project_name, verilog, qsf, rbf_output):
    """One-stop: create project → compile → export RBF"""
    proj_dir = setup_project(project_name, verilog, qsf)     # write files
    ok, elapsed, err = compile_full(project_name, proj_dir)  # run Quartus
    rbf = None  # stays None when compilation fails
    if ok:
        rbf = generate_rbf(project_name, proj_dir, rbf_output)  # .sof → .rbf
    return rbf, elapsed, err

def extract_routing(project_name, proj_dir):
    """Extract routing paths via static timing analysis"""
    # Runs Tcl script calling report_timing -show_routing
    # Parses output to get wire names on each path
Important: RBF generation must use quartus_cpf -c -o bitstream_compression=off; do NOT use sof2rbf.py (which produces invalid bitstreams).
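The four-stage flow and the quartus_cpf options above can be written down as a command list. This is only a sketch: the exact arguments compile.py passes are not shown in this README, so the bare project-name argument for the first three tools is an assumption, and `quartus_commands` is a hypothetical helper that builds the argument vectors without running anything.

```python
from pathlib import Path


def quartus_commands(project: str, sof: Path, rbf: Path) -> list[list[str]]:
    """Return the argv lists for map → fit → asm → cpf, in order."""
    return [
        ["quartus_map", project],  # synthesis
        ["quartus_fit", project],  # place & route
        ["quartus_asm", project],  # generate .sof
        # Uncompressed RBF, as required above
        ["quartus_cpf", "-c", "-o", "bitstream_compression=off", str(sof), str(rbf)],
    ]


commands = quartus_commands("fuzz", Path("fuzz.sof"), Path("fuzz.rbf"))
for argv in commands:
    print(" ".join(argv))
```

Each list can be handed to `subprocess.run(argv, cwd=proj_dir, check=True)`; stopping at the first failure keeps a broken fit from producing a stale .sof that then gets converted.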
5. runner.py — Fuzzing Experiment Orchestrator
The largest file (about 1,300 lines; see the statistics table above), orchestrating all fuzzing experiments. Main commands:
# Generate baseline RBF
python3 runner.py baseline

# LUT truth-table fuzzing at a single LE location
python3 runner.py --node lut_inst lut_single 10 10 0
# Args: X=10, Y=10, N=0

# Sweep all 16 minterms (single-bit TT patterns)
python3 runner.py n_sweep 10 10

# Grid sweep pair-diff across all 22 columns
python3 runner.py pair_diff_grid

# Parallel routing fuzzing
python3 runner.py route_map_parallel 10 5 col
# Args: source X=10, source Y=5, direction=column

# Batch routing fuzzing (multiple source positions)
python3 runner.py route_map_batch --sources 4,10 29,10 10,17 --direction row --jobs 4
6. database.py — SQLite Database
All experiment results are stored in an SQLite database:
-- Experiment records
CREATE TABLE experiments (
    id INTEGER PRIMARY KEY,
    name TEXT,            -- experiment name, e.g. "lut_single_X10_Y10_N0"
    verilog TEXT,         -- Verilog source code (stored in full)
    qsf_placement TEXT,   -- placement constraints
    compile_time REAL,    -- compilation time (seconds)
    rbf_path TEXT         -- path to RBF file
);

-- Bit mappings (core data)
CREATE TABLE bit_mapping (
    x INTEGER,            -- LAB X coordinate
    y INTEGER,            -- LAB Y coordinate
    n INTEGER,            -- LE index
    feature TEXT,         -- feature name, e.g. "lut_tt_0x0001"
    byte_offset INTEGER,  -- byte offset in RBF
    bit_position INTEGER, -- bit position within that byte
    direction INTEGER,    -- change direction (+1 or -1)
    PRIMARY KEY (x, y, n, feature, byte_offset, bit_position)
);

-- Routing paths
CREATE TABLE routing_paths (
    src_x, src_y, src_n,  -- source LE coordinates
    dst_x, dst_y, dst_n,  -- destination LE coordinates
    path_json TEXT        -- wire path (JSON format)
);
Current database statistics:
- 1,961 experiments
- 708,319 bit-mapping records
- 980 routing paths (including complete wire paths from STA extraction)
- 95 distinct features
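To show how diff results might flow into the bit_mapping table, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The helper names `record_diffs` and `bits_for_feature` are hypothetical (the real database.py interface is not shown here), and the byte offsets in the example are made-up values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bit_mapping (
        x INTEGER, y INTEGER, n INTEGER,
        feature TEXT,
        byte_offset INTEGER, bit_position INTEGER, direction INTEGER,
        PRIMARY KEY (x, y, n, feature, byte_offset, bit_position)
    )
    """
)


def record_diffs(conn, x, y, n, feature, diffs):
    """Store (byte_offset, bit_position, direction) triples for one feature."""
    conn.executemany(
        "INSERT OR REPLACE INTO bit_mapping VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(x, y, n, feature, off, bit, direction) for off, bit, direction in diffs],
    )


def bits_for_feature(conn, x, y, n, feature):
    """Return the recorded bits for one feature at one LE, in address order."""
    cursor = conn.execute(
        "SELECT byte_offset, bit_position, direction FROM bit_mapping "
        "WHERE x = ? AND y = ? AND n = ? AND feature = ? "
        "ORDER BY byte_offset, bit_position",
        (x, y, n, feature),
    )
    return cursor.fetchall()


# Made-up offsets, for illustration only
record_diffs(conn, 10, 10, 0, "lut_tt_0x0001", [(82048, 4, -1), (82000, 4, 1)])
rows = bits_for_feature(conn, 10, 10, 0, "lut_tt_0x0001")
print(rows)
```

Because the primary key includes the feature name and bit address, re-running an experiment overwrites its earlier rows instead of duplicating them.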
Reverse-Engineering Results: Phase by Phase
Phase 1: Building the Fuzzing Pipeline
Goal: Build an automated compile → compare → record workflow.
Acceptance criteria: Place a LUT at a specified coordinate, compile an RBF, and find bit differences between two different designs.
Key steps:
- Compile empty design → get baseline.rbf (bitstream when all LUTs are "absent")
- Place an A & B LUT at (X=10, Y=10, N=0) → get and.rbf
- Diff → ~450 bit differences (LUT config + routing)
- Use Pair-Diff: same position, mask=0x0000 vs. mask=0xFFFF → only 64 bits differ → pure LUT truth table!
Compilation time: ~9–10 sec (synthesis + place-and-route + RBF generation), throughput ~360–400 per hour.
Phase 2: Cracking Logic Configuration
Phase 2.1: LUT Truth-Table Encoding
Discovery: LUT truth tables use XOR-linear encoding.
What does this mean? A simplified example:
Suppose a 2-input LUT has 4 truth-table bits (TT[0] to TT[3]). With a "direct" encoding, each TT bit corresponds to one CRAM bit. But Cyclone IV's encoding is more complex — each TT bit maps to 8–10 CRAM bits, and those CRAM bits have an XOR relationship.
Single-bit patterns (CRAM bits from minterm pair-diffs):
TT bit 0 (mask 0x0001) → {A1, B3, B5, C2, C7, D1, D4, E6} ← 8 CRAM bits
TT bit 1 (mask 0x0002) → {A1, B3, B5, C2, C7, D2, D5, E7} ← also 8 CRAM bits, 5 in common with bit 0
↑ ↑ ↑ ↑ ↑ 5 shared!
These bits are shared between the two
CRAM bits for any mask = XOR(CRAM bit sets for each '1' bit in the mask)
Example: mask 0x0003 (both bit 0 and bit 1 are 1)
= {A1,B3,B5,C2,C7,D1,D4,E6} XOR {A1,B3,B5,C2,C7,D2,D5,E7}
= {D1,D2,D4,D5,E6,E7} ← shared bits XOR out
Validation: Compiled with multiple masks (0xFFFF, 0x8888, 0x6996, etc.); XOR-linear predictions match actual diffs exactly.
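The XOR rule above is the symmetric difference of the per-minterm bit sets. The sketch below reuses the symbolic bit labels from the simplified example (they are placeholders, not real CRAM addresses); `predict_cram_bits` is an illustrative name.

```python
from functools import reduce


def predict_cram_bits(mask: int, minterm_bits: dict[int, frozenset[str]]) -> frozenset[str]:
    """XOR together the CRAM bit sets of every '1' bit in the mask."""
    active = [minterm_bits[i] for i in range(16) if (mask >> i) & 1]
    return reduce(lambda acc, bits: acc ^ bits, active, frozenset())


# Symbolic labels from the example above, not real addresses
MINTERM_BITS = {
    0: frozenset({"A1", "B3", "B5", "C2", "C7", "D1", "D4", "E6"}),
    1: frozenset({"A1", "B3", "B5", "C2", "C7", "D2", "D5", "E7"}),
}

predicted = predict_cram_bits(0x0003, MINTERM_BITS)
print(sorted(predicted))  # ['D1', 'D2', 'D4', 'D5', 'E6', 'E7']
```

With all 16 minterm sets measured at one LE, the same function predicts the diff for any mask, and comparing that prediction with a compiled diff is the validation step described above.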
Truth-Table CRAM Structure
The 16-bit truth table for each LE is encoded in 8 CRAM byte pairs:
Each pair contains:
┌──────────────────────────────────────────────────────────┐
│ ctrl_lo (1 byte) ← control byte (low), identifies Y row │
│ ctrl_hi (1 byte) ← control byte (high), adjacent to lo │
│ data_0 (1 byte) ← data byte 0, +48 bytes past ctrl │
│ data_1 (1 byte) ← data byte 1, adjacent to data_0 │
└──────────────────────────────────────────────────────────┘
ctrl → data offset: 48 bytes (fixed)
pair → pair spacing: 210 bytes (fixed)
7 pair spacings × 210 bytes + 48-byte ctrl→data offset = 1,518-byte CRAM span
Pair-to-TT-bit mapping (pair number → TT bit index):
Pair 0: TT[7] (lo byte), TT[15] (hi byte)
Pair 1: TT[6] (hi byte), TT[14] (lo byte)
Pair 2: TT[5] (lo byte), TT[13] (hi byte)
Pair 3: TT[4] (hi byte), TT[12] (lo byte)
Pair 4: TT[3] (lo byte), TT[11] (hi byte)
Pair 5: TT[2] (hi byte), TT[10] (lo byte)
Pair 6: TT[1] (lo byte), TT[9] (hi byte)
Pair 7: TT[0] (hi byte), TT[8] (lo byte)
Formula: pair = 7 - (bit % 8), byte side alternates
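The "byte side alternates" rule can be made explicit. This sketch derives the side purely from the eight rows listed above (even pairs put the low TT half in the lo byte, odd pairs put it in the hi byte); `tt_bit_location` is a hypothetical helper, not a function from bitstream.py.

```python
def tt_bit_location(bit: int) -> tuple[int, str]:
    """Map a truth-table bit index (0-15) to (pair number, byte side)."""
    if not 0 <= bit <= 15:
        raise ValueError("truth-table bit index must be 0..15")
    pair = 7 - (bit % 8)
    low_half = bit < 8  # TT[0..7] versus TT[8..15]
    # Even pairs: low half in the lo byte; odd pairs: low half in the hi byte
    side = "lo" if (pair % 2 == 0) == low_half else "hi"
    return pair, side


for bit in (7, 15, 6, 14, 0, 8):
    print(bit, tt_bit_location(bit))
```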
Phase 2.2: CRAM Address Model (376/376 positions verified — 100%)
This is the most fundamental discovery of the entire reverse-engineering effort — a complete mapping formula from (X, Y, N) coordinates to CRAM byte addresses.
Column Base Addresses
The bitstream is organized by column; each LAB column occupies a contiguous CRAM region:
Standard LAB column width: 7,350 bytes (0x1CB6)
Col X=3 Col X=4 Col X=6 Col X=7
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ 7,350 │ │ 7,350 │ │ 7,350 │ │ 7,350 │ ...
│ bytes │ │ bytes │ │ bytes │ │ bytes │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
0x076E0 0x09396 0x0CD02 0x0E9B8
+7,350 +14,700 +7,350
(skip M9K)
Non-standard column widths appear at special resource boundaries:
- M9K RAM (X=5, 9, 20, 27, 30): need extra space
- DSP multipliers (X=14–15): huge CRAM region (76,230-byte jump)
- PLL: near X=27
Y Address Formula (slot/group Encoding)
This is the most intricate part. The 18 Y coordinates do not map to consecutive addresses; they use a "three-row interleaved" encoding:
cram_row = Y - 2       # Y=2 → 0, Y=3 → 1, ..., Y=21 → 19
slot = cram_row % 3    # groups of 3 rows; slot = 0, 1, or 2
group = cram_row // 3  # group index, 0 to 6

# slot determines the base offset
SLOT_BASE = {0: 136, 1: 0, 2: 70}  # bytes

# group determines the fine offset and bit position;
# slot-0 rows above group 0 sit one extra byte later (see the table below)
byte_offset = SLOT_BASE[slot] + group * 3 + (1 if slot == 0 and group > 0 else 0)
bit_position = 7 - group - (1 if slot > 0 else 0)
Why this seemingly complex encoding? Because Cyclone IV's CRAM is physically scanned by row, and each CRAM byte must serve the switches for multiple Y rows. slot determines the physical location; group determines which bit within the byte.
Complete Y mapping table:
| Y | slot | group | byte offset | bit | Note |
|---|---|---|---|---|---|
| 2 | 0 | 0 | +136 | bit-7 | bottom edge |
| 3 | 1 | 0 | +0 | bit-6 | |
| 4 | 2 | 0 | +70 | bit-6 | |
| 5 | 0 | 1 | +140 | bit-6 | |
| 6 | 1 | 1 | +3 | bit-5 | |
| 7 | 2 | 1 | +73 | bit-5 | |
| 8 | 0 | 2 | +143 | bit-5 | |
| 9 | 1 | 2 | +6 | bit-4 | |
| 10 | 2 | 2 | +76 | bit-4 | |
| 11 | 0 | 3 | +146 | bit-4 | |
| 12 | 1 | 3 | +9 | bit-3 | |
| 13 | 2 | 3 | +79 | bit-3 | |
| 14 | 0 | 4 | +149 | bit-3 | |
| 16 | 2 | 4 | +82 | bit-2 | Y=15 skipped |
| 17 | 0 | 5 | +152 | bit-2 | |
| 18 | 1 | 5 | +15 | bit-1 | |
| 19 | 2 | 5 | +85 | bit-1 | |
| 21 | 1 | 6 | +18 | bit-0 | Y=20 skipped, top edge |
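The table can be regenerated from the slot/group rule, which is a useful regression check. One detail is taken from the table itself and treated as an assumption here: slot-0 rows with group > 0 appear one byte later than SLOT_BASE + 3 × group (for example Y=5 at +140, not +139). `y_to_cram` is an illustrative name.

```python
LAB_Y = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 21]
SLOT_BASE = {0: 136, 1: 0, 2: 70}  # bytes


def y_to_cram(y: int) -> tuple[int, int, int, int]:
    """Return (slot, group, byte_offset, bit_position) for a LAB row."""
    cram_row = y - 2
    slot = cram_row % 3
    group = cram_row // 3
    # Assumption read off the table: slot 0 gains one byte once group > 0
    bump = 1 if slot == 0 and group > 0 else 0
    byte_offset = SLOT_BASE[slot] + 3 * group + bump
    bit_position = 7 - group - (1 if slot > 0 else 0)
    return slot, group, byte_offset, bit_position


for y in LAB_Y:
    slot, group, offset, bit = y_to_cram(y)
    print(f"Y={y:2d} slot={slot} group={group} +{offset:<3d} bit-{bit}")
```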
N (LE Index) Address Formula
The 16 LEs within a LAB have addresses that decrease by the following pattern:
N=0 → offset 0 (reference)
N=2 → -2
N=4 → -8
N=6 → -10
N=8 → -16
N=10 → -18
N=12 → -24
N=14 → -26
N=16 → -38 (crosses LAB midpoint boundary, extra -12)
N=18 → -40
... and so on
Step sequence: -2, -6, -2, -6, -2, -6, -2, -12, -2, -6, -2, -6, -2, -6, -2
Formula: delta(N) = -(half * 38) - (kh // 2) * 8 - (kh % 2) * 2
where k = N/2, half = k//8, kh = k%8
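A direct transcription of the delta formula makes it easy to confirm both the listed offsets and the step sequence. `le_delta` is an illustrative name.

```python
def le_delta(n: int) -> int:
    """Address delta of LE index N relative to N=0, per the formula above."""
    if n % 2 or not 0 <= n <= 30:
        raise ValueError("N must be an even value in 0..30")
    k = n // 2
    half, kh = divmod(k, 8)  # half = k // 8, kh = k % 8
    return -(half * 38) - (kh // 2) * 8 - (kh % 2) * 2


deltas = [le_delta(n) for n in range(0, 32, 2)]
steps = [b - a for a, b in zip(deltas, deltas[1:])]
print(deltas)
print(steps)
```

The single -12 step in the middle of `steps` is the LAB midpoint crossing between N=14 and N=16.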
Complete Address Calculation Example
Problem: Where in the RBF is the ctrl byte of LUT TT pair 3 for the LE at (X=10, Y=10, N=6)?
1. Look up column base: COLUMN_BASE[10] = 0x13FDA = 81,882
2. Compute period_start = 81,882 - 136 = 81,746
3. Y=10: cram_row=8, slot=2, group=2
slot_base[2] = 70, offset = 70 + 2*3 = 76
4. pair=3 → pair offset = 3 * 210 = 630
5. N=6: k=3, half=0, kh=3 → delta = -(1*8 + 1*2) = -10
6. Final address = 81,746 + 76 + 630 + (-10) = 82,442 = 0x1420A
7. Bit position = (6 - 2) = 4, i.e. bit-4
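The seven steps combine into one function. This is a sketch under the formulas in this section, not the project's `cram_ctrl_addr`; the name `lut_ctrl_address` is hypothetical, the column base is passed in because only a few COLUMN_BASE entries are listed here, and the slot-0 adjustment follows the Y mapping table.

```python
SLOT_BASE = {0: 136, 1: 0, 2: 70}  # bytes
PAIR_SPACING = 210                 # bytes between consecutive pairs


def lut_ctrl_address(column_base: int, y: int, n: int, pair: int) -> tuple[int, int]:
    """Return (RBF byte address, bit position) of a LUT TT ctrl byte."""
    period_start = column_base - 136
    cram_row = y - 2
    slot, group = cram_row % 3, cram_row // 3
    # Slot-0 rows above group 0 sit one byte later, per the Y mapping table
    y_offset = SLOT_BASE[slot] + 3 * group + (1 if slot == 0 and group > 0 else 0)
    half, kh = divmod(n // 2, 8)
    n_delta = -(half * 38) - (kh // 2) * 8 - (kh % 2) * 2
    address = period_start + y_offset + pair * PAIR_SPACING + n_delta
    bit_position = 7 - group - (1 if slot > 0 else 0)
    return address, bit_position


# Worked example: X=10 (COLUMN_BASE[10] = 0x13FDA), Y=10, N=6, pair 3
address, bit = lut_ctrl_address(0x13FDA, y=10, n=6, pair=3)
print(address, bit)  # 82442 4
```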
Phase 2.3: DFF (D Flip-Flop) Configuration
Challenge: Quartus Lite rejects placement constraints of the form LCFF_Xx_Yy_Nn. Unlike LUTs, we cannot precisely control where flip-flops are placed.
Solution:
- Use specific output pins to "attract" the flip-flop to the target column (Quartus automatically places FFs near the LAB closest to the output pin)
- Use pair-diff to isolate FF-related bits
Findings:
- Each LE has 4 FF pairs (vs. 8 for LUT), using the same ctrl+data structure
- FF pairs are split to either side of the LUT TT region (half below, half above)
- Adding a basic DFF changes ~362 bits, of which ~90% are routing and ~10% are LE config
- FF mode bits (async reset / sync enable): 82 shared mode bits + feature-specific routing
Phase 2.4: Arithmetic Mode (initial claim — later revised)
Early mining compiled an adder using a + b with VIRTUAL_PIN and, by diffing against a normal-mode LUT, reported "~92 arithmetic/carry-chain bits per LE" distributed on both sides of the LUT TT region. That claim turned out to be wrong: VIRTUAL_PIN produces ghost routing bits that vanish on real-pin recompiles (same lesson as Phase 5.0 for M9K/DSP). The real arithmetic activation is a LAB-level mode switch in the block band (frames 1692-1738), not a per-LE cell region — see the Phase 5.4 follow-up narrative for details.
Phase 3: Cracking the Routing Matrix (In Progress)
The routing matrix is the "wire network" that connects all the LEs in an FPGA. This is the hardest part of the reverse-engineering effort.
Routing Resource Types
EP4CE6 routing resources:
┌─────────┐ C4 wire ┌─────────┐
│ LAB │ ←──(~4 rows)───→ │ LAB │
│ (X,Y) │ │ (X,Y+4) │
└────┬────┘ └─────────┘
│
R4 wire (~4 columns)
│
┌────┴────┐
│ LAB │
│ (X+4,Y) │
└─────────┘
C4 = Column wire, spans ~4 rows (21,816 wires)
R4 = Row wire, spans ~4 columns (28,186 wires)
C16 = Column wire, spans ~16 rows (1,326 wires)
R24 = Row wire, spans ~24 columns (1,289 wires)
LOCAL_INTERCONNECT = LAB-internal input mux
LE_BUFFER = LE output buffer
Methodology
- STA routing extraction: After compilation, run report_timing -show_routing to get the wire names on each path. Example: A → LCCOMB_X10_Y10 → C4_X10_Y10_N0_I0 → LOCAL_INTERCONNECT_X10_Y14 → LCCOMB_X10_Y14 → Q
- Control distance: Change the distance between two LUTs to force different routing resource types.
  - Same column, dy=1: direct connection
  - Same column, dy=2–4: 1 × C4 wire
  - Same column, dy=5–8: 2 × C4 wires
  - Same column, dy=9+: 3 × C4 wires
  - Same row, dx=1–4: R4 wires
- Routing determinism: After setting ROUTER_TIMING_OPTIMIZATION_LEVEL MINIMUM, routing is completely deterministic: 5 different fitter seeds produce identical bitstreams.
- Parallel compilation: Python multiprocessing.Pool (4 workers), effective speed ~4 sec/target.
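The distance rule for same-column routes can be captured as a small lookup, which helps when choosing LUT pairs that should exercise a given number of C4 wires. This is a sketch of the observed ranges only; `expected_c4_wires` is a hypothetical helper, and treating upward and downward distances alike is an assumption.

```python
def expected_c4_wires(dy: int) -> int:
    """Number of C4 wires expected for a same-column route of vertical distance dy."""
    distance = abs(dy)  # assumption: direction does not change the count
    if distance == 0:
        raise ValueError("source and destination must be in different rows")
    if distance == 1:
        return 0  # direct connection, no C4 wire
    if distance <= 4:
        return 1
    if distance <= 8:
        return 2
    return 3


# The STA example above goes from Y=10 to Y=14 through one C4 wire
print(expected_c4_wires(14 - 10))
```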
C4 Switch CRAM Address Model (Verified: 63 wires, 0 false predictions)
# CRAM address for C4_X{x}_Y{y}_N0_I0
SLOT_BASE = {0: 2405, 1: 2475, 2: 2338}
group = (y - 2) // 3
slot = (y - 2) % 3
byte_offset = LAB_CRAM_END(x) + SLOT_BASE[slot] + 3 * group
bit_position = (6 - group) if slot == 2 else (7 - group)
This model uses exactly the same slot/group encoding framework as LUT TT (since they share the same CRAM address space), just with different base addresses.
R4 Switch CRAM Address Model (25 of 37 I-indices Mapped)
R4 row-wire switches are more complex than C4 — each R4 "I-index" has an independent BASE address:
# CRAM address for R4_X{wx}_Y{wy}_N0_I{idx}
prev_lab_x = max(x for x in LAB_X if x < wx)  # LAB column just before the wire's X
prev_col_start = COLUMN_BASE[prev_lab_x] - 136
group = (wy - 2) // 3
slot = (wy - 2) % 3

# Three slots use different offset formulas
if slot == 0:
    byte = prev_col_start + R4_BASE + 66 + 3*group + (1 if group > 0 else 0)
    bp = 7 - group
elif slot == 1:
    byte = prev_col_start + R4_BASE + (-70) + 3*group
    bp = 6 - group  # Note: NOT 7-group! This was a past mistake.
else:  # slot == 2
    byte = prev_col_start + R4_BASE + 3*group
    bp = 6 - group
R4_BASE lookup table (each I-index has two pair base addresses, all in the PREV column):
| I-index | BASE pair1 | BASE pair2 | delta | Verified |
|---|---|---|---|---|
| 0 | 3423 | 3842 | 419 | Multiple columns |
| 1 | 3431 | 3850 | 419 | Multiple columns |
| 2 | 3431 | 3851 | 420 | prev=X4,X6,X10,X24,X28 |
| 3 | 3474 | 3895 | 421 | 3 Y values, cross-col |
| 4 | 3423 | 3842 | 419 | Same as I=0 |
| 7 | 3414 | 3835 | 421 | Same as I=10 |
| 10 | 3414 | 3835 | 421 | Multiple columns |
| 11 | 3378 | 3585 | 207 | 2 Y values |
| 12 | 3597 | 3806 | 209 | 2 Y values |
| 13 | 3577 | 3786 | 209 | Same as I=15 |
| 14 | 3191 | TBD | ? | pair1 verified, pair2 unconfirmed |
| 15 | 3577 | 3786 | 209 | prev=X12,X16,X24 |
| 16 | 3629 | 3835 | 206 | 2 Y values |
| 17 | 2802 | 3223 | 421 | 5 columns verified |
| 18 | 4057 | 4267 | 210 | 2 columns verified |
| 20 | 2791 | 3001 | 210 | Partial columns |
| 22 | 2783 | 2993 | 210 | Small sample |
| 25 | 2762 | 2972 | 210 | 2 columns verified |
Key findings:
- R4 switches live in the PREV column: The CRAM bits for R4_X22 are in the X=21 column. This is consistent with the physical topology of the FPGA switching matrix, where row-wire switches are controlled separately in each column they pass through.
- Column dependency: All I-indices work correctly at standard-width (7,350-byte) columns but fail at large columns near M9K/DSP boundaries (X=13: 76,230 bytes; X=26: 68,880 bytes). Those large columns have internal sub-regions that require a more complex address model.
- Two pair-spacing patterns: delta ≈ 420 for (I=0,1,2,4,7,10) and delta ≈ 210 for (I=18,20,22,25). The former spans two 210-byte periods; the latter uses adjacent periods.
- Shared BASE values: I=0 and I=4 share the same BASE; I=7 and I=10 share the same BASE.
- R4 wires are not only at LAB columns: 31% of R4 wires appear at non-LAB X coordinates (e.g. X=5,9,14,15,20,27,30,32,33), but their switch bits are still in the nearest LAB column.
LOCAL_INTERCONNECT Switch Model (Verified: 70% cross-validation, 22 columns)
LOCAL_INTERCONNECT is the LAB-internal input multiplexer — it decides which signals get connected to LE input ports.
# CRAM address for LOCAL_INTERCONNECT_X{lx}_Y{ly}_N{ln}_I{li}
SLOT_OFFSET = {0: 67, 1: -70, 2: 0}  # same offsets as R4
col_start = COLUMN_BASE[lx] - 136    # Note: SELF column, NOT prev column!
group = (ly - 2) // 3
slot = (ly - 2) % 3
byte = col_start + 70 + pair * 210 + SLOT_OFFSET[slot] + 3 * group
bp = (6 - group) if slot == 2 else (7 - group)
Key characteristics:
- In the self column: Unlike R4, LOCAL_INTERCONNECT bits are in the same column's CRAM. This makes sense, since the LAB input mux is part of the LAB's own configuration.
- Multiple pairs: Each I-index activates 1–9 pairs (in pair range 0–8, the lowest CRAM region), forming 4 fixed activation patterns:

| Pattern | Activated pairs | Applicable I-indices |
|---|---|---|
| All 9 | 0,1,2,3,4,5,6,7,8 | I=2,15,16,18,22,33,34,35,36,37 |
| Skip 3,7 | 0,1,2,4,5,6,8 | I=0,30,31 |
| Even pairs | 0,2,4,6,8 | I=24,26,28,29,32 |
| First 2 per block | 0,1,4,5,8 | I=4,17,27 |

- Pairs 0 and 4 are universal: Regardless of I-index, these two pairs are always activated.
R24 Switch CRAM Address Model (I=0 Mapped)
R24 row wires span ~24 columns. Their switches use a fixed byte offset model — simpler than R4:
# CRAM address for R24_X{wx}_Y{wy}_N0_I0
prev_lab_x = max(x for x in LAB_X if x < wx)
prev_col_start = COLUMN_BASE[prev_lab_x] - 136
group = (wy - 2) // 3
slot = (wy - 2) % 3
bp = (6 - group) if slot == 2 else (7 - group)  # same bp formula as R4/C4

# Fixed byte offsets: NO slot/group byte adjustment
R24_I0_OFFSETS = [3124, 2705]  # primary (pair 14, pos 184), secondary (pair 12, pos 185)
for offset in R24_I0_OFFSETS:
    byte = prev_col_start + offset  # same byte regardless of Y!
Key difference from R4: The byte address is fixed per pair — multiple Y values map to the same byte with only bp varying. This means reads are ambiguous if multiple Y values share the same bp (which happens when they're in the same group).
- R24 switches are in the PREV LAB column (same as R4)
- Primary pair: rel=3124 (pair 14, pos 184); secondary: rel=2705 (pair 12, pos 185), delta=419
- 5–6 wx columns verified at 66% accuracy via pair-diff
- 7 unique R24 I-indices observed; only I=0 (73% of wires) mapped
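Because only `bp` varies with Y, it is worth listing which LAB rows collide on the same bit position. The sketch below applies the bp formula quoted above to the 18 LAB rows; the function names are illustrative.

```python
from collections import defaultdict

LAB_Y = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 21]


def r24_bit_position(wy: int) -> int:
    """Bit position for an R24 I=0 switch at row wy, per the formula above."""
    group, slot = divmod(wy - 2, 3)
    return (6 - group) if slot == 2 else (7 - group)


def rows_sharing_bp() -> dict[int, list[int]]:
    """Group LAB rows by bit position; rows in one list cannot be told apart."""
    by_bp: dict[int, list[int]] = defaultdict(list)
    for y in LAB_Y:
        by_bp[r24_bit_position(y)].append(y)
    return dict(by_bp)


collisions = rows_sharing_bp()
for bp in sorted(collisions, reverse=True):
    print(f"bp={bp}: Y in {collisions[bp]}")
```

Every bit position is shared by two or three rows, so a set bit in an R24 switch byte narrows the wire's Y to a small band rather than a single row.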
C16 Switch Analysis (Not Yet Mapped)
C16 column wires span ~16 rows. Preliminary analysis shows their encoding is fundamentally different from C4/R4:
- Pair boundary bytes (pos=209/0) show multi-bit changes, not single-bit switches
- XOR patterns across columns are inconsistent — no universal slot/group formula
- Routes using C16 are noisy (3–6 R4, 2–5 C4 wires per path), making isolation difficult
- Likely requires per-wire lookup table or a completely different methodology
C4 I≠0 Switch Model (24 per-(X,I) Mappings — 11 I-indices)
C4 I≠0 switches use the same fixed byte offset model as R24 — the byte address is constant for all Y values, and only the bit position varies:
# CRAM address for C4_X{wx}_Y{wy}_N0_I{ii} (I≠0)
byte = C4_FIXED_OFFSETS[(wx, ii)]  # absolute RBF byte offset: fixed, Y-independent
group = (wy - 2) // 3
slot = (wy - 2) % 3
bp = (6 - group) if slot == 2 else (7 - group)  # same formula as C4 I=0
Discovery method: baseline-diff — compile each route, diff against baseline.rbf, look for bytes in the self column whose bit at the expected bp is flipped. A byte that fires for multiple Y values of the same (wx, I) is the switch byte.
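The search described above can be sketched as a vote across rows: for each route, keep only the flipped bits whose position matches the expected `bp`, then report bytes that show up for more than one Y. The function names and the example offsets are made up for illustration, and the restriction to the self column is left out for brevity.

```python
from collections import Counter


def c4_bit_position(wy: int) -> int:
    """Expected bit position for a C4 switch at row wy (same formula as C4 I=0)."""
    group, slot = divmod(wy - 2, 3)
    return (6 - group) if slot == 2 else (7 - group)


def find_switch_bytes(flips_by_y: dict[int, set[tuple[int, int]]], min_rows: int = 2) -> list[int]:
    """flips_by_y maps wire Y -> {(byte_offset, bit_position)} flipped vs. baseline."""
    hits: Counter[int] = Counter()
    for wy, flips in flips_by_y.items():
        expected_bp = c4_bit_position(wy)
        # Count each byte at most once per row
        for byte_offset in {off for off, bit in flips if bit == expected_bp}:
            hits[byte_offset] += 1
    return sorted(off for off, count in hits.items() if count >= min_rows)


# Made-up diffs for one (wx, I) at three rows
flips_by_y = {
    5: {(90000, 6), (90500, 2)},
    8: {(90000, 5), (91000, 5)},
    11: {(90000, 4), (90500, 7)},
}
candidates = find_switch_bytes(flips_by_y)
print(candidates)  # [90000]
```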
Mapped positions (24 per-(X,I) entries in _C4_FIXED_OFFSETS in bitstream.py):
| I-index | Columns mapped | Hit rate |
|---|---|---|
| 1 | X=9, 15, 16, 25 | 4–5/6 |
| 3 | X=13, 22, 25 | 2–3/5 |
| 7 | X=13 | 2/2 |
| 8 | X=13 | 2/3 |
| 9 | X=10, 28, 30 | 2–3/3 |
| 10 | X=9, 28, 29 | 2–5/6 |
| 12 | X=9, 10, 22, 25 | 5–6/6 |
| 14 | X=25 | 2/2 |
| 15 | X=16 | 2/2 |
| 20 | X=9 | 2/2 |
| 23 | X=22, 29 | 2–3/3 |
Key findings:
- Fixed byte, varying bp: Unlike R4 (which adjusts byte offset per slot/group), the byte address for C4 I≠0 is Y-independent. Only bp encodes the Y coordinate.
- No universal formula: The pair index varies per column for the same I-index. Per-(X,I) lookup is required.
- pos always 184 or 185: All switches land at data-byte positions within the 210-byte period (identical to the LUT TT data byte positions).
- Shared bytes: I=3 and I=12 map to the same byte at X=22 and X=25. These two I-indices are indistinguishable by CRAM inspection alone at those columns.
- Non-LAB columns have larger pair numbers: X=9, X=15, X=30 have pair indices of 58, 382, 53 respectively, consistent with their wider CRAM regions.
- RouteCodec integration: read_c4() now handles both I=0 (formula) and I≠0 (lookup table) in a single call.
Routing Bit CRAM Distribution
One LAB column's CRAM (~7,350 bytes):
┌──────────────────────┐ low address
│ FF region A │
│ (2 pairs, ~420 B) │
├──────────────────────┤
│ │
│ LUT TT region │ 8 pairs × 210 bytes = ~1,680 bytes
│ + routing switches │ ← routing bits are interleaved with LUT TT pairs!
│ interleaved │
│ │
├──────────────────────┤
│ FF region B │
│ (2 pairs, ~420 B) │
├──────────────────────┤
│ C4 switch region │
│ R4 switch region │
│ other routing sw. │
└──────────────────────┘ high address
Routing switch bits and LUT TT bits use the same ctrl+data pair structure, interleaved at 210-byte spacing. The bit position (0–7) encodes the physical target Y region, not routing distance.
Bitstream Codec (bitstream.py)
Based on the above findings, we have built a fully functional codec that can read and write both LUT truth tables and routing switch states.
Core Class: RouteCodec
RouteCodec is the heart of the codec. It takes an RBF file and a zero-baseline file as reference:
    from bitstream import RouteCodec

    # Create codec instance
    codec = RouteCodec("design.rbf", "zero_baseline.rbf")
Why two files? Because FPGA CRAM bits have a polarity issue — some bits are "1=enabled" and others are "0=enabled". By comparing an empty design against the zero baseline, the codec knows the default polarity of each bit and can correctly interpret on/off state.
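One way to picture the polarity handling is as an XOR against the baseline. The sketch below is an assumption about the idea, not the actual RouteCodec internals; switch_is_active is a hypothetical helper.

```python
def switch_is_active(design: bytes, zero: bytes, offset: int, bit: int) -> bool:
    """Treat a cell as active when it differs from the zero baseline,
    regardless of whether its idle value is 0 or 1."""
    return bool(((design[offset] ^ zero[offset]) >> bit) & 1)


# Byte 0, bit 2 idles at 1 ("0 = enabled"); byte 1, bit 0 idles at 0 ("1 = enabled").
zero = bytes([0b0000_0100, 0b0000_0000])
design = bytes([0b0000_0000, 0b0000_0001])

print(switch_is_active(design, zero, 0, 2))  # True
print(switch_is_active(design, zero, 1, 0))  # True
print(switch_is_active(design, zero, 0, 0))  # False
```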
LUT Truth-Table Read/Write
    # Step 1: Calibrate a position (needs 16 minterm pair-diffs, ~2.5 min)
    python3 runner.py n_sweep 10 10

    # Step 2: Read LUT truth table from RBF
    python3 analyze.py read_tt design.rbf zero.rbf 10 10 0
    # Output: mask = 0x8888 (A & B)

    # Step 3: Write LUT truth table to RBF
    python3 analyze.py write_tt zero.rbf 0x6996 output.rbf 10 10 0
    # Produces RBF with an XOR gate (A ^ B ^ C ^ D)
Verification results:
- CRAM region is bit-identical to Quartus output
- Only 14–16 bits differ in the header/CRC section (Quartus metadata; does not affect configuration)
- Verified masks: 0x0000, 0x0001, 0x8888, 0x6996, 0xFFFF, 0xAAAA, 0x5555, 0xDEAD, and more — 10 masks total
End-to-end hardware verification (2026-04-06):
The codec was verified on physical hardware (Heijin AX301 board, EP4CE6F17C8):
1. Codec write_tt(zero_baseline, mask=0x8888) → e2e_codec_and.rbf
2. Flash to FPGA: openFPGALoader -c usb-blaster e2e_codec_and.rbf
3. Hardware behavior: LED ON by default (keys floating high),
press KEY2 or KEY3 → LED OFF (correct A & B with active-low inputs)
4. Codec write_tt(zero_baseline, mask=0x6996) → e2e_codec_xor.rbf
5. Flash to FPGA: openFPGALoader -c usb-blaster e2e_codec_xor.rbf
6. Hardware behavior: press either key → LED ON, both keys → LED OFF (correct A ^ B)
The codec-generated RBF produces exactly the expected logic behavior — the bitstream codec works correctly end-to-end without going through Quartus.
Note: AX301 buttons are active-low (unpressed = logic 1, pressed = logic 0). The correct openFPGALoader path is $HOME/see_neorv32_run_linux/tools/openFPGALoader/build/openFPGALoader.
Routing Switch Read/Write (New)
The codec now also supports reading and writing routing matrix switch states:
    from bitstream import RouteCodec

    codec = RouteCodec()
    design = open("design.rbf", "rb").read()
    zero = open("zero_baseline.rbf", "rb").read()

    # ========== Read all routing switches ==========
    sw = codec.read_switches(design, zero)
    # Returns {'c4': [...], 'r4': [...], 'r24': [...], 'li': [...]}
    # Each entry: (wire_name, byte_offset, bit_pos, candidates)
    # LI wire names use base granularity: "LI_X10_Y5_P3B0"
    #   P3 = pair index, B0 = base offset 70 (B1 = base offset 71)

    # ========== Write switches into a blank baseline ==========
    ops = [
        {'type': 'c4', 'x': 10, 'y': 5, 'i_idx': 0},
        {'type': 'c4', 'x': 13, 'y': 8, 'i_idx': 3},  # uses I≠0 lookup
        {'type': 'r4', 'wx': 22, 'y': 8, 'i_idx': 17},
        {'type': 'li', 'lx': 10, 'ly': 5,
         'pair_bases': [(0,0),(0,1),(2,0),(2,1),(4,0),(4,1),(6,0),(6,1),(8,0)]},
    ]
    new_rbf = codec.apply_routing(zero, ops)

    # ========== Pre-flash hardware safety check ==========
    codec.validate_safe_for_hardware(new_rbf, zero)
    # Raises RuntimeError if any LAB has an LI activation pattern outside the
    # known-safe envelope (wrong cell count, broken paired/alternating mode, etc.)

    open("output.rbf", "wb").write(new_rbf)
Note the li op now requires an explicit pair_bases list — implicit "expand an I-index into all 9 pairs" was removed because real Quartus never activates more than 9 specific cells per LAB, and auto-expansion would have been a physical-contention hazard.
Current coverage:
- C4 I=0: 100% (all 63 wires correct)
- R4: 25/37 I-indices mapped (12 remaining blocked on insufficient route corpus)
- R24 I=0: mapped with fixed-byte model (~66% pair-diff accuracy), 73% of R24 wires
- LOCAL_INTERCONNECT: full read/write at base granularity, two encoding modes resolved
- C4 I≠0: no universal formula, per-wire lookup table (24 entries)
- C16: not yet mapped (fundamentally different multi-bit encoding)
Routing Codec Round-Trip + Hardware Safety Guard
After basic read/write was working, the next question was: does our codec actually round-trip? If we read all the routing switches out of a real Quartus RBF, then write them back into a blank baseline using only our own write methods, do we get the same set of cells back?
Round-trip self-consistency test
route_roundtrip.py runs this experiment:
real Quartus RBF ──► RouteCodec.read_switches() ──► list of switch ops
│
▼
blank zero baseline ──► RouteCodec.apply_routing(ops)
│
▼
re-read with read_switches()
│
compare against the original read
If the codec is consistent, the two reads must agree exactly: zero dropped cells, zero hallucinated cells. Note this is not a "match Quartus byte-for-byte" test — that would also require encoding LUT TT, IO buffers, etc. We're only testing the routing layer in isolation.
Result for both a column route (Y10→Y5) and a row route (X10→X22):
column route Y10→Y5: OK orig=52 repro=52 common=52
row route X10→X22: OK orig=65 repro=65 common=65
0 dropped, 0 hallucinated. The routing codec is internally consistent.
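The comparison step reduces to set arithmetic on cells. A minimal sketch, assuming each read entry can be reduced to an (offset, bit) tuple; compare_reads is a hypothetical name, not a function from route_roundtrip.py:

```python
def compare_reads(orig, repro):
    """Compare two switch reads given as iterables of (offset, bit) cells."""
    orig_set, repro_set = set(orig), set(repro)
    return {
        "dropped": sorted(orig_set - repro_set),       # in the Quartus read, lost on replay
        "hallucinated": sorted(repro_set - orig_set),  # appeared only after replay
        "common": len(orig_set & repro_set),
    }


orig = [(1000, 7), (1210, 6), (4184, 3)]
good = compare_reads(orig, [(4184, 3), (1000, 7), (1210, 6)])
bad = compare_reads(orig, [(1000, 7), (1210, 5)])
print(good)
print(bad)
```

A passing round trip is the case where both lists are empty and the common count equals the size of the original read.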
To make this work we had to add two things:
- write_c4_inz(): C4 with non-zero I index uses fixed byte offsets instead of the universal slot/group formula. We mined 24 (X, I) → byte mappings by baseline-diffing fresh compiles.
- 'raw' switch type: for R24/LOCAL_INTERCONNECT, the per-wire write methods set more cells than a single read entry corresponds to (one wire activates 2+ cells). When replaying a read, we instead emit raw ops that flip exactly one (offset, bit), the same granularity as the read.
Hardware safety guard V2
Before flashing a codec-generated RBF to a real AX301 board, we want to refuse anything that could short out a LAB input mux. Multiple routing channels driving the same LE input port at the same time is a physical-contention hazard on real silicon.
RouteCodec.validate_safe_for_hardware(rbf, zero) scans the RBF for LOCAL_INTERCONNECT activations and refuses to pass anything Quartus has never been observed to produce.
    codec = RouteCodec()
    codec.validate_safe_for_hardware(my_rbf, zero_rbf)  # raises if unsafe
The "what's safe" envelope was discovered empirically. After dropping a sloppy break in read_local_interconnect() that was hiding cell-level structure, we re-ran a 21-LAB sweep and found that every Quartus LI activation falls into one of two well-defined modes, each with exactly 9 active cells per LAB:
- Paired mode (13/21 LABs, mostly column moves): P0 paired (both bases set) + 4 middle pairs paired + P8 tail (one base) = 9 cells
- Alternating mode (8/21 LABs, mostly row moves): P0..P7 each with one base, alternating B1,B0,B1,...,B0 + P8 tail = 9 cells
- Universal anchors: P0 and P8 always present; cell (P0, B1) is in every observed class
What is a "pair" and a "base"? LOCAL_INTERCONNECT cells live in a 210-byte-period region of each LAB column. In each period, two CRAM bytes at offsets 70 and 71 ("base 70" / "base 71", i.e. B0/B1) are the LI bytes. The pair index P0..P8 is which 210-byte period within the column we're in.
The V2 classifier _classify_li_lab(pair_map) tags any LAB as paired, alternating, or invalid (with a reason). The guard refuses to pass:
- Any LAB with more than 9 active cells
- Missing P0 or P8 anchor; P8 doubly set
- Paired mode with single-base middle pairs (broken paired)
- Alternating mode with the wrong base on any pair, or any doubled middle pair
This was tested against all 6 observed Quartus classes (all accepted) and 4 synthetic violation cases (all rejected). The previous V1 guard (max 5 pairs per LAB) was actually wrong: it would have false-rejected 13 of the 21 legitimate Quartus configurations.
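The mode-independent part of these rules (cell count and anchors) is easy to sketch. The function below is a hypothetical illustration, not _classify_li_lab itself; it deliberately skips the paired/alternating checks and takes the active cells of one LAB as (pair, base) tuples.

```python
def check_li_anchors(cells):
    """Return a list of rule violations for one LAB's active LI cells.

    cells: iterable of (pair, base) with pair in 0..8 and base in {0, 1}.
    Only the count and anchor rules are checked here.
    """
    cells = set(cells)
    pairs = {p for p, _ in cells}
    problems = []
    if len(cells) > 9:
        problems.append("more than 9 active cells")
    if 0 not in pairs:
        problems.append("missing P0 anchor")
    if 8 not in pairs:
        problems.append("missing P8 anchor")
    if (8, 0) in cells and (8, 1) in cells:
        problems.append("P8 doubly set")
    return problems


# The pair_bases list from the apply_routing() example passes these checks.
example = [(0, 0), (0, 1), (2, 0), (2, 1), (4, 0), (4, 1), (6, 0), (6, 1), (8, 0)]
print(check_li_anchors(example))             # []
print(check_li_anchors(example[:-1]))        # ['missing P8 anchor']
print(check_li_anchors(example + [(8, 1)]))  # ['more than 9 active cells', 'P8 doubly set']
```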
What this resolved: the "9-pair vs 5-pair" mystery
For a long stretch of this work the same routing key was producing what looked like two completely different bit patterns at different LABs — sometimes 10 byte flips at 5 pair positions, sometimes 9 byte flips at 9 pair positions. We thought these were structurally distinct encodings.
They aren't. They're the same 9-cell envelope counted two different ways:
- "5 pairs × 2 bytes = 10 flips" was counting only paired pairs and missing the P8 single-byte tail (it's actually 4 paired + P0 paired + P8 single = 9 cells)
- "9 pairs × 1 byte = 9 flips" was already counting cells correctly
The old reader's break after the first base hit was masking the difference between paired and alternating modes. Once we emitted one read entry per (pair, base) cell, the structure became obvious.
Route Synthesis: Island Hopping
Once the read-side codec was solid, the next question was the inverse: given a (src, dst) pair, can we synthesize a routing bitstream that matches Quartus cell-for-cell? A formula-driven synthesizer turned out to be the wrong frame. We discovered that:
Cyclone IV CRAM is interleaved, not topologically isomorphic to the chip. The routing-state CRAM cells for each LE live in non-overlapping physical regions far from the source LAB column, and cross-source fingerprint intersection is empty — there is no universal "source entry code" that generalizes across source LABs.
So route_synth (in fuzz/route_synth.py) takes a different tack: per-source corpus mining + bit-perfect snapshot replay. Each "green-zone island" is a (sx, sy) source LAB for which we have:
- A small corpus of lits_pair_X{sx}Y{sy}_to_* Quartus compiles
- A source fingerprint (cells present in 100% of routes from that source)
- A per-route delta (the remaining cells per dst, as raw (offset, bit) pairs)
For any dst already in the corpus, synth_route() emits fingerprint ∪ delta[dst] as raw cell flips and produces a bitstream that matches Quartus byte-for-byte in the routing region. For dsts outside the corpus, it falls back to the formula-based plan (C4/R4/R24 hops + LI envelope) and is gated by validate_safe_for_hardware() so it can't drive a LAB into an unknown LI activation pattern.
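The fingerprint/delta split is a set intersection followed by a set difference. The sketch below uses made-up cells and hypothetical function names (mine_island, replay); it only illustrates the bookkeeping, not the real route_synth data format.

```python
def mine_island(routes):
    """routes: dict mapping dst -> set of (offset, bit) cells for that route.

    Returns (fingerprint, deltas): cells common to every route, and the
    per-destination remainder.
    """
    fingerprint = set.intersection(*routes.values())
    deltas = {dst: cells - fingerprint for dst, cells in routes.items()}
    return fingerprint, deltas


def replay(fingerprint, deltas, dst):
    """Cells to flip for a destination that is already in the corpus."""
    return fingerprint | deltas[dst]


corpus = {
    "X12Y10": {(100, 7), (310, 6), (520, 1)},
    "X10Y5": {(100, 7), (310, 6), (940, 3), (941, 3)},
}
fingerprint, deltas = mine_island(corpus)
print(sorted(fingerprint))  # [(100, 7), (310, 6)]
print(sorted(deltas["X10Y5"]))  # [(940, 3), (941, 3)]
```

By construction, replaying fingerprint ∪ delta[dst] reproduces the original cell set for every destination in the corpus, which is why in-corpus routes come out bit-perfect.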
Three islands so far
| Island | Location | Routes | Fingerprint bits | Bit-perfect | Round-trip | Safe (synth/quartus) | Yellow zone |
|---|---|---|---|---|---|---|---|
| α | (10, 10) — interior | 31 | 6 | 31/31 | 31/31 | 31/31 / 31/31 | 3/3 |
| β | (10, 14) — M9K boundary (Y15 ghost row) | 11 | 11 | 11/11 | 11/11 | 11/11 / 11/11 | 3/3 |
| γ | (4, 4) — corner | 16 | 1 | 16/16 | 16/16 | 16/16 / 16/16 | 3/3 |
| Total | | 58 | | 58/58 | 58/58 | 58/58 / 58/58 | 9/9 |
A few non-obvious findings from the islands:
- The (4, 4) corner has the smallest fingerprint of all three (1 bit, R4_X11_Y5_N0_I3). The expectation that corner LABs would need more "edge bits" turned out to be wrong: the corner's per-route delta absorbs almost everything.
- An earlier "GND-tie hypothesis" (that the (10, 14) fingerprint's 11 bits were artifacts of unrouted lut2 inputs being tied to GND) was falsified by a controlled multi-input compile (purify_fingerprint.py). With all 4 lut2 inputs routed to real signals, the fingerprint slightly grew instead of shrinking.
- Several universal "always-on" structures were extracted from the corpus and are emitted unconditionally by emit_ops() for any inter-LAB route from a known source: a source-side R4 launch driver (R4_X{sx+1}_Y{sy} at I=1 and I=2), a source-column R24 broadcast hold (5 raw bits), and an LI source-driver MUX (P8B0 + P8B1) skipped only for adjacent ±1 horizontal hops. These were each mined as 100% across the corresponding lits_pair_* corpus.
Tests
fuzz/test_green_zone_harden.py auto-discovers all results/fingerprint_{sx}_{sy}.json snapshots and runs five checks per island (bit-perfect vs. Quartus, codec round-trip, safe-synth, safe-quartus, fingerprint drift) plus three "yellow zone" probes (dsts NOT in the corpus, must at least pass validate_safe_for_hardware). All three current islands pass with zero drift.
Mode-selection rule (still partially open)
We mined the 21 classified LABs to see what predicts paired vs alternating:
| Feature | Predictive? |
|---|---|
| Column move (dy != 0) | ✅ All 7 column moves → paired |
| Row move (dx != 0) | ⚠ Mixed: 8 alternating + 6 paired |
| Adjacency to non-LAB columns (X=5,9,14,15,20,27,30) | ❌ No correlation |
| dst_x parity | ❌ No correlation |
| LAB-list index distance | Weak correlation, exceptions exist |
So column moves are deterministic, but the row-move split is not yet derivable from a single feature. The most likely missing variable is the last R4/C4 hop's I-index before LI — that's what selects the LI input mux tier. Resolving this needs a richer routing-paths corpus with multi-LE designs.
Quick Start
Environment Setup
    # 1. Install Quartus Prime 21.1 Lite
    #    Download from Intel website, install to ~/intelFPGA_lite/21.1/

    # 2. Configure PATH
    export PATH=$PATH:$HOME/intelFPGA_lite/21.1/quartus/bin

    # 3. Enter project directory
    cd fuzz
Basic Operations
    # Generate baseline RBF
    python3 runner.py baseline

    # Pair-diff at (X=10, Y=10, N=0)
    python3 runner.py --node lut_inst lut_single 10 10 0

    # View database summary
    python3 analyze.py summary

    # View truth-table mapping for a specific LE
    python3 analyze.py lut_table 10 10 0

    # Export full database as JSON
    python3 analyze.py export
Advanced Operations
    # Sweep all 16 minterms at one position (calibrate the codec)
    python3 runner.py n_sweep 10 10

    # Pair-diff grid sweep across all 22 columns
    python3 runner.py pair_diff_grid

    # Parallel routing fuzzing (4 workers)
    python3 runner.py route_map_parallel 10 5 col

    # Batch routing fuzzing (multiple source positions, multiple directions)
    python3 runner.py route_map_batch --sources 4,10 29,10 10,17 --direction col --jobs 4

    # Read truth table from RBF
    python3 analyze.py read_tt design.rbf zero.rbf 10 10 0

    # Write truth table to RBF
    python3 analyze.py write_tt zero.rbf 0x8888 output.rbf 10 10 0
Known Pitfalls and Caveats
- Left-edge columns (X=3,4,6,7) have CRAM addresses below 0x10000, unlike other columns
- Quartus fit reports contain non-UTF-8 bytes; use errors="replace" when reading
- Do not run multiple fuzzing campaigns in parallel sharing the same work/ directory: they overwrite each other's files
- sof2rbf.py produces invalid bitstreams; always use quartus_cpf -c -o bitstream_compression=off
- Some LAB locations are invalid: combinations with X∈{3,4,6,7,8} and Y∈{12,13,14,16} are rejected by Quartus (those positions may be occupied by M9K or other hard blocks)
- Disk space: Phase 3's work/ directory grows very rapidly; clean it after each compilation (compile.py provides clean_work_dir())
- Cross-check against Quartus before chasing codec bugs: if a design doesn't work through the open-source toolchain, first build the exact same Verilog through Quartus and flash that reference RBF. If Quartus's version blinks and yours doesn't, then do the cell diff to see whether the difference is where you expected. In the M5 counter episode we spent a substantial stretch chasing five "real but irrelevant" low-level bugs because we skipped this 30-second experiment. The real problem wasn't in the codec at all: nextpnr-generic has no carry-chain primitive
- Self-loop sig-cache entries are unmineable with current templates: for routes where src LE == dst LE (an LE feeding back into one of its own dataX inputs), the two-LUT pair mining template cannot structurally represent src==dst, and the diff-vs-baseline strategy fails because Quartus re-fits between compiles (including pin reassignment). Any design that relies on self-feedback (canonical example: a ripple adder that doesn't use the carry chain) cannot produce a valid bitstream through the open-source toolchain until Phase 5.4 lands; use the Quartus reference RBF in the meantime
Debugging Journey: How We Closed the Hardware Loop
This section is a narrative for newcomers — it walks through the actual debugging sessions that turned the codec from "bit-perfect against Quartus" into "the silicon accepts our handcrafted bitstream and the LED responds to keys exactly as we designed." Every step here was a real problem we hit, and most of them were not obvious before we hit them.
The starting point: codec output looked perfect, but the FPGA refused it
After Phase 3 we had a RouteCodec that could read routing switches from any
Quartus-generated .rbf, replay them onto a blank baseline with apply_routing(),
and re-read them losslessly. Diffing our codec output against the original
Quartus RBF showed 0 CRAM byte differences — every configuration cell was
identical. Time to flash it on real hardware.
We connected a Heijin (黑金) AX301 board (EP4CE6F17C8 + USB-Blaster JTAG) and ran:
openFPGALoader -c usb-blaster results/rbf/lits_synth_X10Y10_to_X12Y10N0_datab.rbf
The flash appeared to succeed. But on the board, the LEDs started running a
"chasing lights" demo that we had never compiled. The FPGA was running
the vendor demo from the EPCS configuration flash, not our bitstream. As a
sanity check we tried flashing a known-good Quartus-built RBF — that one ran
correctly. We even tried a deliberate single-bit flip of a working Quartus
RBF (lits_pair_BITFLIP_test.rbf) — the FPGA rejected that one too and fell
back to the EPCS demo.
Conclusion: the Cyclone IV configuration state machine validates the bitstream as it loads. With even a single bit off, the chip silently boots from flash instead. There must be a CRC or checksum somewhere in the RBF, and our bit-perfect-CRAM trick was leaving it stale.
Discovering and reverse-engineering the CRC
We had no datasheet for the .rbf format, so we had to deduce the CRC algorithm purely from observed bitstreams. Here is how we did it.
Step 1 — Is the CRC stateful? A CRC could be one rolling value over the entire bitstream, or one independent value per fixed-size frame. We searched the corpus for frame pairs whose data bytes were identical: if their CRC bytes also matched, the algorithm was stateless (frame-independent). We found 1186 such identical-data frame pairs across the corpus, and in every single case the trailing CRC bytes matched. ✓ Stateless. The CRC is computed on each frame independently.
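The statelessness test can be sketched as a grouping problem: bucket frames by their data bytes and check that every bucket has a single trailer. This is an illustrative sketch with a hypothetical check_stateless helper; the demo uses 4-byte data fields instead of 208 to stay readable.

```python
from collections import defaultdict


def check_stateless(frames, data_len=208):
    """Group frames by data bytes and compare their trailing (CRC) bytes.

    Returns (number of data payloads seen more than once,
             True if every repeated payload always carries the same trailer).
    """
    trailers = defaultdict(list)
    for frame in frames:
        trailers[bytes(frame[:data_len])].append(bytes(frame[data_len:]))
    repeated = [t for t in trailers.values() if len(t) > 1]
    consistent = all(len(set(t)) == 1 for t in repeated)
    return len(repeated), consistent


frames_ok = [b"AAAA\x01\x02", b"BBBB\x03\x04", b"AAAA\x01\x02"]
frames_bad = [b"AAAA\x01\x02", b"AAAA\x09\x09"]
print(check_stateless(frames_ok, data_len=4))   # (1, True)
print(check_stateless(frames_bad, data_len=4))  # (1, False)
```

A rolling CRC over the whole bitstream would show up as the second case: identical data, different trailers.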
Step 2 — Find the frame size. RBF total size is 368,011 bytes. Subtracting the 32-byte 0xFF preamble and 59-byte 0xFF postamble leaves 367,920 = 1752 × 210. Bingo: 1752 frames of 210 bytes. Each frame likely has 208 data bytes followed by 2 CRC bytes (little-endian).
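The arithmetic and the resulting slicing can be checked directly. The constants below are the ones quoted in this step; split_frames is a hypothetical helper written for this illustration.

```python
RBF_SIZE = 368_011
PREAMBLE = 32    # leading 0xFF bytes
POSTAMBLE = 59   # trailing 0xFF bytes
FRAME = 210


def split_frames(rbf: bytes, preamble=PREAMBLE, postamble=POSTAMBLE, frame=FRAME):
    """Strip preamble/postamble and cut the body into fixed-size frames."""
    body = rbf[preamble:len(rbf) - postamble]
    if len(body) % frame:
        raise ValueError("body length is not a multiple of the frame size")
    return [body[i:i + frame] for i in range(0, len(body), frame)]


n_frames, remainder = divmod(RBF_SIZE - PREAMBLE - POSTAMBLE, FRAME)
print(n_frames, remainder)  # 1752 0
```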
Step 3 — The ΔCRC linear search. This is the heart of the trick. Instead of brute-forcing the absolute CRC of one frame against 65,536 polynomials (which gave us zero hits — too many degrees of freedom), we used a linear constraint:
- Construct two synthetic 208-byte payloads that differ in exactly one byte (e.g., one payload sets only byte 100 = 0x10; the other sets byte 100 = 0x10 and byte 101 = 0x10).
- For each candidate polynomial × bit-direction variant, the CRC difference between the two payloads is determined entirely by the polynomial — no need to know the init value.
- Require that the same polynomial satisfies both ΔCRCs simultaneously (dual-constraint). This collapses 65,536 candidates × 4 bit-directions down to almost nothing.
Two polys survived: the standard 0x8005 and a low-weight collision 0x0006. 0x8005 reflected is 0xA001 (the right-shift form). That's CRC-16-IBM.
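The reason the init value drops out is that a CRC register update is linear over GF(2): for two equal-length payloads a and b, crc(a) XOR crc(b) equals the init-0 CRC of a XOR b. The sketch below demonstrates the search under simplifying assumptions that are not from the project: only the right-shift (reflected) bit direction is tried, payloads are 8 bytes instead of 208 to keep the demo fast, and the "observed" CRCs are generated locally from a stand-in polynomial and init.

```python
def crc16_reflected(data: bytes, poly: int, init: int) -> int:
    """Bitwise right-shift CRC-16 with a configurable polynomial and init."""
    crc = init
    for b in data:
        crc ^= b
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
    return crc


def delta_crc(diff: bytes, poly: int) -> int:
    """CRC difference of two equal-length payloads whose XOR is diff."""
    # With init 0 the register stays 0 across leading zero bytes, so skip them.
    return crc16_reflected(diff.lstrip(b"\x00"), poly, 0)


def surviving_polys(constraints, candidates=range(0x10000)):
    """Keep polynomials that reproduce every (diff, observed delta) pair."""
    return [p for p in candidates
            if all(delta_crc(d, p) == dc for d, dc in constraints)]


# Stand-ins for the unknown hardware parameters (used only to fake observations).
SECRET_POLY, SECRET_INIT = 0xA001, 0xFE54


def observed(payload) -> int:
    return crc16_reflected(bytes(payload), SECRET_POLY, SECRET_INIT)


def xor_bytes(a, b) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


base = bytearray(8)
base[4] = 0x10
plus1 = bytearray(base)
plus1[6] = 0x10
plus2 = bytearray(base)
plus2[7] = 0x01

constraints = [
    (xor_bytes(base, plus1), observed(base) ^ observed(plus1)),
    (xor_bytes(base, plus2), observed(base) ^ observed(plus2)),
]
survivors = surviving_polys(constraints)
print(hex(SECRET_POLY) in [hex(p) for p in survivors])  # True
```

Note that the search never touches SECRET_INIT: the two constraints are matched with init 0 throughout.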
Step 4 — Brute-force the init value. The polynomial alone doesn't fix the
CRC — there's also an initial register value. Once we knew the polynomial, we
took 1316 frames in the corpus that contained all-zero data and required
crc16(zeros, poly=0x8005, init=?) == observed_value (0x7d9a). Only one init
satisfied: 0xFE54.
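With the polynomial fixed, the init search is a plain loop over 65,536 values. The sketch below is hedged in two ways: find_inits is a hypothetical helper, and the demo target is generated locally from a 4-byte zero payload rather than taken from the 0x7d9a value observed in the corpus, so that it runs quickly.

```python
def crc16_reflected(data: bytes, poly: int, init: int) -> int:
    """Bitwise right-shift CRC-16 with a configurable polynomial and init."""
    crc = init
    for b in data:
        crc ^= b
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
    return crc


def find_inits(data: bytes, observed: int, poly: int = 0xA001):
    """Return every 16-bit init value that maps data to the observed CRC."""
    return [init for init in range(0x10000)
            if crc16_reflected(data, poly, init) == observed]


zeros = bytes(4)                                    # shortened all-zero payload
target = crc16_reflected(zeros, 0xA001, 0xFE54)     # synthetic "observation"
inits = find_inits(zeros, target)
print([hex(i) for i in inits])  # ['0xfe54']
```

Because 0xA001 has its top bit set, each shift step is a one-to-one map on the 16-bit register, so exactly one init can match for a given payload. That is consistent with the single survivor reported above.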
Step 5 — End-to-end verification. With the formula nailed down:
    def crc16_rbf(data208: bytes) -> int:
        crc = 0xFE54
        for b in data208:
            crc ^= b
            for _ in range(8):
                crc = (crc >> 1) ^ 0xA001 if (crc & 1) else (crc >> 1)
        return crc
…we ran it across all 1752 frames of a known-good Quartus RBF. Result: 1727 frames matched, 25 failed. The 25 failures were a contiguous block — frames 0..24. That's the bitstream header (sync words, config registers, device-wide options). The header is not CRC-protected; only frames 25..1751 (the CRAM frames) carry an enforced CRC. Once we excluded the header from the patcher, every CRAM frame's CRC reproduced exactly.
The full spec lives in bitstream.crc16_rbf_frame() and patch_rbf_crc().
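For orientation, a patcher built on the layout described above could look like the sketch below. It is not the code in bitstream.py: patch_crc_sketch is a hypothetical name, and the little-endian CRC placement at frame offsets +208/+209 is taken from Step 2, where it was stated as likely rather than proven.

```python
PREAMBLE, FRAME, DATA_LEN = 32, 210, 208
HEADER_FRAMES = 25  # frames 0..24 are the unprotected header


def crc16_rbf(data208) -> int:
    crc = 0xFE54
    for b in data208:
        crc ^= b
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if (crc & 1) else (crc >> 1)
    return crc


def patch_crc_sketch(rbf: bytes, n_frames: int = 1752,
                     header_frames: int = HEADER_FRAMES) -> bytes:
    """Recompute the trailing CRC of every CRAM frame; leave header frames alone."""
    out = bytearray(rbf)
    for idx in range(header_frames, n_frames):
        start = PREAMBLE + idx * FRAME
        crc = crc16_rbf(out[start:start + DATA_LEN])
        out[start + DATA_LEN:start + FRAME] = crc.to_bytes(2, "little")  # assumed byte order
    return bytes(out)


# Tiny synthetic image: preamble + 27 frames + postamble.
demo = bytearray(b"\xff" * 32 + bytes(27 * FRAME) + b"\xff" * 59)
demo[PREAMBLE + 26 * FRAME + 5] = 0x42  # touch one data byte in a CRAM frame
patched = patch_crc_sketch(bytes(demo), n_frames=27)
```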
The codec / CRC byte overlap (and how we fixed it)
Plugging the CRC patcher into synth_route() and re-running the green-zone
regression test exploded: 58/58 routes bit-perfect → 0/N. The patcher had
broken the codec.
Why? The CRC bytes live at offsets +208 and +209 of each 210-byte frame.
But the codec scans LAB columns using a 210-byte period that is not
frame-aligned (column bases vs. frame starts differ). So the codec's
"slot 1" or "slot 2" reads occasionally land exactly on bytes that the CRC
patcher just rewrote — and the diff against the zero baseline picked up the
CRC difference as if it were a routing change.
The fix is conceptually simple: before computing any routing diff,
mask out the CRC byte positions so they look identical to the baseline.
That's mask_rbf_crc_bytes() in bitstream.py, called automatically at the
top of read_switches(). With that in place we could turn patch_crc=True
on by default in synth_route() and the green-zone regression returned to
58/58.
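The masking idea can be sketched as copying the baseline's CRC bytes into the design before diffing. This is an illustration only: mask_crc_bytes_sketch is a hypothetical name, and whether the real mask_rbf_crc_bytes() also covers the 25 header frames is an assumption left open here (the sketch masks CRAM frames only).

```python
PREAMBLE, FRAME, DATA_LEN = 32, 210, 208


def mask_crc_bytes_sketch(design: bytes, zero: bytes,
                          n_frames: int = 1752, header_frames: int = 25) -> bytes:
    """Overwrite the design's per-frame CRC bytes with the baseline's,
    so a later byte diff never reports them as changed cells."""
    out = bytearray(design)
    for idx in range(header_frames, n_frames):
        crc_at = PREAMBLE + idx * FRAME + DATA_LEN
        out[crc_at:crc_at + 2] = zero[crc_at:crc_at + 2]
    return bytes(out)


# Two-frame synthetic example with no header frames.
zero = bytes(PREAMBLE + 2 * FRAME)
design = bytearray(zero)
design[PREAMBLE + 10] = 0x80            # a real configuration change
design[PREAMBLE + 208] = 0xAB           # CRC bytes rewritten by the patcher
design[PREAMBLE + 209] = 0xCD
masked = mask_crc_bytes_sketch(bytes(design), zero, n_frames=2, header_frames=0)
diff = [i for i, (a, b) in enumerate(zip(masked, zero)) if a != b]
print(diff)  # [42]
```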
First hardware loop closure
With the CRC patcher integrated, we re-flashed our codec-built routing RBF. This time the JTAG load completed and the LEDs stayed quiet — the EPCS demo did not take over. The FPGA was running our bitstream. Loop closed.
Then we tried LutCodec.write_tt(minterm_0_baseline, mask=0xFFFF) —
overwrite the LUT truth table with constant-1. Flash, verify: LED ON.
Flash the Quartus-built minterm_0 (mask 0x0000, constant-0) as a control:
LED OFF. Opposite states confirm LutCodec.write_tt reaches silicon.
Surprise observation: when we ran patch_rbf_crc() on the LutCodec output,
it changed 0 bytes. The CRC was already valid. Why? Because LutCodec's
bit patterns were trained from Quartus pair-diffs that already include the
CRC byte changes — so writing a new TT implicitly produces a CRC-correct
bitstream. RouteCodec doesn't have that property because it uses
RE-derived formulas, not pair-diff replays.
Hardware-probing the AX301 pin map
To build a real functional demo we needed LED0 = f(K1, K2, K3, K4) to
behave correctly. But our config.py had D = PIN_E15 # RESET — labeled
as a reset pin, not a key. Earlier hardware experiments (LED = A & B & C & D)
had shown LED stuck ON regardless of key presses, hinting that the pin labels
might be wrong. We didn't trust the AX301 schematic PDF (and couldn't easily
get one), so we built a silicon pin scanner.
The technique: write a one-line Verilog assign LED = K, compile it 4 times
with K bound to a different candidate pin (PIN_E16, PIN_M16, PIN_M15,
PIN_E15), all driving LED0 = PIN_G15. Flash one at a time. Press all 4
physical keys after each flash. Whichever key turns the LED off is that
pin. (The keys are active-low, so pressing pulls the input to GND, and
assign LED = K propagates that 0 to LED0.)
Four flashes, four answers (pin_probe.py):
| PIN | Physical key |
|---|---|
| PIN_E15 | KEY1 (was mislabeled "RESET") |
| PIN_E16 | KEY2 |
| PIN_M16 | KEY3 |
| PIN_M15 | KEY4 |
| PIN_G15 | LED0 (active-high) |
The D input was wired to KEY1 all along — not a reset pin. With this
silicon-verified table we updated config.py and recorded the map in memory.
The final functional demo and the XOR-delta footgun
Goal: "Hold K1+K2 OR hold K3+K4 → LED on, otherwise LED off." This uses all 4 inputs and gives a satisfying physical interaction.
With FUZZ_PINS A=K2, B=K3, C=K4, D=K1 and active-low keys, the function is
Q = (¬D ∧ ¬A) ∨ (¬B ∧ ¬C). Computing the truth-table mask bit by bit gives
0x0357 (bits {0,1,2,4,6,8,9} set).
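The bit-by-bit computation can be reproduced mechanically. The sketch assumes the index convention implied later in this section, where the input pattern (D,C,B,A) selects TT bit D·8 + C·4 + B·2 + A; tt_mask is a hypothetical helper.

```python
def tt_mask(fn) -> int:
    """Build a 16-bit LUT mask from a boolean function of inputs (a, b, c, d)."""
    mask = 0
    for i in range(16):
        a, b, c, d = i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1
        if fn(a, b, c, d):
            mask |= 1 << i
    return mask


def demo_function(a, b, c, d):
    # Q = (not D and not A) or (not B and not C), with active-low keys
    return (not d and not a) or (not b and not c)


print(hex(tt_mask(demo_function)))              # 0x357
print(hex(tt_mask(lambda a, b, c, d: a and b)))  # 0x8888
print(hex(tt_mask(lambda a, b, c, d: a ^ b ^ c ^ d)))  # 0x6996
```

The last two lines reproduce the A & B and XOR masks used in the earlier LUT read/write examples, which is a quick sanity check on the bit ordering.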
We wrote that mask onto a minterm_0_X10_Y10_N0.rbf baseline using
LutCodec.write_tt(), patched the CRC, flashed, and started pressing keys.
5 out of 6 cases worked. One case was wrong: pressing all 4 keys
simultaneously gave LED OFF, but our function says it should be ON.
Round-trip read of the codec output returned 0x0357 — exactly what we wrote.
So why did hardware say bit 0 was 0?
The bug: LutCodec.write_tt(base, mask) is not absolute. It is XOR-delta
against base. The codec computes which CRAM cells differ from the true
0x0000 baseline for mask, and XORs those cells onto whatever base you
pass it. The hardware truth table is therefore base_tt XOR mask, not mask.
Our base was minterm_0_X10_Y10_N0.rbf. Look at what minterm_0 actually
contains: it's the design Q = ~A & ~B & ~C & ~D, which outputs 1 only when
all inputs are 0. So minterm_0's LUT TT is 0x0001 — bit 0 is already set.
Hardware TT after our write was therefore 0x0001 XOR 0x0357 = 0x0356. Bit 0
of 0x0356 is 0. That's exactly the case where all 4 keys are pressed —
input pattern (D,C,B,A) = (0,0,0,0) → TT[0] → 0 → LED OFF. The bug aligned
perfectly with the symptom.
read_tt is symmetric (it also returns the delta against base), so the
round-trip read couldn't catch the bug — both writer and reader use the same
XOR convention.
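A toy model at the truth-table level shows why the round trip is blind to this. The two functions below are stand-ins that mimic the XOR-delta convention described above; they are not the real LutCodec methods, which operate on CRAM cells.

```python
def toy_write_tt(base_tt: int, mask: int) -> int:
    """Truth table the hardware ends up with after an XOR-delta write."""
    return base_tt ^ mask


def toy_read_tt(hw_tt: int, base_tt: int) -> int:
    """XOR-delta read: reports the difference against the same base."""
    return hw_tt ^ base_tt


BASE_TT = 0x0001   # minterm_0: only TT[0] is set
WANTED = 0x0357

hw_tt = toy_write_tt(BASE_TT, WANTED)
print(hex(hw_tt))                          # 0x356  (what the silicon sees)
print(hex(toy_read_tt(hw_tt, BASE_TT)))    # 0x357  (what the round trip reports)
```

The read cancels the base exactly as the write introduced it, so self-consistency holds while the absolute truth table is off by base_tt.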
The fix is one line: write mask ^ base_tt instead of mask.
    TARGET = 0x0357
    MASK = TARGET ^ 0x0001  # compensate for minterm_0's TT[0]=1
Reflash. Press all 4 keys. LED ON. Press just K1+K2: LED ON. Press K3+K4: LED ON. Press anything else (single key, K1+K3, K2+K4, etc.): LED OFF. Full truth table verified by physical key presses.
Bonus discovery: EP4CE6 and EP4CE10 are the same physical die
A natural question after Phase 3 was: "could we cross-validate our CE6
findings against EP4CE10, since they share the F17 package and are rumored
to be the same silicon?" Rather than guess, we ran the cleanest possible
experiment via fuzz/cross_device_diff.py:
- Compile a one-line Verilog (assign LED = K) with identical pin assignments under two device targets: DEVICE = EP4CE6F17C8 and DEVICE = EP4CE10F17C8
- Byte-diff the resulting RBFs.
Result:
| | EP4CE6F17C8 | EP4CE10F17C8 |
|---|---|---|
| Size | 368,011 bytes | 368,011 bytes |
| SHA1 | b47e804074b05d3d… | b47e804074b05d3d… |
| Byte differences | 0 | |
Not "almost identical" — byte-for-byte identical, including the header bytes that carry the device ID. Altera did not even add a CRAM bit to gate the disabled region. The "6,272 LE vs 10,320 LE" difference exists entirely as a software constraint inside Quartus; the silicon is the same metal masks, the same fuses, the same device ID in the bitstream.
Why this matters strategically. Re-fuzzing CE10 to rebuild Phase 1/2
data would be 100% redundant — the SQLite would be a duplicate. But the
result unlocks a much more powerful trick: CE10 is a "jailbroken Quartus"
for CE6. Whenever Quartus refuses to place logic in a region the CE6
software profile considers off-limits (M9K boundaries, the huge X=13 / X=26
columns, regions reserved for the larger LE pool), we can switch the
project's DEVICE to EP4CE10F17C8, force the placement, compile, and feed
the resulting RBF straight back into the same RouteCodec / LutCodec / CRC
patcher — because the underlying CRAM is unchanged. The bits the CE6
software refuses to generate live in the same place; we just need a
different software profile to coax them out.
The full jailbreak: CE6's fabric map is a lie
2026-04-07. Armed with the "same die" result, we set out to actually
touch the silicon Altera hides. The method is embarrassingly simple:
write a trivial Verilog that locks one cycloneive_lcell_comb to a
specific LCCOMB_Xa_Yb_N0 inside a DEVICE = EP4CE10F17C8 project, run
quartus_fit, and read the fitter verdict. "Fitter was successful" =
that coordinate physically exists in the fabric. "illegal location assignment" = Quartus is (still) refusing. By sweeping a grid we get a
yes/no map of what is actually on the die.
The results are brutal:
| CE6 claims (config.py / CLAUDE.md) | Reality on CE10 probe |
|---|---|
| LAB_X = [3,4,6,7,8,10,11,12,13,16,17,18,19,21,22,23,24,25,26,28,29,31] (22 cols) | 28 cols: add X=5, 9, 14, 30, 32, 33 |
| NON_LAB_X = {5, 9, 14, 15, 20, 27, 30} (7 cols M9K/DSP/PLL) | Only {15, 20, 27}; the other four are real LABs |
| LAB_Y = [2..14, 16..21] (19 rows, Y=15 skipped) | 20 rows: Y=15 is a real LAB row at X ∈ {10,14,16,21,25,30,31,32,33,...} |
| Total LABs: 392 | ~520+ |
| Total LEs: 6,272 | 10,320 (matches CE10 datasheet exactly) |
In other words, four of the seven columns CE6 marks as "non-LAB" are lies; one entire row (Y=15) is a lie; the two rightmost columns (X=32,33) are a lie. The fitter has a hard-coded whitelist that deletes ~40% of the die and relabels the chip as a smaller part.
Live-LE proof by XOR chain. Claiming a coordinate exists and
claiming that LE is functional are two different things — rebinning is
often driven by yield failures in specific columns. To separate the two
we built a single-bitstream dead-cell scanner
(jailbreak/scanC_gen.py):
chain[0] = K1 ^ K2
for each forbidden LE i:
(* keep, preserve *)
chain[i+1] = cycloneive_lcell_comb(dataa=chain[i], lut_mask=0xAAAA) // identity
LED = chain[N]
Every LUT passes its dataa straight through. The math reduces to
LED = K1 ^ K2 if and only if every cell in the chain behaves. A
single stuck-at, broken routing channel, or misconfigured LUT mask flips
the output parity on at least one of the four key combinations, and the
LED reports the damage.
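The scanner's logic can be simulated in software to see why a single bad cell is visible at the LED. This is a behavioral sketch with hypothetical helpers (lut, run_chain, chain_ok), not the generator in jailbreak/scanC_gen.py, and it models only stuck-at faults.

```python
def lut(mask: int, a: int, b: int = 0, c: int = 0, d: int = 0) -> int:
    """Evaluate a 4-input LUT: the inputs select one bit of the 16-bit mask."""
    return (mask >> (a | b << 1 | c << 2 | d << 3)) & 1


def run_chain(k1: int, k2: int, n_cells: int, stuck=None) -> int:
    """Pass K1 ^ K2 through n_cells identity LUTs (mask 0xAAAA).

    stuck: optional (cell_index, value) forcing one cell's output.
    """
    sig = k1 ^ k2
    for i in range(n_cells):
        sig = lut(0xAAAA, sig)
        if stuck is not None and stuck[0] == i:
            sig = stuck[1]
    return sig


def chain_ok(n_cells: int, stuck=None) -> bool:
    """True if the LED matches K1 ^ K2 on all four key combinations."""
    return all(run_chain(k1, k2, n_cells, stuck) == (k1 ^ k2)
               for k1 in (0, 1) for k2 in (0, 1))


print(chain_ok(1840))                  # True
print(chain_ok(1840, stuck=(100, 0)))  # False
print(chain_ok(1840, stuck=(100, 1)))  # False
```

A stuck-at cell makes the output constant, and a constant can never match K1 ^ K2 on all four key combinations, so one flash covers the whole chain.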
Three phases, three flashes on the AX301:
| Phase | Scope | LEs in chain | Hardware result |
|---|---|---|---|
| A | X ∈ {32,33}, Y ∈ [2..21], N=0 | 40 | ✅ full truth table match |
| B | X ∈ {32,33}, Y ∈ [2..21], N ∈ {0,2,…,30} | 640 | ✅ |
| C | X ∈ {5,9,14,30,32,33} (hidden cols) + Y=15 row, full N | 1,840 | ✅ |
2,480 distinct CE6-hidden LEs, four key combinations each, every single one behaves exactly as pure silicon should. This particular AX301 board is not a rebin reject — it is a fully functional CE10 die that Altera sold as a CE6.
What this means for the project: the existing CRAM / C4 / R4 / LI
models do not need to be thrown out. They just need to grow. Each of
the six newly-discovered LAB columns needs one COLUMN_BASE entry, and
the Y=15 row needs to be added to LAB_Y; every other part of the
model — pair spacing, slot/group encoding, LI mode taxonomy, CRC frame
layout — carries over because the silicon underneath is identical. The
routable fabric grows by ~32%, the addressable CRAM by 0 bytes.
We deliberately do not auto-enable the expanded map in bitstream.py
yet. The expansion must be gated on: (1) per-new-column CRAM base
mining via baseline-diff, and (2) a green-zone regression on at least
one new source in X ∈ {32,33} to confirm the RouteCodec invariants hold
at the fabric edge. Both are mechanical follow-ups — no new physics.
Cross-die comparison (EP4CE15 / EP4CE22) is a different question entirely — those are likely "Die B" with different column counts and would require re-deriving column bases. We are deliberately not pursuing them yet: finishing CE6 routing coverage is a faster path to a working open toolchain than chasing a wider device family.
What we learned
- Trust silicon, not datasheets. The AX301 pin labels in our config were wrong; a 4-flash hardware probe gave the correct map in 5 minutes.
- Bit-perfect ≠ flash-clean. A bitstream can be byte-identical in CRAM and still get rejected because of header CRC, frame CRC, or other gating structures the chip checks during configuration.
- Use linear constraints for unknown CRCs. Brute-forcing 65,536 polynomials against an absolute CRC fails (too many free parameters). Brute-forcing against a difference of two carefully chosen frames collapses the search instantly.
- Read and write codecs must use the same baseline convention. A round-trip read can pass while the absolute hardware behavior is wrong, if both sides share the same XOR-delta assumption. Always validate against physical behavior, not just self-consistency.
- A working LED on hardware is worth a thousand passing unit tests. Every bug above slipped past our software checks and only revealed itself when the LED on the board did the wrong thing.
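The linear-constraint bullet above rests on a standard CRC property: for two equal-length messages, the unknown init and final-XOR values cancel in `crc(a) ^ crc(b)`, so the difference depends only on the polynomial and on `a ^ b`. Below is a minimal sketch of that idea, assuming a non-reflected (MSB-first) CRC-16. The function names, the sample frames, and the "secret" parameters are illustrative and are not the chip's actual CRC.

```python
def crc16(data: bytes, poly: int, init: int = 0, xorout: int = 0) -> int:
    """Bitwise MSB-first CRC-16 (no reflection)."""
    reg = init
    for byte in data:
        reg ^= byte << 8
        for _ in range(8):
            reg = ((reg << 1) ^ poly) & 0xFFFF if reg & 0x8000 else (reg << 1) & 0xFFFF
    return reg ^ xorout


def candidate_polys(frame_a: bytes, crc_a: int, frame_b: bytes, crc_b: int) -> list[int]:
    """Polynomials consistent with the CRC difference of two equal-length frames."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have equal length")
    delta = bytes(x ^ y for x, y in zip(frame_a, frame_b))
    # With init = 0 the register stays 0 across leading zero bytes, so skip them.
    delta = delta.lstrip(b"\x00")
    target = crc_a ^ crc_b
    return [p for p in range(1 << 16) if crc16(delta, p) == target]


# Hypothetical "unknown" parameters, used only to fabricate test data.
SECRET = dict(poly=0x8005, init=0xBEEF, xorout=0x1234)
frame_a = bytes(208)                 # 208 data bytes, as in a 210-byte frame
frame_b = bytes(207) + b"\x01"       # differs from frame_a in one bit
crc_a = crc16(frame_a, **SECRET)
crc_b = crc16(frame_b, **SECRET)
found = candidate_polys(frame_a, crc_a, frame_b, crc_b)
print([hex(p) for p in found])       # ['0x8005']
```

In this toy case the two frames differ only in the last data bit, so the CRC difference is the polynomial itself and the 65,536-way search has exactly one survivor, without ever knowing init or xorout.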
The codec stack now has a closed loop:
Verilog idea → LutCodec.write_tt → patch_rbf_crc → openFPGALoader
↓
real EP4CE6 silicon
↓
LED behaves as designed
From this point forward, we no longer need to round-trip through Quartus to validate codec changes — we can write the bitstream ourselves and watch the chip respond.
The CRC ghost: how a "working" mapping turned out to be fake (2026-04-08)
Late in the project we caught ourselves celebrating a mapping that wasn't real. Here's what happened, in plain terms.
The bitstream is laid out as 1727 "frames" of 210 bytes each. Inside every frame, the last 2 bytes are a CRC-16 checksum — the chip uses them to detect flipped bits during loading. We knew that.
The chip's configuration RAM is also organized in a per-column grid. Each column repeats every 210 bytes too (coincidence driven by the same hardware geometry). It was tempting to assume "column position 184" inside a pair was the same as "frame position 184". It isn't — the column grid and the frame grid start at different bytes, so they are shifted relative to each other. A cell that looks like "pair 13, position 184" in column coordinates can physically be position 208 of some frame — which is a CRC byte.
For a while we'd been mining "R24 switch bits" and "flip-flop control bits" by XOR-diffing two Quartus-compiled .rbf files. Any single logic change causes Quartus to recompute the CRC for the affected frames, so the XOR diff picks up both the real bit that changed and the two CRC bytes of that frame. We never noticed, and kept cataloguing the CRC bytes as if they were real configuration cells. The "pair delta of 419-420 bytes" we had celebrated as the spacing of the R24 switch structure was actually 2 × 210 = 420: the distance between the CRC bytes of frames two apart.
The smoking gun came from a one-line check: take every "mapped" cell
in the codec and ask (offset - 32) % 210 >= 208. If that's true,
the cell is physically a CRC byte, not a configuration byte. The
results were devastating:
- R24 I=0 fixed offsets: 56 of 56 cells (100%) were CRC bytes
- FF async-reset control cells: 448 of 476 cells (94%) were CRC
- FF enable control cells: 168 of 168 cells (100%) were CRC
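The one-line check is easy to turn into an audit over any table of mapped cells. This is a sketch rather than the project's code: the constants come from the text, but the `CRAM_START` guard (treating everything below offset 5282 as unframed header) and the sample cells are our assumptions for illustration.

```python
FRAME_SIZE = 210      # bytes per frame
CRC_BYTES = 2         # trailing CRC-16 in each frame
FRAME_PHASE = 32      # the "- 32" in the check; note 5282 % 210 == 32
CRAM_START = 5282     # assumed start of the framed CRAM band


def is_crc_byte(offset: int) -> bool:
    """True if an absolute .rbf offset lands on a frame's CRC bytes."""
    if offset < CRAM_START:
        return False  # assumption: the header band is not framed
    return (offset - FRAME_PHASE) % FRAME_SIZE >= FRAME_SIZE - CRC_BYTES


def audit(cells):
    """Split mapped (offset, bit) cells into real config cells and CRC ghosts."""
    real = [c for c in cells if not is_crc_byte(c[0])]
    ghosts = [c for c in cells if is_crc_byte(c[0])]
    return real, ghosts


# Hypothetical mapped cells, for illustration only.
cells = [(5282 + 184, 3), (5282 + 208, 0), (5282 + 210 + 209, 7)]
real, ghosts = audit(cells)
print(len(real), "real,", len(ghosts), "CRC ghosts")   # 1 real, 2 CRC ghosts
```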
The route synthesizer still passed all 1725 regression tests anyway.
At the very end of generating any .rbf we call patch_rbf_crc(),
which recomputes every frame CRC from the frame data. So the R24
writes flipped CRC bytes, and then patch_rbf_crc immediately
overwrote those same bytes with the correct values. The writes were
effectively no-ops. The real R24 bits were being emitted through
other paths (probably buried inside the LI and C4 envelopes we also
mine), which is why routes still worked on silicon. The tests looked
bit-perfect, but only because patch_rbf_crc was idempotently
cleaning up after a broken write path.
We disabled the affected FASM directives with a hard error, wrote the
finding into project memory as a critical warning, and queued a
re-audit of every other mapping in the codec using the same
(off-32) % 210 >= 208 filter.
The fix: re-mining FF control bits through a CRC-normalized diff (2026-04-08)
A few hours after the CRC ghost finding, we re-ran the FF mining with
two changes: every .rbf was passed through patch_rbf_crc before
the XOR diff (so CRC bytes cancel out cleanly), and we compiled each
variant multiple times to average over Quartus's placement choices.
Round 1 compiled {base, arst, ena} with 8 different SEED values at a
fixed output pin. A pleasant surprise: for a trivial D flip-flop design
the base-vs-base diff was zero bytes across all 8 seeds. The
"fitter noise wall" that had blocked earlier FF mining was specific to
loaded designs with routing competition — not an inherent property of
Quartus. After subtracting CRC bytes, arst and ena each produced ~75-95
cells that flipped in every single seed.
Round 2 repeated the experiment with a fixed seed but 10 different output pins, scattering the FF to 10 different LABs across the die. Intersecting the round-1 and round-2 universal sets gave cells that are both seed-deterministic AND placement-independent — i.e. truly device-global FF control bits.
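The two-round intersection can be sketched with plain sets. Note one swapped-in technique: instead of normalizing every file through patch_rbf_crc before diffing, this sketch simply skips CRC byte positions in the CRAM band, which has the same effect on the diff. All names and the synthetic data are illustrative.

```python
from functools import reduce


def diff_cells(a: bytes, b: bytes, cram_start=5282, frame_size=210, crc_bytes=2):
    """Set of (offset, bit) cells that differ, ignoring frame CRC bytes."""
    cells = set()
    for off, (x, y) in enumerate(zip(a, b)):
        if x == y:
            continue
        in_crc = off >= cram_start and (off - 32) % frame_size >= frame_size - crc_bytes
        if in_crc:
            continue
        d = x ^ y
        cells.update((off, bit) for bit in range(8) if (d >> bit) & 1)
    return cells


def universal(diffs):
    """Cells present in every diff of one round."""
    return reduce(set.intersection, diffs)


def flip(buf: bytes, cells) -> bytes:
    out = bytearray(buf)
    for off, bit in cells:
        out[off] ^= 1 << bit
    return bytes(out)


# Synthetic data: two cells flip in every build, plus per-build noise and a CRC byte.
base = bytes(6000)
common = {(73, 1), (5300, 4)}
crc_noise = {(5282 + 208, 0)}
round1 = [diff_cells(base, flip(base, common | crc_noise | {(5400 + s, 0)})) for s in range(8)]
round2 = [diff_cells(base, flip(base, common | crc_noise | {(5600 + p, 2)})) for p in range(10)]
global_bits = universal(round1) & universal(round2)
print(sorted(global_bits))   # [(73, 1), (5300, 4)]
```

Restricting the CRC mask to the CRAM band matters here: the same modulo test applied to header offsets would wrongly discard bytes such as 1080 and 1081.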
The final count was 61 arst + 61 ena cells, with arst ∩ ena = 48
shared "any-FF-with-ctrl" enables and 13 mode-specific bits on each
side. Most live in the bitstream header band (offsets below 5282, which
patch_rbf_crc never touches), arranged as a compact bitfield:
- offsets 73-74 hold the main FF mode byte; arst uses bits {1,5,7} and {0,1,3,4,6}, ena uses a different subset of the same two bytes
- offsets 42-52, 710-729, and 1074-1081 contain supporting bitfields
The 13 CRAM-band cells per mode stayed at the same absolute offsets across all 10 placements, which means Quartus always routes the FF control-signal tree through the same fixed global clock/reset network — those cells configure that network, not any specific LAB.
This gives us a three-layer picture of how FF features are encoded:
- Device-global ctrl feature bits — now mined (61 per mode)
- Per-LE mode bit ("this LE's FF uses arst") — still open
- Per-LE FF presence bit — partly captured in the LUT codec
FFCodec in fuzz/bitstream.py was rewritten to load
results/ff_remine_final.json at import time and flip the 61 absolute
offsets directly. A round-trip test confirmed that
FFCodec.write_arst(base) produces a byte-identical match against a
real Quartus-compiled arst .rbf on all 61 global bits. FASM's DFF.ARST
and DFF.ENA directives are still disabled pending per-LE mining.
The old column-relative _FF_ARST_CELLS and _FF_ENA_CELLS tables are
kept in the file as deprecated stubs, with comments pointing at the
CRC-ghost memory note, so anyone reading the history can see both the
wrong and the right answer side by side.
Phase 4.5 — Scaling the FASM chain to real designs (2026-04-09)
Up to Phase 4 we had proven that a FASM source file like
X10Y10N0.LUT = 0x8888 could be compiled, flashed, and run on real
AX301 silicon. That was a huge milestone, but it had a hidden limit:
our signature cache route_cells.json (1725 entries) was mined from
only 15 "green-zone" LAB positions, all with source-N = 0. A real
CPU design like NEORV32 has thousands of logic cells scattered across
the whole chip, with flip-flops at every even N slot, with feedback
paths and routing hubs that never look anything like our 15 training
islands. How do we know the chain generalises?
Phase 4.5 is the scaling experiment. Its goal: can the FASM chain reproduce an edge from a real NEORV32 compile, byte-for-byte, on silicon, including in the "jailbreak" columns that Quartus officially forbids?
What is a "signature cache"?
Before we answer that, it's worth explaining what our cache actually
is. It is not an analytic formula. When we say "the bits for route
X5Y3N4 → X4Y3N6.datad are the following 144 CRAM cells", we don't
compute those 144 cells from geometry — we observed them in a real
RBF that Quartus produced for exactly that route, and we stored the
observation in a giant JSON dictionary. The cache key is the route
tuple; the value is the list of (byte_offset, bit_position) pairs
that Quartus flipped.
This is the same idea as IceStorm's fuzzing approach, but applied to individual routing edges rather than chip-wide features: instead of asking "what bits does this feature control?", we ask "what bits does this specific source-to-destination wire require?". At flash time we don't need to know why those bits are what they are — we just copy them out of the cache and XOR them into the baseline.
The catch: the cache only knows routes it has seen. If you want it to cover NEORV32, you have to compile every edge NEORV32 uses at least once.
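In code, the lookup-and-XOR step described above is only a few lines. This is a toy sketch: the tuple layout `(sx, sy, sn, dx, dy, dn, port)`, the function name, and the three-cell signature are assumptions for illustration, and real entries carry on the order of 144 cells.

```python
Route = tuple[int, int, int, int, int, int, str]   # (sx, sy, sn, dx, dy, dn, port)
Cell = tuple[int, int]                             # (byte_offset, bit_position)


def apply_route(baseline: bytes, cache: dict[Route, list[Cell]], route: Route) -> bytes:
    """XOR a cached route signature into a copy of the baseline bitstream."""
    if route not in cache:
        raise KeyError(f"route {route} was never observed; compile it first")
    out = bytearray(baseline)
    for offset, bit in cache[route]:
        out[offset] ^= 1 << bit
    return bytes(out)


# Toy data: a 16-byte "bitstream" and a made-up three-cell signature.
route = (5, 3, 4, 4, 3, 6, "datad")
cache = {route: [(2, 0), (2, 7), (9, 3)]}
baseline = bytes(16)
patched = apply_route(baseline, cache, route)
print(patched.hex())
```

Because the operation is an XOR, applying the same signature twice restores the baseline, and a route that was never compiled fails loudly instead of silently emitting nothing.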
Plan D' — a 12,000-compile factory
Step 1 was a dry-run: we parsed NEORV32's static-timing report
(a 3.6 GB text dump generated by quartus_sta) and extracted every routing
edge the compiler actually used. After dedup and self-loop removal we
ended up with 12,259 unique edges, each one a 7-tuple
(sx, sy, sn, dx, dy, dn, port). This is the "order list".
Step 2 was the factory itself: fuzz/plan_d_prime_factory.py. It
spawns 12 parallel Quartus worker processes, each one assigned one
edge at a time. The worker writes a minimal two-LUT Verilog design
(lut1 → lut2 with a clock register to keep Quartus from optimizing
it away), forces both LUTs into the exact coordinates the edge
describes, runs a full Quartus compile, and saves the resulting RBF
as nv_pair_X{sx}Y{sy}N{sn}_to_X{dx}Y{dy}N{dn}_{port}.rbf. Before it
starts, the factory filters out edges that its two-LUT compile
template physically cannot place: odd-N self-mappings collapsed by
the N-normalizer, IO-ring coordinates, and non-LAB columns
(X ∈ {15, 20, 27}, the M9K / multiplier blocks). 12,259 raw edges
come down to 11,762 placeable edges; the remaining 497 are
unrepresentable in this strategy, not lost. At a steady-state rate of
~0.28–0.29 compiles per second, the final run walked the whole list
in 11 hours 16 minutes with zero failures (11,715 ok / 0 fail on
this run, plus 47 placeable edges already on disk from an earlier
partial run).
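The edge tuple, the output file name, and one of the three pre-filters can be sketched as follows. Only the non-LAB-column rule is implemented, because the text gives its exact column set; the IO-ring and odd-N rules are left out rather than guessed. The class and function names are hypothetical.

```python
from typing import NamedTuple

NON_LAB_COLUMNS = {15, 20, 27}   # M9K / multiplier columns, from the text


class Edge(NamedTuple):
    sx: int
    sy: int
    sn: int
    dx: int
    dy: int
    dn: int
    port: str

    def rbf_name(self) -> str:
        """File name following the nv_pair naming pattern."""
        return (f"nv_pair_X{self.sx}Y{self.sy}N{self.sn}"
                f"_to_X{self.dx}Y{self.dy}N{self.dn}_{self.port}.rbf")


def in_lab_column(e: Edge) -> bool:
    """Non-LAB-column rule only; IO-ring and odd-N rules are omitted here."""
    return e.sx not in NON_LAB_COLUMNS and e.dx not in NON_LAB_COLUMNS


edges = [Edge(5, 3, 4, 4, 3, 6, "datad"), Edge(15, 3, 0, 4, 3, 6, "dataa")]
placeable = [e for e in edges if in_lab_column(e)]
print(placeable[0].rbf_name())   # nv_pair_X5Y3N4_to_X4Y3N6_datad.rbf
```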
Every finished RBF is XOR-diffed against a neutral
nv_zero_global.rbf baseline, and the diff cells are stored in
results/nv_route_cells.json keyed by the edge 7-tuple. That file is
then merged with the legacy green-zone cache into
results/route_cells_full.json — a unified 7-tuple sig-cache keyed
"sx,sy,sn->dx,dy,dn,port", 13,487 merged entries. The legacy
entries get lifted with sn = 0 so the old green-zone regression
tests still pass unchanged. Cross-referenced against the full
12,259-edge NEORV32 order list, the merged cache covers 11,762 / 12,259
(95.9%), i.e. 100% of every edge the factory could place. The 4.1%
gap is exactly the filter set described above, not a factory miss.
Interestingly, zero of the 1725 legacy green-zone entries are hit
by NEORV32; all coverage comes from Plan D' factory entries. The
legacy cache stays in the merged file because it is still
load-bearing for the green-zone regression suite, but it is dead weight
for real-world designs. See memory/legacy_cache_zero_nv32_hits.md.
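The merge and the coverage figure can be sketched like this. The key string follows the "sx,sy,sn->dx,dy,dn,port" format quoted above; the in-memory shape of the two input caches, and the choice to let factory entries win on a key collision, are our assumptions.

```python
def key7(sx, sy, sn, dx, dy, dn, port) -> str:
    return f"{sx},{sy},{sn}->{dx},{dy},{dn},{port}"


def merge_caches(legacy: dict, factory: dict) -> dict:
    """Lift legacy 6-tuple entries to sn = 0, then overlay factory entries."""
    merged = {key7(sx, sy, 0, dx, dy, dn, port): cells
              for (sx, sy, dx, dy, dn, port), cells in legacy.items()}
    merged.update({key7(*edge): cells for edge, cells in factory.items()})
    return merged


def coverage(order_list, merged) -> float:
    """Fraction of order-list edges that have a cache entry."""
    return sum(key7(*e) in merged for e in order_list) / len(order_list)


# Toy caches with made-up cells.
legacy = {(10, 10, 11, 10, 0, "dataa"): [(6000, 1)]}
factory = {(5, 3, 4, 4, 3, 6, "datad"): [(7000, 2)]}
merged = merge_caches(legacy, factory)
order_list = [(5, 3, 4, 4, 3, 6, "datad"), (15, 3, 0, 4, 3, 6, "dataa")]
print(sorted(merged), coverage(order_list, merged))

# Sanity check on the reported numbers.
print(1725 + 11762, f"{11762 / 12259:.1%}")   # 13487 95.9%
```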
Why the source-N dimension matters
A common question: why do we need sn in the key? Can't we just use
(sx, sy, dx, dy, dn, port) like the old cache?
The answer is that in a real CPU each LAB (logic array block) has 16
LE slots, and different slots have different downstream routing
envelopes. A flip-flop driving out of N=14 does not use the same
switch-boxes as a combinational cell driving out of N=4, even when
both live at X=5, Y=3 and target the same destination port.
Collapsing those two into one key would make the cache give the wrong
answer for one of them. Keeping sn explicit costs us a few MB of
JSON and buys correctness for every source that isn't at N=0.
The hero test: X=5, sn=4, on silicon
With the factory warmed up we picked the first edge that satisfied
three criteria: (a) its source column was outside the CE6
whitelist (a "jailbreak" column we only unlocked by lying to Quartus
and claiming the chip is an EP4CE10); (b) its source-N was
non-zero (so the 7-tuple path was actually exercised, not the legacy
sn=0 fallback); (c) the factory had already produced the
corresponding nv_pair RBF on disk so we had ground truth to diff
against.
The edge was ROUTE X5Y3N4 -> X4Y3N6.datad. Source X=5 is a column
the Quartus CE6 software whitelist forbids: it simply refuses to
place a LUT there. But we already knew (from the 2026-04-07 jailbreak
probe) that X=5 is perfectly functional on silicon; the restriction
is pure software. Plan D' deliberately routes through X=5 by lying
about the device.
We wrote the hero test as a single line of FASM:
ROUTE X5Y3N4 -> X4Y3N6.datad
That file went into fasm2rbf.py, which looked the 7-tuple key up in
the merged cache, copied the 144 cells it found, XOR'd them into
nv_zero_global.rbf, patched the CRAM CRCs frame by frame, and wrote
a 368,011-byte RBF. We then compared this file byte-for-byte against
the factory's ground-truth nv_pair_X5Y3N4_to_X4Y3N6_datad.rbf:
CRAM band (bytes ≥ 5282): 0 differing bytes ← exact match
Header band (bytes < 5282): 6 differing bytes ← Quartus device-id / seed
bad CRC frames: 0 / 1727 ← all pass
The CRAM is the part of the RBF that the FPGA's configuration state machine actually validates. Zero CRAM differences means our FASM-generated RBF is functionally identical to Quartus's own output. The six header-band differences sit in bytes 43–74, which carry Quartus's compile timestamp and seed hash — the configuration state machine never looks at them.
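A band-split comparison like the one reported above takes only a few lines. This is a sketch with synthetic buffers: the 5282 boundary and the 368,011-byte size come from the text, while the function name and the flipped offsets are made up.

```python
CRAM_START = 5282   # header band is below this offset, CRAM band at or above it


def diff_by_band(ours: bytes, reference: bytes) -> dict[str, list[int]]:
    """Offsets of differing bytes, split into header and CRAM bands."""
    if len(ours) != len(reference):
        raise ValueError("RBF sizes differ")
    bands = {"header": [], "cram": []}
    for off, (a, b) in enumerate(zip(ours, reference)):
        if a != b:
            bands["header" if off < CRAM_START else "cram"].append(off)
    return bands


# Synthetic stand-ins for the two files: six header bytes differ, CRAM is equal.
reference = bytes(368_011)
ours = bytearray(reference)
for off in (43, 44, 50, 60, 73, 74):
    ours[off] ^= 0xFF
bands = diff_by_band(bytes(ours), reference)
print(len(bands["header"]), "header diffs,", len(bands["cram"]), "CRAM diffs")
```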
We flashed the FASM-generated RBF to the AX301 via openFPGALoader.
It loaded cleanly (Done, no CRC error, no EPCS fallback), and the FPGA
drove its LED pins with the expected constant outputs from the two
LUTs. This was the first silicon proof that:
- fasm2rbf reproduces factory-grade CRAM from a 7-tuple cache hit
- Plan D' cells are silicon-accepted even outside the training corpus
- The CE6-forbidden column X=5 configures and runs under FASM control
The hero test validated the whole stack in one flash.
Two negative results worth remembering
Negative result 1 — passive R4 dark-index mining is impossible.
Our routing bit model _R4_BASE_PREV has 24 of 37 theoretical R4
switch I-indices mapped; the 13 "dark" indices never appeared in the
green-zone corpus because Quartus never picked them under low routing
pressure. We hoped that a full NEORV32 RBF, compiled under real
congestion, would light up the dark indices and let us recover their
BASE addresses by XOR-diffing against a neutral baseline. It did not
work. The NEORV32 diff set turned out to be so dense (113k cells, ~4%
of all CRAM bits) that any candidate BASE address scores a 55-61%
hit rate across tested wires by pure chance. A null test confirmed
the method cannot even recover the BASEs we already know to be
correct for I=0, I=1, I=2, I=10. Lesson: passive observation
needs a sparse signal. Dense diffs drown out the pattern you are
looking for. We marked Task G closed-negative and moved on — the
24/37 coverage turned out to be unnecessary anyway because the
signature cache short-circuits the formula path for every route it
has seen.
Negative result 2 — "schedule the biggest hubs first" is wrong. When the factory was 22% through its 12k edges, we asked: could we finish the hero-test-relevant coverage faster by reordering the remaining 9,500 compiles to do the highest-fanout sources first? The intuition was "main roads before alleys" — finish the big architectural hubs early, let the small leaf sources wait. We wrote a pure simulator (no side effects on the live factory) and ran it. The result was the opposite of the intuition: fanout-first order delays the "sources fully covered" metric by up to 4.2 hours compared to the current lexicographic order. The reason is that while the factory is spending 12 minutes grinding through a single 203-edge hub, the current lex order would be finishing ~60 small sources in the same window. Lex order accidentally clusters small sources at the head of the sorted list and is near-globally optimal for this metric. Lesson: never propose a scheduling change without a simulator proving it is strictly better on the target metric; "obvious" hub-first heuristics can be wrong.
Both negative results are archived in memory/r4_dark_passive_mining_dead.md
and memory/fanout_first_scheduling_worse.md for the benefit of
anyone who considers the same approaches in the future.
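The scheduling result in negative result 2 can be reproduced in miniature. This is not the simulator we ran: it is a toy model that assumes a single serial queue at about 3.5 seconds per edge (roughly the reciprocal of the 0.28-0.29 compiles per second quoted earlier) and made-up fanouts of one 203-edge hub plus sixty 3-edge sources.

```python
def mean_completion(fanouts, seconds_per_edge=3.5):
    """Mean time (s) at which a source has all of its edges compiled,
    processing sources one after another in the given order."""
    t = 0.0
    total = 0.0
    for n in fanouts:
        t += n * seconds_per_edge
        total += t
    return total / len(fanouts)


fanouts = [203] + [3] * 60                      # toy numbers
hub_first = mean_completion(sorted(fanouts, reverse=True))
small_first = mean_completion(sorted(fanouts))
print(f"hub first:   {hub_first / 60:.1f} min mean time to full coverage")
print(f"small first: {small_first / 60:.1f} min mean time to full coverage")
```

In this model the hub alone occupies the queue for about 12 minutes, during which every small source waits. Ordering small jobs first is the classic shortest-job-first result for mean completion time.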
A postscript: what "stuck at 96%" actually meant
A small debugging story from the day the factory finished, because it
teaches a lesson that is almost more useful than the Phase 4.5 result
itself. The factory had been running in the background all day. At
the end of the afternoon we checked in and the progress counter read
11,762 / 12,259 — 95.9%, apparently stuck with nothing happening,
and an hour later it read exactly the same number. The knee-jerk
reading was "the factory crashed at 96% and left 497 edges unfinished,
we need to restart it and investigate".
We almost did exactly that. What stopped us was checking the actual
log file (tmp/nvfac.log) before touching anything. The log showed
a perfectly clean final line:
== done == ok=11715 fail=0 jb_fail=0 elapsed=676.5min
The factory hadn't crashed — it had finished normally, 11 hours and 16 minutes after launch, with zero failures. So where did the "497 missing edges" come from?
It came from reading two different denominators and assuming they
were the same number. The progress counter we were watching reports
len(done) out of 12,259 (the raw edge count from the STA dump).
But the factory, before it starts, filters 12,259 down to 11,762
by removing edges its two-LUT compile template cannot place — odd-N
self-mappings, IO-ring coordinates, non-LAB columns. The filtered
edges never enter the work queue, so they never get marked done, so
len(done) tops out at 11,762, not 12,259. Once
the factory hits 11,762 / 12,259 it is complete, not stuck.
The lesson: when a long-running pipeline "freezes" near the end, read the actual log before you restart anything. A progress counter whose denominator is wrong looks identical to a crashed process whose numerator got stuck — both produce the same flat number on your status check. The difference is exactly one grep of the log. Restarting a pipeline that has already finished is at best wasteful (you spawn 12 Quartus workers for no reason) and at worst destructive (if the "fix" touches the checkpoint file you can lose the work the pipeline already did). The reflex "something looks wrong, let me restart it" is one of the most expensive reflexes in long-running computing, and almost every time the right first move is instead "something looks wrong, let me read the log".
This also explains why our README Phase 4.5 section quotes coverage as "11,762 / 12,259 (95.9%) — 100% of placeable edges". Both numbers are true simultaneously: the factory achieved 100% of what it could, and that 100% is 95.9% of the original edge list. Stating only the 95.9% makes the result look worse than it is; stating only the 100% hides the 4.1% of NEORV32 structure that our current compile template cannot reach. Both numbers are worth writing down.
The 24-bit counter that wouldn't blink: a missing primitive, not a codec bug (2026-04-11)
This is the most expensive lesson in the project so far, and it's worth telling at university-textbook level because the same trap is going to catch every open-source FPGA toolchain that bolts an old vendor chip onto a generic place-and-route engine.
Setup. Phase 5.3 finally had the whole open flow assembled: Yosys
synthesizes the Verilog into LUT4+DFF cells; nextpnr-generic places
them on the EP4CE6 with our hand-written chipdb_gen.py; np2fasm.py
walks the routed JSON and converts each LE+arc into FASM directives;
fasm2rbf.py consumes the FASM, looks up our 13,487-entry sig-cache
for each route, and writes a CRC-valid 368,011-byte .rbf. The smoke
test was the simplest sequential design we could think of:
module counter_top(input CLK, output LED);
    reg [23:0] cnt;
    always @(posedge CLK) cnt <= cnt + 1;
    assign LED = cnt[23];
endmodule
A 24-bit counter. With a 50 MHz clock the top bit has a period of 2^24 / 50 MHz ≈ 0.34 s, so it cycles at roughly 3 Hz and the LED should be visibly blinking. The build ran end-to-end in about 5 seconds: 31 LUT directives, 24 DFF directives, 97 ROUTE directives, all CRC frames clean, all LI MUX safety checks green. We flashed it.
The LED was constantly on. We tried again. Constantly off. We tried 10 different rebuilds with various phase-ordering and stripping fixes; the LED stayed in one state or the other but never blinked.
The stretch of red herrings. Each time we flashed and saw a
constant LED, we assumed the codec was almost right and one more
small fix would make it run. We chased — and actually fixed — five
real bugs in fasm2rbf.py and the sig-cache:
- The LutCodec was using a "union of all minterm patterns" set to figure out which CRAM cells belonged to a LUT. That works on a sparsely-populated LAB (one or two LEs in use), but the counter needed 30 LEs in two adjacent LABs, and the 50+ LAB-shared bits in each LE's calibration started cross-contaminating each other. Workaround: use predict_sram(0xFFFF), which XOR-cancels every cell that appears in an even number of minterm patterns and leaves exactly the 16 true truth-table cells per LE.
- 160 sig-cache entries for routes inside the (X=4, Y=18) and (X=4, Y=19) LABs were missing. We re-mined them with a clean two-LUT pair template and got beautiful 135-cell-per-entry results.
- A "self-loop mining template" we found in an earlier session had actually been mining the wrong port: its Verilog put a flip-flop between the two LUTs and hard-coded lut2.dataa(reg) regardless of which port the caller asked for. The 160 entries it produced were unusable. (We threw them out and re-mined cleanly.)
- Per-LAB clock-distribution cells overlapped, in two cases, with the truth-table cells of one specific LE (X4Y19N4). The build was clearing the clock cells when it reset the LUT region. Fix: SET the per-LAB CLK after the LUT phase, not before.
- The sig-cache mining baseline was nv_zero_global.rbf, which itself contains a small lut1+lut2 stub at (X10Y10/X10Y11). Routes mined against it leak 1-3 LI MUX cells in those baseline LABs. Fix: post-bitgen, walk the LI structure and toggle any cells in non-design LABs back to baseline.
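The first fix in the list is easiest to see on toy data. The sketch below does not reproduce predict_sram; it only illustrates why XOR-combining minterm patterns removes LAB-shared cells that a union keeps. It assumes every minterm pattern contains one truth-table cell plus the same 50 shared cells, and all names are hypothetical.

```python
from functools import reduce


def union_cells(patterns):
    """Old approach: every cell seen in any minterm pattern."""
    return set().union(*patterns)


def xor_cancel(patterns):
    """Keep only cells that appear in an odd number of patterns."""
    return reduce(set.symmetric_difference, patterns, set())


truth = [("tt", i) for i in range(16)]        # one true cell per minterm
shared = {("lab", j) for j in range(50)}      # LAB-shared calibration cells
patterns = [{truth[i]} | shared for i in range(16)]

print(len(union_cells(patterns)))   # 66: the 16 true cells plus 50 shared ones
print(len(xor_cancel(patterns)))    # 16: shared cells appear 16 times and cancel
```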
Every one of those fixes was real. None of them was the actual problem. After applying all five, the LED was still constant.
The thirty-second test we should have run on day one. Eventually,
out of frustration, we did the obvious thing: take the same Verilog
above, hand it to Quartus directly, and flash whatever Quartus
produced. The Quartus build was a 368,011-byte .rbf, just like
ours. We flashed it.
It blinked. Visibly, at about 3 Hz, exactly as expected.
So the silicon worked. The clock worked. The pin map (CLK on E1,
LED on G15) was right. The openFPGALoader was right. The board
was right. The Verilog was right. The only thing wrong was our
bitstream.
That meant we could now compare two .rbf files for the same
Verilog: ours and Quartus's. We diffed each against the empty
baseline nv_zero_global.rbf and counted cells:
Quartus reference (blinks): 367 cells, mostly in CRAM cols 47-48
Our codec build (constant): 1,185 cells, mostly in CRAM cols 4-7
Cells in common: 55
The two builds barely overlapped at all. They weren't fighting over the same region of the chip — they were placing the design in completely different physical locations using completely different LE primitives.
The actual root cause. Cyclone IV LEs have a special direct
wire called the "carry chain": each LE's cout output goes
straight into the next LE's cin input as a dedicated wire that
does not pass through the local interconnect MUX at all.
Hardware adders use this to propagate the carry bit at the speed
of a wire, instead of the speed of a routing decision.
Quartus, when it sees cnt + 1, recognizes that this is an
arithmetic operation, switches the LE into "arithmetic mode," and
chains 24 LEs in a column with cout → cin direct wires. One
LE per counter bit, no LI MUX at all for the carry signal.
Yosys + nextpnr-generic don't know any of this. Our chipdb_gen.py
declares the LEs and the LI MUX wires and the C4/R4/R24 routing
tracks, but it does not declare the carry-chain cout → cin
direct wires, because we never modeled them. So when Yosys saw
cnt + 1, it had no carry primitive to map to, and it expanded the
addition the only way it knew how — into ordinary 4-input LUTs.
The result is a ripple adder in which each output bit is computed as something like
A ⊕ B ⊕ Cin, and the carry-out is (A ∧ B) ∨ (Cin ∧ (A ⊕ B)).
Each counter bit needed about 4 LEs to express that, so the 24-bit
counter exploded into 30+ LEs. And each bit needed its own previous
value as an input — which means a wire from the LE's flip-flop
output back into one of its own LUT input ports. A self-loop.
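To make the ripple structure concrete, here is a bit-level model of cnt + 1 built from the two formulas above. It is a plain software sketch of the logic, not of how Yosys packs it into LUT4 cells. Note that each bit's next value depends on its own current value `a`, which is the self-loop in question.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One bit of a ripple adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_increment(value: int, width: int = 24) -> int:
    """value + 1 (mod 2**width), one full adder per bit."""
    result, carry = 0, 0
    for i in range(width):
        a = (value >> i) & 1              # the bit's own current value (feedback)
        b = 1 if i == 0 else 0            # the constant 1 being added
        s, carry = full_adder(a, b, carry)
        result |= s << i
    return result


print(hex(ripple_increment(0x00FFFF)))    # 0x10000
print(ripple_increment(0xFFFFFF))         # 0 (wraps around)
```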
That's where our toolchain hit a wall it couldn't get over with any amount of patching. The sig-cache mining template is built on "compile two LUTs at two different locations and diff the resulting bitstreams against the empty baseline." It cannot represent a self-loop, because you can't put two distinct LUTs at the same LE coordinate. We tried a different template that swaps the same LUT between an external input and a self-feedback input — but between the two compiles Quartus was free to re-pick I/O pins, re-route everything, and the diff included so much unrelated noise that the resulting "self-loop entries" were 100-700 cells of random junk instead of the small handful of LI MUX bits we actually needed.
Without clean self-loop sig-cache entries, the 24 self-feedback routes the design needed delivered no signal. Without those signals, every counter bit's flip-flop saw a constant input. The flip-flops latched their power-up value and never changed. The LED stayed on the power-up state of bit 23 — which happened to be 1 with one build and 0 with the next.
The lesson, in three sentences.
When an open-source toolchain produces a bitstream that "should" work but doesn't, always cross-check against the vendor's own build of the same Verilog as ground truth before patching the codec. A 30-second compile of the test design in Quartus, followed by a flash and a byte-diff, will tell you immediately whether you're hunting a codec bug (cells in the right column, wrong values) or a missing-primitive bug (cells in the wrong column entirely, because the front-end emitted a different topology). The two cases need very different fixes, and treating them the same wastes days.
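One way to mechanize that first triage step is to measure how much the two diff-against-baseline cell sets overlap. This is a rough sketch: the 0.5 threshold is an arbitrary illustrative cut-off, the function name is hypothetical, and the toy sets only reproduce the cell counts reported above (367, 1,185, 55 in common), not the real cells.

```python
def triage(ours: set, reference: set, threshold: float = 0.5) -> str:
    """Classify a failing build by Jaccard overlap of its cells with the vendor's."""
    overlap = len(ours & reference) / len(ours | reference)
    kind = "codec bug" if overlap >= threshold else "missing primitive"
    return f"{kind} suspected (overlap {overlap:.1%})"


# Toy cell sets sized like the counter forensics above.
common = set(range(55))
reference = common | set(range(1000, 1000 + 312))    # 367 cells in total
ours = common | set(range(5000, 5000 + 1130))        # 1,185 cells in total
print(triage(ours, reference))
```

A high overlap means both builds touch the same region and the values are worth diffing; a very low one suggests the front-end emitted a different topology and no amount of codec patching will help.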
For students reading this — there's a subtler lesson underneath.
A modern FPGA is not "a sea of LUTs and a routing fabric." It's a
deliberately heterogeneous collection of primitives: LUTs, FFs,
carry chains, BRAMs, DSP multipliers, PLLs, IOBs, GCLK trees. The
vendor's tools know every one of those primitives exists and treat
them as first-class citizens. A generic place-and-route tool only
sees what your chipdb tells it about. Anything you forgot to put
in the chipdb is something the vendor will quietly beat you on, by 3-10×
in cell count and by an unbounded factor in performance. The whole point
of the next phase of this project (Phase 5.4) is to teach the chipdb
about the carry chain so that cnt + 1 becomes 24 LEs in a column
again, the way it physically wants to be.
This is also why open-source FPGA toolchains have historically focused on the smallest possible devices first. iCE40 has almost no heterogeneous primitives — it's mostly LUTs, FFs, and BRAMs — and that's why Project IceStorm could land a complete open flow first. Cyclone IV is one or two device generations richer (it has carry chains, DSP multipliers, M9K BRAMs, PLLs, soft I/O standards), and each one of those richer features is a separate cliff that the generic flow falls off until someone teaches the chipdb about it. The good news is that each cliff is climbed exactly once: once you have a carry primitive in your chipdb, every future design that does arithmetic gets it for free.
The fixes we earned during the wild-goose chase are still valuable
for any future design that needs to share a LAB between many LEs —
the LutCodec workaround, the cleanly re-mined inter-LE pair entries,
the per-LAB clock ordering rule, the post-bitgen LI cleanup.
None of them fix the counter, but together they form a working
template (tmp/m5_counter/build_counter_sigcache.py) for
high-density combinational and FF-only designs. Phase 5.4 will turn
the carry chain into the next cliff we climb.
Phase 5.4 follow-up: the carry chain cliff, climbed (2026-04-13 → 04-14)
The counter forensics story above ends at the edge of a cliff. This section is what happened on the climb. It is written for students and tries to stay concrete.
Where the arithmetic bits actually live. Phase 2.4 once claimed
"92 arithmetic/carry-chain bits per LE" inside each LE's CRAM region.
That finding turned out to be wrong — it had been mined with
VIRTUAL_PIN, which (as Phase 5.0 later discovered for M9K/DSP) makes
Quartus emit ghost routing cells that disappear the moment you
recompile with real pins. When we re-mined with real pins, the
per-LE arithmetic cells vanished; they had never been there.
What is actually there: arithmetic mode is a LAB-level mode switch, not a per-LE setting. When any LE in LAB (X, Y) turns on arith mode, a specific pattern of ~100 bits lights up in the block band (frames 1692-1738) — the same region of the bitstream that enables M9K RAM blocks and DSP multipliers. There are no per-LE arithmetic CRAM cells. The mental model: a LAB has 16 LEs all sharing the same arith-mode configuration, so the bits that configure it are stored once per LAB, not 16 times. This mirrors how a CPU works — you don't have a separate ALU for every register pair, you have one ALU with a mode field.
The blob is position-independent across LABs. Once we had the
~100 cells that activate arith mode in LAB (4,18), we asked: does
LAB (10,18) use a different 100 cells? LAB (4,10)? The "triangle
test" (2026-04-14) built the same 8-bit counter at all three LABs and
diffed each against its own identity twin. All three diffs produced
byte-identical cell sets — the same 100 offsets in the block band
light up regardless of which LAB hosts the counter. We named this the
v4 universal blob and shipped it as
results/arith_blockband_v4.json (100 SETs + 4 CLEARs). Teach the
FASM codec one table and it works everywhere on the chip.
The blob is per-WIDTH, not per-N-slot. A 16-bit counter needs 197 block-band cells, not 100. A 24-bit counter crossing two LABs needs 295. So the blob depends on how many LEs are in the carry chain — but does it also depend on which N-slots within a LAB you use? A Cyclone IV LAB has 16 LEs at N slots 0, 2, 4, …, 30; placing eight of them in the lower half (N=0..14) versus the upper half (N=16..30) is two physically distinct placements.
Phase 1 sweep (2026-04-14, 42 Quartus builds, no hardware): for each
width w ∈ {2, 3, …, 16}, build a w-bit counter at LAB (4,18)
twice — once lower-half, once upper-half. The fitter report confirmed
both placements were honored. Then diff each counter against a
matching identity design. Result: at every width, the lower-half
diff and the upper-half diff were byte-identical — same offsets,
same bit positions. Placing the same eight LEs in a different half
of the same LAB does not change a single arith bit in CRAM. We had
feared needing to mine 2^16 N-slot combinations; it turns out a
per-width table (one entry per chain length) is enough. That table
now lives at results/arith_blockband_by_width.json and covers
widths 2..16 single-LAB plus the 16+8 cross-LAB case; round-trip
verification (apply blob to identity, diff against counter) gives
zero data and zero block-band differences for every entry.
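The round-trip check can be sketched as follows. The JSON layout assumed here (width as a string key, with "set" and "clear" lists of [offset, bit] pairs) is hypothetical and chosen only for illustration; the actual schema of results/arith_blockband_by_width.json is not described in this document.

```python
def lookup(table: dict, width: int) -> dict:
    """Fetch the blob for one chain length (JSON object keys are strings)."""
    return table[str(width)]


def apply_blob(identity: bytes, blob: dict) -> bytes:
    """Force SET cells to 1 and CLEAR cells to 0 in a copy of the identity build."""
    out = bytearray(identity)
    for off, bit in blob["set"]:
        out[off] |= 1 << bit
    for off, bit in blob["clear"]:
        out[off] &= ~(1 << bit) & 0xFF
    return bytes(out)


# Toy table and 6-byte "bitstreams" with made-up cells.
table = {"8": {"set": [[3, 0], [3, 5]], "clear": [[4, 7]]}}
identity = bytes([0, 0, 0, 0x00, 0x80, 0])
counter = bytes([0, 0, 0, 0x21, 0x00, 0])     # stand-in for the Quartus counter build

rebuilt = apply_blob(identity, lookup(table, 8))
mismatches = [i for i, (a, b) in enumerate(zip(rebuilt, counter)) if a != b]
print(mismatches)   # [] means the round trip is clean
```

Unlike the XOR used for route signatures, SET and CLEAR are absolute writes, so applying the blob twice gives the same result as applying it once.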
Two myths debunked on the way.
Myth 1 — "every LE has a FF-enable CRAM bit." We tried to mine
that bit three different ways and always got noise. Cross-checking
against Quartus: every Cyclone IV LE has a flip-flop that is
always physically present. Whether you use the flip-flop or
the combinational output is selected by downstream routing, not by a
CRAM bit. Our old dff_cells_mined.json was routing infrastructure
noise. The FASM DFF directive is now a parsed no-op.
Myth 2 — "carry chain needs external feedback routes." An N-bit
counter is Q <= Q + 1, so each FF's Q output feeds back into the
ALU's B input. Our early Yosys techmap inserted a "Route-A buffer"
LUT to carry that feedback through the local interconnect. Doing so
doubled the LE count and created self-feedback routes that the
sig-cache cannot mine cleanly. When we looked at Quartus's own
counter: zero external route cells for the feedback. Cyclone IV
has an internal wire from the FF output directly into the ALU's B
input; no LI MUX is involved. The fix in synth/ep4ce6_map.v was to
bind the FF's Q directly to CE6_CARRY.B with no intermediate
buffer. An 8-bit counter now uses 8 LEs and 0 route cells, same as
Quartus.
What's on silicon right now. On 2026-04-13 we flashed an 8-bit
counter assembled entirely from FASM (identity baseline + eight
LUT_ARITH = 0x0000 directives) onto an AX301 board. The LED
blinked at the expected rate, behavior bit-identical to Quartus's
own compile of the same Verilog. An identity Q <= Q negative
control produced a dark LED. That is the full proof: the block-band
arith blob is the real activation, the universal blob works at the
chosen LAB, LE-internal feedback is sufficient, and the FASM
LUT_ARITH directive is end-to-end wired up correctly. Widths 9..16
and the 24-bit cross-LAB pattern are shown byte-identical to Quartus
output under diff, but await hardware re-verification when the
board is next on the desk.
One sentence take-away. The carry chain was not a second set of cells per LE (as we had guessed); it is a single LAB-wide mode switch stored in the same block band that holds M9K and DSP activation, and its bit pattern depends on the chain's length but not on which LEs in the LAB are part of it.
Open Toolchain End-to-End: Native Path and ζ Escape Hatch
Two separate paths reach silicon from the open bitstream codec — a
native path (Yosys → nextpnr → FASM) and an escape hatch (take
a Quartus RBF, diff against a baseline, emit pure BIT directives).
Both produce valid flashable bitstreams via the same fuzz/fasm2rbf.py.
The native path is the long-term goal; the escape hatch is the
guaranteed-to-work fallback for any design Quartus can build.
Native path
.v / .vhd
│
├── Yosys techmap (synth/ep4ce6_map.v, synth/prims.v)
│ → LUT4, DFF, CE6_CARRY, EP4CE6_M9K, GENERIC_IOB
│
├── nextpnr-generic (chipdb from fuzz/chipdb_gen.py)
│ → placed + routed JSON
│
├── synth/np2fasm.py
│ → FASM (LUT, ROUTE, LUT_ARITH, M9K_MODE, IOB_*, GCLK_PIN,
│ LAB_CLK_SEL, LAB_CLK_SEL_LE, OUTROUTE_G15, IOB_PAD_NV)
│
├── fuzz/fasm2rbf.py (+ patch_rbf_crc)
│ → .rbf (368 011 B, CRC-patched)
│
└── openFPGALoader -c usb-blaster
HW-validated designs along this path: registered AND gate (KEY2&KEY3→DFF→LED0) at LAB(16,4), 5-bit carry counter, M9K smoke 9×512 RAM, clock-pin pipeline on all 12 F17-reachable pins.
ζ escape hatch (Quartus gold → BIT FASM)
For any design that is too dense for the current chipdb routing model
(NEORV32-scale, ~6000+ LEs), scripts/bit_workaround/quartus_gold_to_bit_fasm.py
provides a deterministic bypass:
design.v / .vhd
│
├── Quartus compile → design.rbf (gold)
│
├── scripts/bit_workaround/quartus_gold_to_bit_fasm.py
│ → BIT-only FASM (one BIT directive per differing bit vs
│ results/rbf/nv_zero_global.rbf baseline)
│
├── fuzz/fasm2rbf.py (+ patch_rbf_crc)
│ → .rbf byte-identical to Quartus gold (cmp confirms)
│
└── openFPGALoader -c usb-blaster
This is useful because:
- It proves the codec round-trip is correct at SoC scale — the rebuilt RBF is literally the same bytes Quartus produced.
- It is a real escape hatch. Users who hit the chipdb routing wall have a bounded workflow: compile in Quartus once, everything downstream stays open-toolchain.
- The BIT FASM is an inspectable intermediate — auditable line-by-line against the codec's CRAM geometry, usable as a substrate for bitstream mutation experiments (see "Long-term direction" below).
HW-validated: two_lab AND→DFF cross-LAB route (2026-04-22), lits_pair route-family reconstruction (2026-04-23), and the full NEORV32 bootloader (4712 LE / 2367 DFF / 19 M9K) on AX301 silicon at 19200-8N1 UART (2026-04-23). ζ + fasm2rbf total wall time ≈ 0.5 s regardless of design density; it scales with RBF size (fixed 368 011 B), not LE count.
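The ζ idea (one BIT directive per bit that differs from the baseline, then replay those directives onto the baseline) can be illustrated with a small Python sketch. The textual form `BIT <offset> <bit> = <value>` is a hypothetical syntax chosen for the example, not the project's actual FASM dialect, and CRC patching is deliberately left out.

```python
from typing import List


def rbf_to_bit_lines(gold: bytes, base: bytes) -> List[str]:
    """Emit one hypothetical BIT line per bit where gold differs from base."""
    if len(gold) != len(base):
        raise ValueError("gold and baseline must have the same length")
    lines = []
    for off, (g, b) in enumerate(zip(gold, base)):
        delta = g ^ b
        for bp in range(8):
            if (delta >> bp) & 1:
                lines.append(f"BIT {off} {bp} = {(g >> bp) & 1}")
    return lines


def apply_bit_lines(base: bytes, lines: List[str]) -> bytes:
    """Replay BIT lines onto a copy of the baseline image."""
    out = bytearray(base)
    for line in lines:
        _, off, bp, _, val = line.split()
        if int(val):
            out[int(off)] |= 1 << int(bp)
        else:
            out[int(off)] &= ~(1 << int(bp)) & 0xFF
    return bytes(out)


# Toy images standing in for the baseline and the Quartus gold RBF.
base = bytes([0x00, 0xF0, 0x0F])
gold = bytes([0x81, 0xF0, 0x0E])

bit_lines = rbf_to_bit_lines(gold, base)
rebuilt = apply_bit_lines(base, bit_lines)
```

Because every differing bit is recorded explicitly, replaying the lines reproduces the gold image exactly in this toy model, which is why the work grows with image size rather than with design density.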
Linux extended test (2026-04-24): boot_linux.py --rbf drove the
full Quartus-flow host script against the ζ-rebuilt RBF — stage2
upload, baud switch, kernel xmodem (1.5 MB, CRC match), DTB +
initramfs all OK, Linux 6.6.83 booted on RISC-V and ran for
~150 s (devtmpfs mounted, ttyNEO0 console attached, exec'd /sbin/init)
before a kernel panic at kernel/cred.c:103. The panic is not a
ζ regression — the RBF is SHA256-identical to Quartus gold; the
panic is a RISC-V nommu kernel edge case. The ζ validation objective
(open toolchain produces a silicon-functional NEORV32 bitstream)
is met.
ζ production pipeline (CI-friendly)
The three-step ζ conversion (Quartus → BIT FASM → rebuilt RBF → flash →
UART verify) is wrapped by scripts/bit_workaround/zeta_pipeline.py
into a single command with machine-readable gates:
# RBF input, round-trip + byte-identity only (no hardware):
python3 scripts/bit_workaround/zeta_pipeline.py gold.rbf
# Quartus project input (runs map/fit/asm/cpf first):
python3 scripts/bit_workaround/zeta_pipeline.py path/to/design.qpf
# Full end-to-end with board:
python3 scripts/bit_workaround/zeta_pipeline.py gold.rbf \
    --flash --uart-seconds 10 --baud 19200 --expect "NEORV32"
Exit 0 iff every requested gate passed; --json emits a machine-readable
report. The pipeline hard-gates on cmp -s rebuilt gold after fasm2rbf —
a regression anywhere downstream in the codec surfaces immediately and
before any flash cycle is wasted.
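A minimal sketch of the gating behaviour described above, assuming a report shape of our own invention: the gate names, the `"skipped"` state and the dictionary keys are hypothetical and are not taken from zeta_pipeline.py. The point it illustrates is the ordering: byte identity is checked first, and later (hardware) gates do not run if it fails.

```python
import hashlib
from typing import Callable, Iterable, Tuple


def run_gates(gold: bytes, rebuilt: bytes,
              hardware_gates: Iterable[Tuple[str, Callable[[], bool]]] = ()) -> dict:
    """Check byte identity first; only run later gates if it passed."""
    identical = gold == rebuilt
    gates = {"byte_identity": "pass" if identical else "fail"}
    for name, check in hardware_gates:
        if not identical:
            gates[name] = "skipped"  # no flash cycle is spent on a bad image
        else:
            gates[name] = "pass" if check() else "fail"
    return {
        "gold_sha256": hashlib.sha256(gold).hexdigest(),
        "rebuilt_sha256": hashlib.sha256(rebuilt).hexdigest(),
        "gates": gates,
        "exit_code": 0 if all(v == "pass" for v in gates.values()) else 1,
    }


flash_attempts = []


def fake_flash() -> bool:
    """Stand-in for a real flash step; just records that it was called."""
    flash_attempts.append(1)
    return True


ok = run_gates(b"\x01\x02", b"\x01\x02", [("flash", fake_flash)])
bad = run_gates(b"\x01\x02", b"\x01\x03", [("flash", fake_flash)])
```

Serialising the returned dictionary with `json.dumps` would give a machine-readable report in the spirit of `--json`, though the real report's fields may differ.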
Companion tools:
- scripts/bit_workaround/zeta_rbf_diff.py A.rbf B.rbf — region-aware diff that splits the 368 011 B RBF into preamble / header-data / header-crc / fabric-data / fabric-crc / postamble and reports per-region byte/bit differences plus a frame histogram. Avoids the "any data bit flipped → CRC chain churn → raw cmp unreadable" failure mode.
- scripts/bit_workaround/zeta_selftest.py — sub-second CI-style smoke test of all three no-hardware gates against the HW-validated two_lab.rbf gold (1710-bit invariant). Suitable as a pre-commit hook. Exit 0 iff the full ζ → fasm2rbf → byte-identity chain is green.
- scripts/bit_workaround/zeta_regression.py — iterates the pinned corpus tests/zeta_corpus/manifest.json (each entry anchored by SHA256 + per-region cell counts) and asserts every fixture still round-trips byte-identical AND that its region footprint still matches the anchor. Catches drift the selftest cannot — e.g. a ζ change that breaks only single-LAB fixtures. --reanchor updates entries marked TBD; --reanchor-all accepts current values as new anchors (use only for intentional changes).
- scripts/bit_workaround/zeta_manifest_diff.py A.manifest.json B.manifest.json — diffs two pipeline manifests without touching the RBFs. zeta_pipeline.py writes a sidecar manifest alongside every rebuilt RBF (SHA256 for gold/rebuilt/base, region cell counts, git HEAD, timestamp, gates), so bootloader v1 vs v2 comparisons become diff of two small JSON files instead of re-scanning bitstreams. Also flags the high-severity case (rebuilt differs but gold is identical → ζ or fasm2rbf has drifted).
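The manifest comparison, including the high-severity case where the gold hash is unchanged but the rebuilt hash is not, might look roughly like the sketch below. The key names (`gold_sha256`, `rebuilt_sha256`, `regions`), the verdict strings and the sample values are assumptions made for illustration; the actual manifest schema is not shown in this document.

```python
def classify(a: dict, b: dict) -> str:
    """Compare two sidecar manifests by hash only, without touching any RBF."""
    gold_same = a["gold_sha256"] == b["gold_sha256"]
    rebuilt_same = a["rebuilt_sha256"] == b["rebuilt_sha256"]
    if gold_same and rebuilt_same:
        return "identical"
    if gold_same:
        # Same input, different output: the conversion chain itself changed.
        return "HIGH: codec drift"
    if rebuilt_same:
        return "inconsistent"
    return "design changed"


def changed_regions(a: dict, b: dict) -> dict:
    """Return {region: (count_in_a, count_in_b)} for regions whose cell count differs."""
    names = sorted(set(a["regions"]) | set(b["regions"]))
    return {
        name: (a["regions"].get(name), b["regions"].get(name))
        for name in names
        if a["regions"].get(name) != b["regions"].get(name)
    }


# Placeholder hashes and counts, purely illustrative.
v1 = {"gold_sha256": "aaa", "rebuilt_sha256": "aaa", "regions": {"fabric-data": 1710}}
v2 = {"gold_sha256": "aaa", "rebuilt_sha256": "bbb", "regions": {"fabric-data": 1712}}

verdict = classify(v1, v2)
delta = changed_regions(v1, v2)
```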
Pre-commit hook (opt-in)
.githooks/pre-commit runs zeta_selftest.py before every commit.
Per-clone opt-in:
git config core.hooksPath .githooks
# skip once: ZETA_SKIP=1 git commit ...
The hook is defensive about gitignored fixtures: if the two-LAB gold is missing locally, it skips with a rebuild hint instead of blocking.
--rebuild-check (Quartus determinism gate)
ζ assumes "same Verilog → same gold RBF". When ζ is handed a .qpf,
--rebuild-check re-runs quartus_map/fit/asm/cpf once more and
byte-compares the two RBFs. This is the only cheap way to detect Quartus
non-determinism before it silently breaks downstream byte-identity
assumptions:
python3 scripts/bit_workaround/zeta_pipeline.py path/to/design.qpf \
    --rebuild-check
When to use which
| Design size / routing | Native path | ζ escape hatch |
|---|---|---|
| Small (≤ 50 LE), single LAB | ✅ primary | (redundant) |
| Medium (50–500 LE), cross-LAB | ✅ if sig-cache covers routes | ✅ fallback |
| Dense (> 1000 LE) / NEORV32-class | ❌ chipdb routing model blocks | ✅ primary |
| Carry chains, M9K, clock pins | ✅ HW-validated primitives | ✅ works by construction |
The native path is still the frontier — the chipdb routing model is the sole remaining blocker for Verilog-to-silicon without Quartus. ζ closes the practical gap in the meantime.
Current Progress and Next Steps
Completed ✓
- Phase 1: Automated fuzzing pipeline
- Phase 2.1: LUT truth-table XOR-linear encoding model (16 bits × 376 positions = 100%)
- Phase 2.2: Complete CRAM address model (X/Y/N 3D formula, 376/376 verified)
- Phase 2.3: DFF configuration bit mapping
- Phase 2.4: Arithmetic mode bit mapping
- Phase 2.5: LUT TT codec (read/write verified, 10 masks bit-identical)
- Phase 3.1: C4 I=0 switch address model (63 wires, 0 false predictions, universal formula across 22 columns)
- Phase 3.2: R4 switch address model framework (slot/group formula + PREV column location)
- Phase 3.3: R4 slot=1 offset correction (bp = 6-group, 0%→78% fix)
- Phase 3.4: R4 I-index mapping — 18/37 mapped (13 via R4_BASE_PREV slot/group formula: I=0,1,2,4,7,10,14,15,17,18,20,22,25; +5 via per-(X,I) corpus mining: I=3,11,12,13,16)
- Phase 3.5: LOCAL_INTERCONNECT switch modeling (70% cross-validation, 22 columns, 4 pair activation patterns)
- Phase 3.6: Routing codec (RouteCodec read/write methods: C4/R4/LOCAL_INTERCONNECT)
- Phase 3.7: R24 I=0 fixed-byte model (~66% pair-diff accuracy, 73% of R24 wires)
- Phase 3.8: C4 I≠0 per-(X,I) fixed-byte lookup (24 entries, 11 I-indices)
- Phase 3.9: RouteCodec round-trip self-consistency (0 dropped, 0 hallucinated on column + row)
- Phase 3.10: LOCAL_INTERCONNECT base-granularity read API (one entry per (pair, base) cell)
- Phase 3.11: LI encoding modes resolved — paired vs alternating, uniform 9-cell envelope
- Phase 3.12: Hardware safety guard V2 with signature recognition (validate_safe_for_hardware)
- Phase 3.13: End-to-end hardware verification on AX301 (codec → flash → expected logic)
- Phase 3.14: Route synth island hopping — 15 green-zone source LABs (4,4), (10,4), (10,10), (10,14), (13,10), (16,4), (16,8), (16,14), (19,14), (22,12), (22,16), (25,6), (28,10), (28,18), (31,12) — 686/686 routes bit-perfect against Quartus, fingerprint drift = 0
- Phase 3.15: EP4CE6 RBF CRC fully reverse-engineered (CRC-16/IBM, poly 0x8005, init 0xFE54, reflected, per 210-byte frame, frames 25..1751). Patcher integrated into codec; 1727/1727 CRAM frames verified
- Phase 3.16: Hardware loopback closed — RouteCodec + LutCodec output flashes successfully on real EP4CE6 silicon after CRC patch (no more EPCS fallback)
- Phase 3.17: AX301 pin map silicon-verified via pin_probe.py (KEY1=E15, KEY2=E16, KEY3=M16, KEY4=M15, LED0=G15)
- Phase 3.18a: EP4CE6 ≡ EP4CE10 confirmed same physical die — byte-identical RBF (incl. device ID); enables CE10 as "jailbroken Quartus" for fuzzing CE6's restricted regions (fuzz/cross_device_diff.py)
- Phase 3.18b: Full jailbreak — CE6 fabric whitelist falsified — 2,480 hidden LEs hardware-verified alive via 3-phase XOR-chain dead-cell scanner (jailbreak/scanC_gen.py); 6 new LAB columns (X=5,9,14,30,32,33), Y=15 row unlocked; effective fabric 392→520+ LABs, 6,272→10,320 LEs (+65%)
- Phase 3.18: Functional 4-input LUT demo on hardware — LED0 = (K1∧K2)∨(K3∧K4) written via LutCodec, full truth table validated by physical key presses
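For readers who want to experiment with the frame CRC from Phase 3.15, here is a bitwise CRC-16 sketch using polynomial 0x8005 in reflected (LSB-first) form, i.e. the constant 0xA001. Several details are assumptions rather than facts from this project: whether the init value 0xFE54 is expressed in this register orientation, whether the CRC bytes sit inside or after each 210-byte frame, and the `base` offset at which frame 0 starts. The helper names are hypothetical.

```python
FRAME_BYTES = 210
FIRST_FRAME, LAST_FRAME = 25, 1751  # inclusive range quoted in Phase 3.15


def crc16_reflected(data: bytes, init: int = 0x0000) -> int:
    """Bitwise CRC-16, polynomial 0x8005 processed LSB-first (constant 0xA001)."""
    crc = init & 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def frame_slices(image: bytes, base: int = 0):
    """Yield (index, bytes) per frame; frame k is ASSUMED to start at base + k * 210."""
    for k in range(FIRST_FRAME, LAST_FRAME + 1):
        start = base + k * FRAME_BYTES
        yield k, image[start:start + FRAME_BYTES]


image = bytes(368011)  # all-zero stand-in with the RBF size quoted in this README
frames = list(frame_slices(image))
first_index, first_chunk = frames[0]
first_crc = crc16_reflected(first_chunk, init=0xFE54)  # init orientation is an assumption
```

With `init=0` this routine is the standard CRC-16/ARC, which gives a convenient sanity check on the bit-level loop before trying it against real frames.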
In Progress
- [~] Phase 3.19: Map remaining R4 I-indices — 24 of 37 mapped (I=6 removed 2026-04-08 after Option-1 fingerprint recheck; later re-audit shows I=6 and I=8 are non-LAB CRAM needing a different column model). Same-day fuzz/r4_remine.py analytic re-audit (942-route STA corpus vs route_cells.json absolute cells, zero differential bias) reversed the earlier per_route_delta audit: I=0/1/2/4/7/10/13/15/16/17/18/20 hit 60-97%, table is healthy for most LAB-CRAM entries. Confirmed bad entries: I=12 (29%), I=14 ((3191,3191) broken), I=6/I=8 (non-LAB CRAM). The formula is unused by route_synth because the signature backend short-circuits before it runs, not because it's broken. 13 remain unmapped (5,9,24,28,29,30,31,32,33,104,116,125) — STA-corpus blocked
- Phase 3.20: M9K/DSP boundary column fix (X=13/26 large columns need sub-region address model)
- Phase 3.21: C16 long-distance wire modeling (not yet started)
- Phase 3.22: LI mode-selection rule — CLOSED NEGATIVE. T9 + T10 orthogonal-grid corpus (12 sources, 374 compiles, 414 mappable rows, fuzz/li_mode_grid_mine.py + li_mode_analyze.py + li_mode_tree.py). Clean rules: dy∈{2,3,21}→edge_even_b0 (100%), adx==0→paired (79%), dx>30∧dy>7.5→paired. Middle leaf dy>3∧dx≤24.5∧adx>0.5 (n=247, 60% of corpus) stuck at 52% coin flip — unchanged by 2× corpus growth and sx/dx decorrelation. Conclusion: paired vs alternating is not a function of the static routing key; likely driven by Quartus placement seed / LI channel occupancy. Further corpus expansion will not help. Yellow-zone fallback keeps paired as a weak prior (both modes are hardware-safe).
- Phase 3.23: C4 I≠0 fog-of-war sweep — fuzz/c4_inz_sweep.py mined 19 new (X,I) mappings from existing routing_paths corpus, taking _C4_FIXED_OFFSETS from 25 → 44 mappings. Green-zone regression still 58/58 bit-perfect.
- Phase 3.24: Non-LAB column identity resolved — jailbreak/probe_blocks.v (12× altsyncram + 8× lpm_mult, virtual-pinned). Quartus placed blocks at M9K_X15_Y*, M9K_X27_Y*, DSPMULT_X20_Y*. So of the 3 true non-LAB columns (post-jailbreak): X=15 and X=27 are M9K RAM columns; X=20 is the embedded 9×9 multiplier column. PLLs live at the die periphery, not in any X column.
- Phase 3.25: Jailbreak fabric CLOSED on silicon (2026-04-07) — both axes silicon-validated end-to-end through the codec. X=32 column: LCCOMB_X32_Y10_N0 mask 0x8888 ran on AX301; codec calibrated, COLUMN_BASE extended to all 28 LAB columns at standard 7350-byte stride. Y=15 ghost row: LCCOMB_X10_Y15_N0 mask 0x0357 = (K1∧K2)∨(K3∧K4) ran on AX301 (fuzz/demo_y15_keys2led.py). +65% fabric is production-ready on real CE6 silicon
- Phase 3.26: Route-synth green zones expanded 3 → 15 source LABs (fuzz/fingerprint_raw_mine.py codec-blind XOR mining, header filter); 686/686 routes bit-perfect. results/r4_iindex_table.json (942 entries) silently used by route_synth.py:206 for I-index hint selection per (src,dst,port) geometry
- [~] Phase 3.27: M9K CRAM probe — superseded by Phase 5.0. Legacy m9k_probe_mine.py archived 237/299 cells; Phase 5.0 proved 76-81% were CRC byte ghosts and CRC-stripped them to 58 (identical for GLOBAL_ON and COL15_ON — no column-specific signature actually exists). Y-position model still abandoned; STA wire path also proved a dead end (see Phase 5.0)
- Phase 5.0: Non-LAB blocks (DSPMULT + M9K) — real-pin re-mine, 2026-04-08. LOC syntax cracked (fuzz/{mult,m9k}_loc_discover.py): hierarchical MegaFunction node path, not coordinate alias — DSPMULT uses lpm_mult:u|mult_qpl:auto_generated|mac_mult1, M9K uses altsyncram:u|altsyncram_3ov:auto_generated|ALTSYNCRAM, 42 legal DSPMULT sites (Y1..21 × N{0,1}) + 126 legal M9K sites (X∈{15,27}). Two new rules established:
  - Never mine non-LAB blocks with VIRTUAL_PIN — the first-pass 62-cell MULT_GLOBAL_ON under VIRTUAL_PIN was a Quartus ghost-routing hallucination with 0 overlap vs a real-pin recompile. Retagged MULT_VIRTUAL_PIN_ARTIFACT (fuzz/mult_noise_test.py, memory feedback_virtual_pin_mining_is_fiction.md).
  - Non-LAB mining must filter CRAM-only (off≥5282) — 5-seed null-hypothesis test (fuzz/mult_header_noise.py) on identical lpm_mult designs proved bytes 44 and 73 have a 4-5 bit per-SEED noise floor touched by every non-LAB block and by FF arst/ena. Any "header-band finding" without CRAM-only filter is fiction. Killed the earlier SIGNED_CORE = 4 cells result from fuzz/mult_param_sweep.py and puts the FF layer-1 byte 44/73 claim under re-audit (memory feedback_header_band_noise_floor.md).
  - Results (CRAM-only, CRC-stripped): MULT_GLOBAL_ON_REAL = 29 cells universally toggled by any DSPMULT (42-site sweep); M9K_GLOBAL_ON = 58 cells CRC-stripped, cross-validated 55/58 against the fresh 126-site real-pin sweep; MULT ∩ M9K = 0 (per-block disjoint); X=15 and X=27 universals are byte-identical (confirming no column-specific M9K signature).
  - Two non-LAB config bands discovered: (1) block enable/mode band, frames 1692-1738 hosts mult 29 + M9K 58 in disjoint positions; (2) block clock-net band, frames ~1007-1013 — fuzz/mult_reg_sweep.py (lpm_pipeline ∈ {1,2,3}) isolated the first single-bit semantic field DSPMULT_CLOCK_ENABLE = (209891, bp 4), and M9K's 4 per-site clock bits sit 1-3 bytes away from the mult clock bits in frames 1010/1013 at the same bp=4 — slot-reserved per-block clock register.
  - Dead ends: STA wire extraction (fuzz/mult_sta_wires.py) returns only chip-edge IOBUF — Quartus treats DSPMULT as a black-box cell; width/signed/pipeline decoding beyond CLOCK_ENABLE is buried in header noise; altpll has no X/Y LOC (off-fabric, needs PLL_1/PLL_2 singleton names), deferred
- Phase 4: FASM toolchain CLOSED on silicon (2026-04-08) — fuzz/fasm2rbf.py + fuzz/rbf2fasm.py implement a minimal FASM dialect (LUT, ROUTE, BIT, SRC) driving LutCodec + RouteCodec + patch_rbf_crc. Signature backend (fuzz/route_signatures.py, 1725 route cell-sets) short-circuits synth_route for yellow-zone and Y=15 jailbreak sources. Port-MUX consolidated loader (2026-04-08): every (src,dst,dn) group resolves into a shared common preamble + per-port delta of exactly 4 cells (2 adjacent byte pairs at 840-byte LI-pair×4 spacing); 225/225 full 4-port groups match a "3+1" equivalence class with datab always the odd port. route_signatures.load_cells() now prefers results/route_cells_consolidated.json (34% file / 37% cell savings) with invariant common ∪ port_delta[p] == route_cells[key+",p"] self-tested 1725/1725. Set-cover decomposer (fuzz/route_decompose.py) collapses multi-route + cross-source CRAM diffs into clean directives. Regression suite: 1725/1725 single-route, 41/42 multi-route (1 pre-existing), 3/3 cross-source, 15/15 green-zone islands (686/686) — all bit-perfect. Hardware closure: X10Y10N0.LUT = 0x8888 (AND(K1,K2)) one-liner flashed to AX301 via fasm2rbf, silicon behavior matched
- Phase 4.5: Plan D' sig-cache — NEORV32 coverage (2026-04-09) — 12-worker parallel factory (fuzz/plan_d_prime_factory.py) compiled 11,715 placement-forced 2-LUT pairs from NEORV32 STA edges. 7-tuple sig-cache results/route_cells_full.json = 13,487 entries (legacy 1725 lifted to sn=0 + 11,762 factory). Coverage: 95.9% of NEORV32 edges (100% of placeable). Hero test X=5 jailbreak column FASM → AX301 silicon-accepted.
- Phase 5.0: Non-LAB blocks (DSPMULT + M9K) — real-pin re-mine (2026-04-08) — see In Progress section above for full detail
- Phase 5.2: M9K init content codec — Stage A+B CLOSED (2026-04-09) — 3-band partition (data/mode/clock); 2D linear formula byte(w,bit) = anchor + (w//2)*210 - (w%2) - 2*bit, bp=6; 31 NEORV32 M9K sites calibrated (M9K_INIT_ANCHORS = 33 entries); LOC fix (use instance name -to "u"); READ 512/512, WRITE 0 CRAM diffs vs Quartus. fuzz/m9k_init_basis.py
- Phase 5.4: LE carry chain in open flow — HARDWARE-VERIFIED on AX301 (2026-04-13) — arith activation is a LAB-level mode switch in the block band (frames 1692-1738), not a per-LE cell region; position-independent across LABs (v4 universal blob, arith_blockband_v4.json); per-width table covers widths 2..16 single-LAB + 16+8 cross-LAB (arith_blockband_by_width.json, all entries round-trip zero-diff vs Quartus). Four pieces landed: chipdb cout→cin pips (8,126), CE6_CARRY techmap primitive (LE-internal FF→ALU feedback, no Route-A buffer), np2fasm LUT_ARITH emission, fasm2rbf LUT_ARITH directive. Identity + 8×LUT_ARITH=0x0000 blinks on AX301 bit-identically to Quartus counter RBF; identity Q<=Q negative control stays dark. DFF confirmed silicon-default (no per-LE CRAM enable cell — FASM DFF is now a parsed no-op). Pedagogical narrative at "Phase 5.4 follow-up" section above.
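The Phase 5.2 address formula byte(w,bit) = anchor + (w//2)*210 - (w%2) - 2*bit can be exercised with a short Python sketch. Two readings are assumptions on our part: that "bp=6" means bit position 6 inside the addressed byte, and that each word is 9 bits wide (as in the 9×512 configuration mentioned elsewhere in this README). The anchor value and the function names are hypothetical; real anchors come from M9K_INIT_ANCHORS.

```python
PAIR_STRIDE = 210  # byte step per pair of words, taken from the formula
INIT_BP = 6        # "bp=6", read here as bit position 6 inside the byte (assumption)


def init_byte(anchor: int, word: int, bit: int) -> int:
    """Byte offset holding one init bit, per the Phase 5.2 formula."""
    return anchor + (word // 2) * PAIR_STRIDE - (word % 2) - 2 * bit


def write_word(image: bytearray, anchor: int, word: int, value: int, width: int = 9) -> None:
    """Store a width-bit value, one image bit per addressed byte."""
    for bit in range(width):
        off = init_byte(anchor, word, bit)
        if (value >> bit) & 1:
            image[off] |= 1 << INIT_BP
        else:
            image[off] &= ~(1 << INIT_BP) & 0xFF


def read_word(image: bytes, anchor: int, word: int, width: int = 9) -> int:
    """Recover a width-bit value written by write_word."""
    value = 0
    for bit in range(width):
        if (image[init_byte(anchor, word, bit)] >> INIT_BP) & 1:
            value |= 1 << bit
    return value


ANCHOR = 1000                # hypothetical anchor, not a calibrated site
image = bytearray(60000)     # toy image large enough for 512 words
pattern = [(w * 7) % 512 for w in range(512)]
for w, value in enumerate(pattern):
    write_word(image, ANCHOR, w, value)
readback = [read_word(image, ANCHOR, w) for w in range(512)]
```

Under these assumptions even and odd words interleave on alternating bytes, so no two (word, bit) pairs share a cell and the 512-word write/read round trip is lossless in the toy image.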
Future Work
- Phase 5.1: Complete routing codec coverage (target: all wire types >90%; C16 + remaining R4 I-indices still open) — distinct from the already-done Phase 5.0 non-LAB work
- Phase 5.2b: Non-LAB block parameter decoding beyond CLOCK_ENABLE and M9K INIT — need an intra-block differential probe that bypasses the header noise floor, STA opacity, and the lack of observable per-site configuration; PLL probe via PLL_1/PLL_2 singleton LOCs deferred here
- [~] Phase 5.3: Open-source toolchain — Yosys + nextpnr-generic + FASM (PARTIALLY OPEN, arithmetic designs now hardware-verified via Phase 5.4). Target: replace Quartus with Verilog → Yosys → nextpnr-generic → np2fasm → fasm2rbf → openFPGALoader. Current state:
  - fuzz/chipdb_gen.py: generates nextpnr-generic Python chipdb (8,241 bels, 59,611 wires, 1.38M pips) with GCLK broadcast, intra-LAB direct pips, 4-level pip cost hierarchy (SIG=1 < INTRA=2 < LOCAL=5 < HOP=20), plus 8,126 cout→cin direct pips for carry chain (Phase 5.4).
  - synth/ep4ce6_map.v + synth/prims.v + synth/synth_ep4ce6.ys: Yosys techmap chain (LUT4 + DFF + CE6_CARRY for $alu). Run via synth/synth_ep4ce6.sh — the wrapper envsubst's $HOME/$NEORV32_ROOT so the VHDL paths travel.
  - synth/np2fasm.py: extracts logical connectivity from nextpnr routed JSON, looks up sig-cache for FASM ROUTE directives, walks the carry chain and emits LUT_ARITH
  - fuzz/fasm2rbf.py directives that work end-to-end: LUT, ROUTE (6/7-tuple), GCLK, DFF (parsed no-op — FF is silicon default), BIT, SRC, LUT_ARITH, M9K.INIT_{w}x{d} (for 33 calibrated 9x512 anchors; FASM round-trip verified in fuzz/test_m9k_init_directive.py, 5/5). CRC patcher integrated. np2fasm M9K emission is a stub (_emit_m9k_init + xfail test fuzz/test_np2fasm_m9k.py) — Yosys $__M9K_SP_ techmap rule drafted in synth/ep4ce6_map.v behind M9K_TECHMAP ifdef, chipdb M9K wire pips still TODO.
  - M5 counter — 8-bit counter now hardware-verified via the open flow (2026-04-13). The FASM path (identity baseline + 8× LUT_ARITH = 0x0000) blinks on AX301 with bit-identical behavior to Quartus's own compile. Widths 2..16 single-LAB and the 16+8 cross-LAB case are byte-identical to Quartus output under diff; hardware re-verification pending. See Phase 5.4 follow-up section above for the climb.
  - Real fixes earned chasing M5 (still useful for future multi-LE-per-LAB designs): LutCodec high-density LAB workaround (predict_sram(0xFFFF) filters LAB-shared cells); sig-cache mining template pitfall documented (must use gen_two_luts_single_input_clocked from verilog_gen.py); 160 cleanly re-mined (4,18)/(4,19) inter-LE pair entries added to route_cells_full.json; per-LAB CLK ordering fix (must run after the LUT phase reset); post-bitgen LI cleanup for sig-cache infrastructure leakage. Working multi-LE-per-LAB build template at tmp/m5_counter/build_counter_sigcache.py.
  - IOB FASM cell map landed (2026-04-14): IOB_IN PIN_X / IOB_OUT PIN_X directives reproduce all 44 single-axis ground-truth RBFs bit-perfect via XOR delta from iob_in_E15.rbf baseline. np2fasm emits one directive per placed GENERIC_IOB. Cross-axis pin combos still leak ~57 joint-placement bytes (needs 2D K×LED sweep).
  - GCLK pipeline landed + HW verified (2026-04-14): GCLK_PIN + LAB_CLK_SEL + LAB_CLK_SEL_LE FASM directives compose as XOR-delta on an AUTO-mode baseline; source encoding is per-pin (E1=3 cells, R8=5, N1=38; zero cross-pin overlap within the legacy E1/R8/N1 triad). 26 (LAB, N) combinations round-trip bit-perfect; fasm2rbf 11/11 + np2fasm 7/7 tests. Retiring the legacy nv_zero_global.rbf baseline is pending the next HW flash pass.
  - GCLK + IOB_CLK_INPUT extended to all F17 dedicated clock pins (2026-04-15): GCLK_PIN and IOB_CLK_INPUT now cover 12 clock pins each — E1, R8, N1 plus 9 newly mined dedicated clock pins (M1, M2, T4, R4, M16, M15, E15, A14, B14). Two of the 13 dedicated F17 pins are unfittable: PIN_E2 (the LVDSCLK_00P side of the E1 diff pair — Quartus refuses placement) and PIN_H1 (reserved as ALTERA_DCLK JTAG config). Generalised mining tools: scripts/iob_slice_mining/compute_clk_pin_hdr.py --build --pin {PIN} (IOB_CLK_INPUT, parallelisable, ~16 s/pin) and fuzz/clk_pin_autoforce_probe.py --pin {PIN} (GCLK_PIN, 6 builds × ~16 s/pin). Tests: fuzz/test_iob_baseline_nv_directive.py 15/15 (per-pin loader + per-pin gold-RBF round-trip across all 12 mined pins) + fuzz/test_gclk_pin_directive.py 11/11 (per-pin loader sanity). Notable: same-bank dedicated clock pins (E15/M15/M16; A14/B14) share spine cells (12-22 cells overlap), unlike the disjoint legacy triad — XOR semantics still compose cleanly under double-emit, but a multi-GCLK_PIN design will see partial cancellation rather than a clean union.
  - IOB→SLICE route mining, HW-verified template (2026-04-14): scripts/iob_slice_mining/ — paired two-LE mining template (template_pairs.py + mine_iob_routes.py) produces diff-able pair-vs-zero deltas (~200 cells/entry) AND functional silicon (paired RBF iob_pair_E16_10_4_0_dataa.rbf flashed on AX301 drives KEY2→LED0 correctly). 3-layer decomposition (decompose_deltas.py) splits every raw delta into universal_infra (98 cells) ∪ pin_footprint(pin) ∪ pure_common(target) ∪ ≤2-cell residual, verified across 15 entries. Port MUX is Quartus-canonicalized (all 4 ports → byte-identical delta). Sig-cache injection pending (pure_common is relative to iob_zero, not nv_zero_global).
  - IOB_ROUTE FASM directive + frame-split bridge + single_le sweep (2026-04-15): IOB_ROUTE PIN_X -> XaYbNc.port wired into fasm2rbf (8/8 tests); CRAM band bit-perfect vs the HW-verified pair RBF via absolute_cells. IOB_BASELINE_NV (132-bit-cell / 74-byte hdr bridge from nv_zero_global to iob_in_E15) and IOB_CLK_INPUT PIN_{E1,R8,N1} (40 / 64 / 70-cell clock-bank pin activate, mined per-pin via scripts/iob_slice_mining/compute_clk_pin_hdr.py --build --pin {PIN}) close the frame-split so end-to-end FASM designs can build on a single nv_zero_global base. An opt-in single_le_cells section in results/iob_to_slice_sigcache.json overrides absolute_cells when present and strips pair-template secondary-LE decoration for single-LE designs — derived by solving IOB_ROUTE_primary = gold_delta ^ (all other directives) against the Quartus gold RBF. scripts/iob_slice_mining/sweep_single_le.py parallelises this derivation across every supported (pin, target) combination: 15/15 entries now landed (3 pins E16/E15/M16 × 5 targets 10,4,0 / 10,4,2 / 10,4,4 / 10,10,0 / 16,4,0), each byte-identical to Quartus gold through the full 8-directive stack. Unlocked the last six entries via two probe-infrastructure fixes: clk_lab_sel_probe.py now falls back to SRC_ALT=(22,10,0) when target_lab matches the default SRC LAB (was colliding at LCCOMB_X10_Y10_N0), and N_SLOTS now includes N=2 so LAB_CLK_SEL_LE X{x}Y{y}N2 becomes available. Tests: fuzz/test_iob_baseline_nv_directive.py 13/13 (3 clock-input pins covered: E1, R8, N1, each round-tripping bit-perfect against its own Quartus gold) + fuzz/test_iob_route_directive.py 8/8.
  - Stage 0 HW flash session (2026-04-16): 24 RBFs flashed on AX301 — 23 PASS, 1 FAIL (DSPMULT_GLOBAL_ON falsified on silicon). Key results: NV_BASELINE_PACK silicon-equivalent to nv_zero_global (Phase 7 retirement unblocked); 14/14 IOB_ROUTE pairs silicon-correct; 7 new GCLK_PIN clock pins programming-verified (M15 full-PASS with hold-KEY2 + pulse-KEY4 protocol; M1/M2/T4/R4/A14/B14 programming-clean); M9K smoke design accepted by chip (codec pipeline silicon-clean); DSPMULT 23-cell set leaks on silicon → bisection roadmap opened.
  - Stage 0 round-2 flash (2026-04-17): M9K_MODE _inferred_gold intersect PASS — np2fasm emission ungated for all w=9 sites. IOB_OE PIN_R5 FAIL (LED stuck-on) — bisected to 2 leaky cells (363236,2) + (363672,2), cleaned 38-cell set PASS, loader masks both. LUT_ARITH_MULTI_LAB WIDTH=17 FAIL (LED stuck-off) — multi-LAB carry stays gated. DSPMULT_GLOBAL_ON bisected in 4 layers (23→12→6→3→1): leaky cell = (363236, 2) at frame 1729; cleaned 22-cell set PASS.
  - LAB_CLK_SEL_LE extended to N=6/8 for all 14 LABs (2026-04-16): N_SLOTS now (0, 2, 4, 6, 8). 56 new Quartus builds. LAB(10,16) invariant tightened 53→49 (4 cells migrated to per-LE buckets). clk_lab_sel_per_le.py refactored N-agnostic. 49/49 tests green.
  - IOB_IN_BIDIR / IOB_OUT_BIDIR directives landed (2026-04-17): per_pin_input / per_pin_output cell dispatch for bidirectional IOB pads (cells UNIQUE to each pin across the 33-pin sweep, no anchor double-flip). 16 sdram_dq pin coverage. _IOB_BIDIR_FALSIFIED per-pin mask table (R5 OUT: 2 fabric-band cells stripped). np2fasm emits BIDIR variants for bidir pads automatically. Tests: 5/5 directive + 6/6 np2fasm.
  - IOB_OE FASM directive landed (2026-04-16): IOB_OE PIN_X for 16 NEORV32 sdram_dq pins. Specimen-factory mining (oe_on vs oe_off per pin, 3-seed routing-invariance probe, 0-drift across all 16). Cell counts 37..55 per pin, 21-cell universal intersection. HW bisection at R5 isolated 2 leaky cells; loader masks them. 9/9 tests. np2fasm emission not yet wired (needs Yosys $tribuf techmap).
  - Formula-based LutCodec landed (2026-04-17): LutCodec.from_cram_model(x, y, n) eliminates per-LAB SQLite calibration. Uses CRAM address model to generate synthetic minterm patterns. fasm2rbf.py bitgen falls back automatically when from_db() raises ValueError. All 65536 masks match DB-backed codec at (10,10,0). Known limitation: pair mapping is WRONG for positions other than (10,10,0) — 192/233 LUTs produce incorrect truth tables. Root cause: bit-to-cell pair ordering varies by (x,y) in ways the formula doesn't capture. This is the primary blocker for the pipeline test.
  - Sig-cache demand mining expanded to 38,683 entries (2026-04-18→19): Route mining from NEORV32 STA edges brought the 7-tuple sig-cache from 13,487 to 38,683 entries. 0 route sig-cache misses for NEORV32 v2 build. 8 IOB→SLICE misses remain (J16/M2/E16 → Y=21 targets).
  - M9K pipeline closed end-to-end (2026-04-16): Full Yosys → memory_libmap → prepack_m9k → np2fasm → fasm2rbf path produces CRC-valid RBFs. Smoke design (9×512 RAM) round-trips user data pattern correctly. Three np2fasm fixes landed (blackbox module selection, Yosys binary int parsing, x/z char handling). M9K_MODE _inferred_gold intersect emission ungated for all w=9 sites (HW-validated).
  - NEORV32 open-toolchain RBF auto-reset on flash (2026-04-18): Both v2 and v3 RBFs cause FPGA auto-reset to factory config. Root cause: chipdb LOCAL bus has only 4 tracks (~2080 wires total) vs real silicon's O(100k) routing resources. At 6500+ LEs, nearly every LOCAL wire is overused → driver conflicts → protective reset. All structural safety checks PASS; the problem is routing model capacity, not directives. Fix requires either SIG-cache-aware placement or hierarchical routing model.
  - Pipeline test E2E design (134-LE, 2026-04-18→19): 28-bit counter → LED heartbeat + UART TX "Hi!\r\n" + KEY3/KEY4 passthrough. Quartus gold PASS on silicon. Open-toolchain build: 0 route misses, but FPGA RESET on flash. Root cause: from_cram_model() pair mapping bug — 192/233 LUTs use wrong bit-to-cell mapping, corrupting LUT functions. K2→LED3 and K4→LED2 work (partial pipeline success), but F16/G15 outputs fail. F16 output routing mined differentially (40 data cells: 38 header + 2 block band), but adding them triggers reset due to cumulative LUT layer damage.
  - Carry chain disabled for NEORV32 (2026-04-17): alumacc removed from Yosys flow — 684 chain discontinuities → LUT4 arithmetic instead. Reduces to 6533 LEs (292 fewer). CE6_CARRY infrastructure retained for future arch work.
- Phase 5.4: LE carry chain in the open flow (HARDWARE-VERIFIED 2026-04-13). Arith mode activation lives in the block band (frames 1692-1738, bp=2), not in LAB CRAM columns, and is a per-LAB mode switch, not a per-LE cell. Four pieces landed: (1) chipdb_gen.py declares 8,126 cout→cin direct pips between adjacent LE bels; (2) synth/ep4ce6_map.v + synth/prims.v add the CE6_CARRY primitive so Yosys lands $alu on chained LEs with the FF's Q wired directly into CE6_CARRY.B (no external "Route-A" buffer); (3) synth/np2fasm.py walks the carry chain and emits LUT_ARITH directives; (4) fuzz/fasm2rbf.py applies the arith blob from results/arith_blockband_v4.json (universal, position-independent at any LAB) for 8-LE half-LAB chains, or from results/arith_blockband_by_width.json (widths 2..16 single-LAB + 16+8 cross-LAB) for other chain lengths. AX301 silicon-accepted: identity + 8×LUT_ARITH=0x0000 blinks bit-identically to Quartus's counter RBF; the identity Q<=Q negative control stays dark.
- Phase 6: σ⁻¹ 3-key LutCodec discovery (2026-04-21) + gap closure (2026-04-24). The long-outstanding "pair mapping wrong" bug in LutCodec.from_cram_model() was closed by adding a third discriminator axis: the previous (foff, fb8) 2-key table lookup was ambiguous across Y-groups, and adding group = (y-2)//3 as the third key resolves it. The σ⁻¹ table results/sigma_inv_fb8_groups.json grew from 1,904 to 2,112 entries on 2026-04-24 as two gaps closed: the Y=3 wrap gap (+80 entries: wrap uses addr_adj=206 and includes boundary N=12; Y≥6 slot=1 groups still use 207 and strict <) and the Group-4 × fb8∈{0,1,3,4} gap (+128 entries via alternate-X FACE probes at X=11/16/12/17; primary narrow-column reps X=3/6/4/7 didn't fit the 16-LUT template). 5-level fallback chain retained. Residual (not closable): fb8=7 × group=4 is silicon-geometry blocked. X=8 is the sole fb8=7 column and has no LAB at Y≥12 (Quartus rejects LCCOMB_X8_Y{14,16}_N* on both CE6 and CE10); those 32 positions fall back to nearest-group (group=3 fb8=7).
- Phase 6b: End-to-end HW validation on AX301 (2026-04-21 → 2026-04-22). Three designs proven silicon-functional through the full open toolchain: (1) registered AND gate (KEY2&KEY3→DFF→LED0) at LAB(16,4), 10 FASM lines incl. multi-port IOB_ROUTE, 0 fabric diffs vs Quartus gold; (2) 5-bit carry counter at LAB(16,4) N=0..8, 18 FASM lines, 0 ROUTE directives (LE-internal carry feedback); (3) two-LAB cross-LAB AND→DFF→LED with BIT-only reconstruction from Quartus gold (byte-perfect vs gold, HW-verified). This is the first cross-LAB fabric route proven on silicon via the codec path.
- Phase 6c: chipdb 26-track upgrade (2026-04-22). LOCAL bus widened from 8 to 26 synthetic tracks, total pips grew to 3.6M; the routing graph is now closer to real Cyclone IV's ~40-LI-wire-per-LAB topology. Runner drives P&R end-to-end on the upgraded chipdb. Small-design HW validation passed; dense-design (NEORV32) routing is still blocked, because the model is denser but still simpler than the real C4/R4/R24/LI switch matrices.
- Phase 7: ζ BIT-workaround, the open-toolchain escape hatch, HW-validated end-to-end on NEORV32 (2026-04-23). The scripts/bit_workaround/quartus_gold_to_bit_fasm.py + fasm2rbf.py round-trip takes any Quartus-produced RBF and rebuilds it byte-identically (emits one BIT directive per differing bit vs the nv_zero_global.rbf baseline, CRC-patched). HW-validated at NEORV32 scale: 4712 LE / 2367 DFF / 19 M9K / 51 pins → 127 728 BIT directives (2634 hdr + 113 573 fab + 11 521 crc), ζ + fasm2rbf wall time ≈ 0.5 s. The rebuilt RBF boots the NEORV32 bootloader cleanly on AX301 at 19200-8N1 UART (banner + auto-boot countdown + SPI-flash probe + CMD prompt). Linux extended test (2026-04-24): kernel + DTB + initramfs transferred via xmodem, Linux 6.6.83 ran ~150 s on RISC-V (devtmpfs mounted, ttyNEO0 console attached, exec'd /sbin/init) before a kernel-level kernel/cred.c:103 BUG_ON panic unrelated to the bitstream (RBF SHA256 matches Quartus gold). This is the first SoC-class validation of the escape hatch; users blocked by the chipdb routing model have a proven bounded workaround.
- Simple_led-class rescue + M9K_MODE width sweep + pragma channel (2026-04-24). Three follow-ups earned after the ζ escape hatch shipped:
  - Fix A (commit 8c660ef): bitgen(..., legacy_iob_route=True) restores the pre-6b6cda9 IOB_ROUTE apply-path for simple_led-class single-LE designs (pure XOR parity, no dedup, no hdr-skip). The default path stays correct for pair-derived / IOB_PAD_NV designs (two_lab, NEORV32 ζ, multi-LE). simple_led w=9/w=18 probes now rebuild byte-identical to HW-PASS references.
  - Fix B (commit af22c9f): scripts/iob_slice_mining/sweep_single_le.py re-mines against the legacy apply-path with new --orphans-only / --include-known flags; 109 single_le_cells entries (X∈{3,4,6,7,8,10,16} × Y∈{4,10,17,18,19,21}) all byte-identical to cached Quartus gold. Loader priority is now single_le_cells > single_le_cells_stale > absolute_cells. Every sigcache key is routable again.
  - M9K_MODE widths HW sweep (commit f22b884): overlay probes on the cff800e HW-PASS w=9 baseline validated (9,1024) PASS and (36,256) PASS on AX301 silicon. (4,2048) FAILs silicon (LED0 stuck on, KEY2 inert; 24-cell gi bucket rejected); gated OFF in np2fasm._M9K_MODE_HW_VALIDATED. HW-validated set is now {(9,512), (18,512), (9,1024), (36,256)}. Overlap with base is NOT the silicon-safety discriminator ((36,256) has 0 overlap and passes; (4,2048) has 0 overlap and fails).
  - np2fasm pragma channel (commit 612c520): np2fasm --legacy-iob-route / convert(legacy_iob_route=True) prepends the pragma line "# fasm2rbf: legacy_iob_route=1". fasm2rbf.parse_pragmas(text) round-trips the pragma into a kwarg dict. Callers explicitly forward to bitgen(**pragmas); there is no magic auto-override. 6/6 tests.
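The pragma channel described in the last item can be pictured with a small sketch. This is an illustrative re-implementation, not the project's fasm2rbf.parse_pragmas: the helper name parse_pragmas_sketch, the whitespace tokenisation, and the rule that "1"/"0" map to True/False are all assumptions. Only the "# fasm2rbf: legacy_iob_route=1" line format comes from the text above.

```python
PRAGMA_PREFIX = "# fasm2rbf:"


def parse_pragmas_sketch(text: str) -> dict:
    """Collect key=value pairs from '# fasm2rbf:' comment lines.

    Hypothetical helper; the real parse_pragmas may tokenise differently.
    """
    pragmas = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(PRAGMA_PREFIX):
            continue  # ordinary FASM line or unrelated comment
        for token in line[len(PRAGMA_PREFIX):].split():
            key, sep, value = token.partition("=")
            if not sep:
                continue  # ignore tokens without '='
            # Assumption: "1"/"0" encode booleans; anything else stays a string.
            pragmas[key] = {"1": True, "0": False}.get(value, value)
    return pragmas


# PLACEHOLDER_DIRECTIVE stands in for real FASM content.
fasm_text = "# fasm2rbf: legacy_iob_route=1\n# ordinary comment\nPLACEHOLDER_DIRECTIVE\n"
pragmas = parse_pragmas_sketch(fasm_text)
print(pragmas)
```

A caller would then pass the dict on explicitly, for example as bitgen(..., **pragmas), which matches the no-auto-override rule stated above.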
Long-term direction: what this enables, and what it won't
A common question: with the codec working, can modern ML (RL routing, GNN congestion prediction) outperform Quartus? The honest answer has three layers.
PPA is out of reach. Quartus has a 30-year-old hardware-calibrated timing model, a complete legality checker, and routing algorithms (PathFinder + negotiated congestion) that have proven hard to beat on industry benchmarks — whether academic RL routers can close the gap remains an open research problem. Trying to out-route Quartus on its home turf is a known dead end.
What the codec does uniquely enable is bit-level bidirectional
modification of a shipped bitstream — microseconds to mutate, seconds to
validate on silicon. Quartus is a one-way verilog → bitstream pipeline;
we are not. That gap enables:
- Bitstream-level mutation and equivalence framework. Take a Quartus build, apply cell-level equivalent transforms (LUT-mask rewrites, redundant routing-bit removal), verify equivalence on hardware, keep mutations that reduce cell count or power. Expected PPA wins from cell-level peepholing are small (Quartus output is already near-locally optimal); the real value is as a research substrate for post-fit optimization and differential equivalence testing that Quartus cannot expose.
- Workflows Quartus does not expose. Offline bitstream mutation and replay: modify specific frames in a known-good RBF and re-flash on next power cycle. This is not partial reconfiguration (Cyclone IV lacks ICAP), but it enables things Quartus's single-shot flow rules out — applying ECO patches without re-running fit, reproducible bit-identical builds (Quartus is seed-dependent; the codec is a pure function), and bitstream watermarking in don't-care LUT bits.
- Open toolchain (the actual prize). A working Yosys + nextpnr-EP4CE6 flow matters an order of magnitude more than any PPA play. It is the first time Linux/macOS users can target this chip without Intel's tools, the first time CI can build EP4CE6 bitstreams reproducibly, and the first time the Cyclone IV E family enters the open-source FPGA ecosystem (Project Mistral brought Cyclone V partway there before us).
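As one concrete instance of a cell-level equivalent transform, a LUT-mask rewrite under input permutation can be sketched as follows. The sketch assumes a plain 16-bit LUT4 truth table in which bit i holds the output for input value i (input k is bit k of i). The codec's actual mask bit order and the function name permute_lut4_mask are assumptions, not taken from the project.

```python
def permute_lut4_mask(mask: int, perm: tuple) -> int:
    """Rewrite a LUT4 mask so that new input k feeds what was input perm[k].

    Assumes bit i of `mask` is the LUT output for input value i.
    """
    if sorted(perm) != [0, 1, 2, 3]:
        raise ValueError("perm must be a permutation of 0..3")
    new_mask = 0
    for j in range(16):          # j enumerates the new input combinations
        i = 0
        for k in range(4):       # map each set new-input bit to its old position
            if (j >> k) & 1:
                i |= 1 << perm[k]
        if (mask >> i) & 1:
            new_mask |= 1 << j
    return new_mask


and01 = 0x8888                                  # f = in0 AND in1
swapped = permute_lut4_mask(and01, (2, 1, 0, 3))  # exchange inputs 0 and 2
print(hex(swapped))                             # 0xc0c0, i.e. in2 AND in1
```

The rewritten mask is only equivalent in circuit if the routing into the LUT inputs is permuted the same way, which is why the workflow above still ends with verification on hardware.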
Where ML fits. A modest supporting role: a decision-tree classifier to replace hand-coded LI envelope rules once the corpus is big enough; a small-tree pattern miner (not GNNs) for the paired-vs-alternating selection rule so the result compiles directly into the codec; an anomaly detector for codec-built RBFs that fail to flash. None of this is "ML beats Quartus" — it is "ML helps write rules we do not want to hand-derive."
Priority. Finish Phase 5.3. The .v → bitstream open flow is already
most of the way there (chipdb + techmap + np2fasm working, counter
routing, 8-bit counter HW-verified). Once it runs end-to-end, the question
shifts from "can we beat Quartus on PPA" to "what can we do that Quartus
won't do at all" — and the codec is what answers that.
Overall Progress Estimate
Percentages across different domains are not comparable (denominators differ wildly — bits, cell types, route count, design size). This table reports coverage (what's concretely counted) and status (HW-verified / round-trip-clean / partial / not started) rather than a single headline number.
| Domain | Coverage | Status |
|---|---|---|
| CRAM address mapping | 22 cols × 18 rows × 16 LEs = 376/376 (CE6 whitelist) + X=32/33 and Y=15 post-jailbreak | HW-verified |
| RBF CRC | Spec fully derived (CRC-16/IBM, 0x8005, init 0xFE54, frames 25..1751); 1727/1727 verified | HW-verified |
| Logic configuration (LUT/FF/arithmetic) | LUT TT decoded at all LE positions; FF is silicon-default (no CRAM); arith mode = block-band blob | HW-verified |
| LE carry chain in open flow | chipdb cout→cin pips (8,126), CE6_CARRY techmap, LUT_ARITH FASM directive, per-width table (2..16 + 16+8 cross-LAB), v4 position-independent blob | HW-verified (8-bit counter, 2026-04-13) |
| FASM toolchain (Phase 4) | fasm2rbf + rbf2fasm + set-cover decomposer; 1725/1725 + 41/42 + 3/3 + CE6 686/686 round-trip | HW-verified (AND(K1,K2) on AX301) |
| Hardware loopback (codec → flash → silicon) | LutCodec + FASM path both running on AX301 | HW-verified |
| C4 routing switches | I=0 closed-form formula; I≠0 covered by 44-entry per-(X,I) lookup + sig-cache | Closed-form partial, sig-cache production |
| LOCAL_INTERCONNECT | Base-granular read/write; two encoding modes resolved; V2 safety guard | Round-trip clean |
| R4 routing switches | 25/37 I-indices mapped; remaining 12 blocked on corpus, not method | Partial |
| R24 long-distance wires | I=0 fixed-byte model, ~73% of wires | Partial |
| C16 long-distance wires | — | Not started |
| Bitstream codec | LUT TT + routing read/write; round-trip self-consistent; HW safety V2; CRC patcher integrated | HW-verified |
| Route synthesis (green islands) | CE6 standard 15 islands 686/686 bit-perfect; jailbreak/edge 9 islands 45/45 via snapshot fallback. Total harness 731/731 | Closed (2026-04-14) |
| FASM sig-cache (Phase 4.5) | 38,683 entries (expanded 2026-04-19); 7-tuple (sn>0 supported); 0 route misses for NEORV32 v2 | Production |
| M9K init codec (Phase 5.2) | 2D linear formula; 33+5 anchor entries (incl. 18×512); M9K pipeline closed end-to-end (Yosys→prepack→np2fasm→fasm2rbf) | HW-validated (chip accepts open-toolchain M9K RBF, 2026-04-16) |
| M9K_MODE (Phase 5.2) | _inferred_goldintersect site-invariant buckets per (w,d); np2fasm emission gated on _M9K_MODE_HW_VALIDATED = {(9,512),(18,512),(9,1024),(36,256)}; (4,2048) FAILs silicon (gated off) | HW-validated 4/5 widths (2026-04-17, 2026-04-24) |
| GCLK pipeline (Phase 5.4) | GCLK_PIN (12 pins on F17) + LAB_CLK_SEL + LAB_CLK_SEL_LE N∈{0,2,4,6,8}; XOR-composed on AUTO baseline | HW-verified (14 LABs × 5 N-slots; Stage 0 flash 2026-04-16) |
| IOB FASM (Phase 5.4) | IOB_IN/IOB_OUT 44/44; IOB_IN_BIDIR/IOB_OUT_BIDIR 16 sdram_dq pins; IOB_ROUTE two apply-paths (default pair-derived + Fix A legacy_iob_route=True for single-LE designs); single_le_cells 109 entries Fix-B re-mined against legacy path; IOB_OE 16 pins | IOB_ROUTE HW-verified both paths; BIDIR/OE codec-verified + bisected on silicon |
| DSPMULT (Phase 5.0) | 22-cell silicon-clean set (23 mined − 1 falsified via bisection at frame 1729) | HW-bisected; np2fasm not wired (0 DSPMULTs in NEORV32) |
| nv_zero_global retirement | NV_BASELINE_PACK directive + sub-directives reproduce the Quartus baseline byte-exact from PURE_ZERO | HW silicon-equivalent confirmed (Stage 0 flash 2026-04-16) |
| Formula-based LutCodec (σ⁻¹ 3-key) | from_cram_model(x, y, n) with 3-key σ⁻¹ table ((foff, fb8, group)), 2,112 entries, 5-level fallback; Y=3 wrap + Group-4 fb8∈{0,1,3,4} gaps closed 2026-04-24 | Production (2026-04-24); residual silicon-blocked fb8=7 × group=4 (X=8 has no LAB at Y≥12) |
| Open-source toolchain — native path (Phase 5.3) | Yosys → nextpnr-generic (chipdb 26 LOCAL tracks, 3.6M pips) → np2fasm → fasm2rbf. AND gate + 5-bit carry counter + M9K smoke HW-validated at single/cross-LAB scale | HW-verified for small/medium; chipdb routing model still too sparse for NEORV32-class density |
| ζ BIT-workaround, escape hatch (Phase 7) | scripts/bit_workaround/quartus_gold_to_bit_fasm.py + fasm2rbf rebuilds any Quartus RBF byte-identically. 127k BIT directives for NEORV32; 0.5 s wall time | HW-validated end-to-end on NEORV32 bootloader (4712 LE / 19 M9K) on AX301, 2026-04-23 |
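
The RBF CRC row gives the polynomial (0x8005) and initial value (0xFE54), but this section does not spell out bit order, reflection, or how frames 25..1751 are fed in. The sketch below is therefore only a generic bitwise CRC-16 over polynomial 0x8005 with those choices exposed as parameters. The names crc16_8005 and reflect are illustrative, and the project's actual CRC patcher may differ.

```python
def _reflect(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of value."""
    out = 0
    for _ in range(width):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out


def crc16_8005(data: bytes, init: int = 0x0000, reflect: bool = False) -> int:
    """Bitwise CRC-16, polynomial 0x8005, no final XOR.

    reflect=True reverses each input byte and the final register
    (the CRC-16/ARC convention); reflect=False is plain MSB-first.
    """
    crc = init & 0xFFFF
    for byte in data:
        if reflect:
            byte = _reflect(byte, 8)
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8005) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return _reflect(crc, 16) if reflect else crc


# Published check values for the two standard init=0 variants.
print(hex(crc16_8005(b"123456789")))                 # 0xfee8
print(hex(crc16_8005(b"123456789", reflect=True)))   # 0xbb3d
```

With the table's initial value the call would be crc16_8005(frame_bytes, init=0xFE54), where frame_bytes is a placeholder for whatever byte span the real patcher covers.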
References
- Cyclone IV Device Handbook
- Project IceStorm — iCE40 reverse engineering, methodology reference
- Project Mistral — Cyclone V reverse engineering, same chip family
- Quartus Prime Lite — Free FPGA development tool
Dead Ends Worth Remembering
Reverse engineering is mostly finding out which attractive hypothesis is wrong. The ones that cost real time, recorded so the next person does not repeat them:
- M5 counter carry-chain detour. Built a 24-bit counter through the open toolchain, could not get it to match Quartus's RBF. Spent a stretch of the project patching LutCodec, re-mining sig-cache entries, and chasing phase-ordering bugs in fasm2rbf. Root cause was none of those: Quartus places the design using LE-internal carry-chain wires that nextpnr-generic does not model, so Yosys emulates +1 as a 4-LE ripple with 24 self-feedback routes. The codec fixes we landed along the way were real improvements, but the real blocker was an unmodelled primitive, not a codec bug. Lesson: when your open build of design D misbehaves, flash Quartus's RBF for the same D first and diff the two bitstreams before patching anything.
- IOB cross-axis linear superposition. Plausible hypothesis: a design driving (KEY_X, LED_Y) should factor as (KEY_X-only) ⊕ (LED_Y-only) ⊕ baseline. Falsified; bank-pair lookup also failed. The residue is ~50-60 bytes of joint-placement state that neither model captures. Closing it requires a full 2D K×LED sweep (~480 pair builds), currently in progress. Derived models are not coming back; do not retry them.
- R4 dark passive mining. Tried to recover R4 BASE constants by counting bit density in a full NV32 RBF. The RBF is too dense; signal-to-noise is below the mining floor. Dead end.
- T9 LI paired-vs-alternating as a function of the routing key. Mined, structurally audited, falsified: the choice is not a function of (src_type, src_I, dst_N, dst_port). Stop mining this axis; the missing variable is elsewhere.
- DFF per-LE enable CRAM bit. Chased for a long stretch before realising the flip-flop is intrinsic to every Cyclone IV LE and has no per-LE enable cell. The original dff_cells_mined.json was routing-infrastructure noise with zero overlap against any real design. The FASM DFF directive is now a parsed no-op.
- Self-loop sig-cache entries via the two-LUT pair template. The template cannot represent src == dst, and Quartus refits between baseline and feedback compiles, so the diff includes pin reassignments unrelated to the LI MUX. The 61 self-loop entries in route_cells_full.json are bloated noise (90-754 cells vs corpus median 135) and cannot be repaired by re-running the factory. Needs a single-LE differential strategy.
- DSPMULT_GLOBAL_ON 23-cell set: falsified on silicon (2026-04-16). The re-mined 23-cell "universal block enable" looked clean: CRAM-only, CRC-stripped, 21/21 N-invariant, zero routing drift. Stage 0 flash on AX301 → LED stuck constant-on. Bisected in 4 layers down to a single cell (363236, 2) at frame 1729. The 22-cell cleaned set PASSes silicon. The leaky cell sits inside the DSPMULT block-band region but its exact semantic is unknown. Lesson: even a "clean" mining campaign with stable intersections can harbour a single load-bearing cell that interacts with unrelated fabric paths. Always silicon-validate before ungating np2fasm emission.
- IOB_OE PIN_R5: failed on silicon, bisected (2026-04-17). The 40-cell per-pin OE set for sdram_dq S_DB[0] passed all codec safety gates (0 fabric/hdr/block overlap with simple_led_pure). Flash → LED stuck constant-on. Bisected to 2 cells (363236,2) + (363672,2); the cleaned 38-cell set PASSes. Same (363236,2) cell as the DSPMULT leak; it appears to be a shared block-band hazard.
- LUT_ARITH_MULTI_LAB WIDTH=17: failed on silicon (2026-04-17). The multi-LAB carry chain blob for widths 17..32 is byte-identical to Quartus output under diff (10/10 codec tests), but flashing → LED stuck constant-off. Different failure mode from IOB_OE (stuck-on). Position-independence for multi-LAB blobs was never proven (triangle test only covered single-LAB widths ≤16). Stays gated.
- F16 output routing: mined but not integrable (2026-04-19). Differential mining (f16_loc vs f15_loc at same X7Y21N14) cleanly isolated 40 F16-specific data cells (38 header + 2 block band). The LOC-constrained f16_loc.rbf HW-verified on AX301 (LED1 responds correctly to K3∧K4). However, adding even the 16 new cells to the pipeline test RBF triggers FPGA reset: the cumulative LUT layer damage from from_cram_model() pair mapping (192/233 LUTs wrong) means infrastructure is already in a bad state. The F16 cells themselves are correct; they cannot be applied until the LUT layer is fixed.
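
The IOB superposition hypothesis in the list above is cheap to test once the four bitstreams exist. A minimal sketch, assuming equal-length RBF images already loaded as bytes; the function name and the toy byte strings are illustrative and are not project data.

```python
def superposition_residue(baseline: bytes, key_only: bytes,
                          led_only: bytes, joint: bytes) -> list:
    """Byte offsets where joint != key_only XOR led_only XOR baseline."""
    if not (len(baseline) == len(key_only) == len(led_only) == len(joint)):
        raise ValueError("all images must have the same length")
    return [
        i
        for i, (b, k, l, j) in enumerate(zip(baseline, key_only, led_only, joint))
        if (k ^ l ^ b) != j
    ]


# Toy 4-byte images: the key touches byte 0, the LED touches byte 2.
baseline = bytes([0x00, 0x10, 0x00, 0x00])
key_only = bytes([0x01, 0x10, 0x00, 0x00])
led_only = bytes([0x00, 0x10, 0x02, 0x00])
joint_linear = bytes([0x01, 0x10, 0x02, 0x00])    # factors cleanly
joint_coupled = bytes([0x01, 0x10, 0x02, 0x80])   # extra joint-placement state

print(superposition_residue(baseline, key_only, led_only, joint_linear))   # []
print(superposition_residue(baseline, key_only, led_only, joint_coupled))  # [3]
```

An empty list means the design factors as the hypothesis predicts; a non-empty list gives the offsets of the joint state that neither single-axis build explains, which is the kind of residue the entry above reports.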
Individual post-mortems with cell-level detail live in memory files under
~/.claude/projects/-home-test-EP4CE6/memory/ — search for
m5_counter_root_cause_carry_chain, iob_cross_axis_not_decomposable,
r4_dark_passive_mining_dead, t9_li_mode_negative_result,
dff_perle_formula, sigcache_mining_template_pitfall,
dspmult_global_on_clean_remine, iob_oe_r5_bisection_silicon, and
f16_output_routing_mined.
Limitations and What This Is Not
So the README is honest about scope, not just progress:
- C16 long-distance wires — untouched. Zero coverage. All current routing work is C4 / R4 / R24 / LI. Designs that would route through C16 are not supported.
- Non-E-series Cyclone IV parts — unvalidated. Every silicon result in this repo is on an EP4CE6F17C8 (AX301 board). The codec formulas have not been tested on EP4CE15/22/30/40/55/75/115, nor on Cyclone IV GX. Die topology should be similar within the E family, but "similar" is not a checked claim.
- Large designs — untested end-to-end. The hardware-verified open-flow designs are small (8-bit counter, AND gate, identity-LED). NEORV32 has been synthesised and mapped, but no NEORV32 bitstream built by the open flow has been flashed and proven to boot on silicon. Larger designs may expose codec or chipdb gaps that small tests do not.
- Temperature and voltage corners — not characterised. All silicon validation is at room temperature, nominal Vccint. Behaviour under industrial temperature range or voltage droop is not measured.
- M9K BRAM in the open flow: not hardware-validated yet. Codec and np2fasm emission are green (5/5 tests each), chipdb has M9K bels and bridge pips, but the Yosys memory_libmap front-end currently rejects the mapping with "can't share write port 0: incompatible enable", a lib/memory-shape mismatch that blocks the smoke build on tmp/m9k_smoke/ram_9x512.v. No RAM-using design has been flashed through the open flow.
- PLLs: off-fabric, out of scope. Cyclone IV PLLs live outside the CRAM region this project maps. Designs that require configured PLLs (as opposed to the dedicated clock pins the GCLK_PIN directive covers) are not supported.
- LutCodec from_cram_model() pair mapping: broken. The formula produces correct minterms at (10,10,0) but wrong bit-to-cell pair ordering at other positions. 192 out of 233 LUTs in the pipeline test produce incorrect truth tables. This is the primary blocker for flashing any non-trivial open-toolchain design. The pair ordering varies by (x,y) in ways the current formula doesn't capture; fixing it requires reverse-engineering the pair permutation.
- chipdb LOCAL bus: undersized for dense designs. The routing model provides 4 LOCAL tracks per LAB (~2080 wires total). Real Cyclone IV silicon has O(100k) routing resources (C4/R4/R24/LI crossbar). At 6500+ LEs (NEORV32 scale), nearly every LOCAL wire is overused, causing driver conflicts and FPGA protective reset. This is a fundamental routing model limitation, not a directive bug. Fix options: SIG-cache-aware placement, hierarchical routing model, or a dedicated nextpnr-cyclone4 architecture port.
- Not a Quartus replacement. The codec is not a timing-driven place-and-route tool. Its unique capabilities are bit-level bidirectional modification of a shipped bitstream and offline mutation/replay — see "Long-term direction" above. If you need PPA-competitive synthesis, use Quartus.
License
Dual license, effective 2026-04-07 (replacing the previous MIT license):
- Code (fuzz/, synth/, scripts/, everything that executes): GPL-3.0-or-later. Full text: LICENSES/GPL-3.0-or-later.txt.
- Documentation and prose (README*.md, CLAUDE.md, FINDINGS.md, docs/): CC BY-SA 4.0. Full text: LICENSES/CC-BY-SA-4.0.txt.
What copyleft covers, and what it doesn't. GPL attaches to the code
as software, and CC BY-SA attaches to the prose as a written work. Both
require downstream forks of these artifacts to stay under the same
terms. Neither license covers the methodology itself — reverse-engineering
techniques, CRAM formulas, bit offsets, and the CE10 jailbreak result
are facts, not expression, and copyright does not fence them off. We
chose copyleft anyway because it keeps the reference implementation and
the written record open, which is the part downstream users actually
rely on. If you want the methodology attached to a more durable claim,
cite the repo and the relevant FINDINGS.md entry — that is what a
defensive publication looks like.
Bitstream blobs (*.rbf, *.sof), SQLite corpora, and Quartus build
artifacts under work/ and results/rbf/ are hardware telemetry, not
creative works; no license is asserted over them, and redistribution
remains subject to Altera/Intel's original terms on their tools and
outputs.
This project is for educational and research purposes.