A complete computer in which every layer — arithmetic, OS, compiler, display — is either a trained neural network or runs entirely on GPU.
The model doesn't run on the computer. The model is the computer.
nCPU is one repository pursuing one thesis from five directions: a computer can be built out of learned components, and once the whole execution stack is differentiable, programs stop being things you write and become things you can search for by gradient descent. Each subsystem below stands on its own measurements; together they cover the stack from individual ALU operations to an operating system to program synthesis.
The five subsystems
1. The neural computer
Every ALU operation — addition, subtraction, multiplication, bitwise logic, shifts, division — is a trained neural network. The neural OS (neurOS) manages memory, schedules processes, and compiles code through 11 trained models with no hand-written fallbacks. A neural display renders characters through char→glyph MLPs and a ConvNet (143K parameters). The full pipeline — source code → neural compiler → neural assembler → neural CPU → neural display — is differentiable end to end.
The neural ALU reaches 100% accuracy on 32-bit integer arithmetic, verified exhaustively over every possible input. One result inverts the conventional hardware hierarchy: multiplication is about 12x faster than addition here (21 µs versus 248 µs), because addition needs an 8-pass carry chain while multiplication decomposes into parallel byte-pair table lookups.
| Instruction | Strategy | Latency |
|---|---|---|
| ADD/SUB/CMP | Kogge-Stone carry-lookahead (8 passes) | 248 µs |
| MUL | Byte-pair LUT (65,536 entries) | 21 µs |
| AND/OR/XOR | Vectorized truth table | 21 µs |
| SHL/SHR | Attention-based bit routing | 434 µs |
| DIV | Restoring division (neural subtraction) | varies |
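The byte-pair strategy listed for MUL can be sketched in plain Python. This is an illustration under assumptions, not the repository's trained model: the table layout and the names `BYTE_PAIR_LUT`, `to_bytes_le`, and `mul32` are hypothetical. It shows why the lookups are independent of one another: each of the byte pairs is resolved by one table read, and only the final accumulation combines them.

```python
# Sketch only: a plain-Python stand-in for the byte-pair lookup strategy.
# The table layout and function names are assumptions, not the repository's code.

MASK32 = 0xFFFFFFFF

# 256 x 256 = 65,536 entries: the product of every pair of bytes.
BYTE_PAIR_LUT = [[a * b for b in range(256)] for a in range(256)]


def to_bytes_le(x):
    """Split a 32-bit value into four bytes, least significant first."""
    return [(x >> (8 * i)) & 0xFF for i in range(4)]


def mul32(x, y):
    """32-bit wrapping multiply built from table lookups, shifts, and adds."""
    xs, ys = to_bytes_le(x & MASK32), to_bytes_le(y & MASK32)
    total = 0
    for i, xb in enumerate(xs):
        for j, yb in enumerate(ys):
            shift = 8 * (i + j)
            if shift < 32:  # higher partial products vanish modulo 2**32
                total += BYTE_PAIR_LUT[xb][yb] << shift
    return total & MASK32


print(mul32(7, 6))  # 42
```

No lookup depends on the result of another, which is the property the latency table attributes the speed to; a carry chain, by contrast, has to pass information from low bits to high bits.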
neurOS component accuracy:
| Component | Accuracy | Component | Accuracy |
|---|---|---|---|
| MMU | 100% | Assembler codegen | 100% |
| TLB | 99.6% | Assembler tokenizer | 99.4% |
| Cache | 99.7% | Compiler optimizer | 95.2% |
| Scheduler | 99.2% | Watchdog | 100% |
| Prefetch | 97.8% | Block allocator | 98.4% |
2. The GPU computer
A self-sufficient computer on a single GPU — the CPU is involved only at bootstrap. The Rust + Metal kernel executes about 200 ARM64 instructions (integer and floating-point) at roughly 1.9M instructions per second, with zero-copy shared memory and zero cycle-count variance across runs (σ = 0.0).
What runs on it:
- A multi-process UNIX OS: fork/pipe/wait, a 25-command shell, 28 syscalls, up to 15 concurrent processes
- A self-hosting C compiler (~4,200 lines) that compiles itself, then compiles and runs other programs — entirely on the GPU
- Real Linux binaries via an ELF64 loader: BusyBox (264KB, 34+ commands) and Alpine Linux v3.20
- 13+ compiled C applications: SHA-256, AES-128, Tetris, Snake, a Brainfuck interpreter, a Forth REPL, a CHIP-8 emulator, an HTTP server, an MNIST classifier, and others
- A 26-command deterministic debugger: instruction tracing, breakpoints and watchpoints, time-travel debugging, a memory sanitizer, automated fuzzing, reverse data-flow analysis, and constant-time verification. Deterministic execution is what makes time-travel and exact replay possible; conventional CPUs, with cache- and speculation-induced timing noise, can't offer the same guarantees.
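Why determinism makes time travel cheap can be seen in a toy model. The sketch below is not the repository's debugger; `ToyMachine` and `state_at` are hypothetical names. Because each state depends only on the previous one, "stepping backwards" is just replaying from reset to one step earlier, and the replay is exact every time.

```python
# Sketch only: a toy deterministic machine, not the repository's debugger.
# `ToyMachine`, `step`, and `state_at` are hypothetical names.

class ToyMachine:
    """A tiny accumulator machine whose next state depends only on its current state."""

    def __init__(self, program):
        self.program = program
        self.reset()

    def reset(self):
        self.pc = 0
        self.acc = 0

    def step(self):
        op, arg = self.program[self.pc]
        if op == "add":
            self.acc += arg
        elif op == "mul":
            self.acc *= arg
        self.pc += 1

    def snapshot(self):
        return (self.pc, self.acc)


def state_at(machine, n):
    """'Travel' to the state after n steps by replaying from reset."""
    machine.reset()
    for _ in range(n):
        machine.step()
    return machine.snapshot()


m = ToyMachine([("add", 3), ("mul", 4), ("add", -2)])
print(state_at(m, 3))  # (3, 10)
print(state_at(m, 2))  # (2, 12), one step "back in time"
```

A practical debugger would presumably add periodic checkpoints so that replay starts from the nearest one rather than from reset; that refinement is omitted here.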
3. Differentiable program synthesis
Given input/output examples, gradient descent discovers executable programs by backpropagating through the differentiable CPU. A candidate program is a set of continuous parameters — Gumbel-softmax distributions over opcodes, soft attention over registers — that temperature annealing collapses into discrete, runnable code.
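A minimal pure-Python sketch of the relaxation described above, under stated assumptions: the three-opcode set, the logits, and the annealing schedule are made up for illustration, and a real synthesizer would learn the logits by backpropagation rather than fix them. It shows the two ingredients: a Gumbel-softmax distribution over opcodes, and a "soft" instruction whose result is the probability-weighted mix of every opcode's result.

```python
import math
import random

# Sketch only: pure-Python Gumbel-softmax over a hypothetical three-opcode set.
OPCODES = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "MUL": lambda a, b: a * b,
}


def sample_gumbel(n, rng):
    """Draw n samples of Gumbel(0, 1) noise as -log(-log(u))."""
    return [-math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9))) for _ in range(n)]


def gumbel_softmax(logits, noise, tau):
    """Softmax of (logits + noise) / tau; lower tau gives a sharper distribution."""
    scaled = [(l + g) / tau for l, g in zip(logits, noise)]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def soft_execute(probs, a, b):
    """Relaxed instruction: probability-weighted mix of every opcode's result."""
    return sum(p * op(a, b) for p, op in zip(probs, OPCODES.values()))


rng = random.Random(0)
logits = [2.0, 0.5, 0.1]        # learnable parameters in a real synthesizer
noise = sample_gumbel(len(logits), rng)
for tau in (5.0, 1.0, 0.05):    # annealing schedule, chosen arbitrarily
    probs = gumbel_softmax(logits, noise, tau)
    print(tau, [round(p, 3) for p in probs], round(soft_execute(probs, 3.0, 5.0), 3))
```

As `tau` falls, the distribution concentrates on a single opcode and the soft result approaches what that one discrete instruction would compute, which is the collapse into runnable code.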
Two synthesizers cover two benchmark suites, both at full coverage:
- Mog (grammar-constrained, differentiable compiler): 315/315 problems.
- nSynth (Rust solver portfolio): 105/105 problems on the expanded suite. (The paper's canonical suite is the earlier 95-problem version; the expanded suite adds a template-solver family.)
nSynth's coverage by solver family:
| Family | Solved | Method |
|---|---|---|
| Gradient | 66/105 | Differentiable search with a learned restart bank |
| Enumerative | 21/105 | Bottom-up expression enumeration |
| Search | 13/105 | Single-branch, struct-pair, and string teachers |
| Template | 5/105 | Pattern matching for the hardest problems |
No single family gets close to full coverage alone; the portfolio does. Key optimizations: persistent solved-program memoization (about a 5000x speedup on a cache hit), a learned bias bank with warm-refine transfer across problems, and a constant vocabulary mined from the examples themselves.
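For readers unfamiliar with the enumerative family, here is a textbook sketch of bottom-up expression enumeration. It is not nSynth's Rust implementation: the grammar (`x`, a few small constants, `+ - *`), the function name `enumerate_program`, and the round limit are all assumptions. Expressions are grown from smaller ones, and only one expression is kept per distinct behaviour on the examples.

```python
# Sketch only: textbook bottom-up enumeration with observational-equivalence
# pruning. The grammar (x, small constants, + - *) is an assumption.

def enumerate_program(examples, constants=(0, 1, 2), max_rounds=2):
    """Return an expression string over `x` matching every (input, output) pair."""
    inputs = [x for x, _ in examples]
    target = tuple(y for _, y in examples)

    # Map each output signature to the first expression that produces it.
    seen = {tuple(inputs): "x"}
    for c in constants:
        seen.setdefault(tuple(c for _ in inputs), str(c))

    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    for _ in range(max_rounds):
        if target in seen:
            return seen[target]
        pool = list(seen.items())  # freeze this round's building blocks
        for sig_a, expr_a in pool:
            for sig_b, expr_b in pool:
                for sym, fn in ops.items():
                    sig = tuple(fn(a, b) for a, b in zip(sig_a, sig_b))
                    # Keep one expression per distinct behaviour on the examples.
                    seen.setdefault(sig, f"({expr_a} {sym} {expr_b})")
    return seen.get(target)


print(enumerate_program([(0, 0), (3, 9), (2, 4)]))  # (x * x)
```

The pruning step is what keeps the search tractable: two expressions that agree on every example are interchangeable as building blocks, so only the first is retained.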
4. The differentiable coprocessor
The neural ALU injected into a transformer's forward pass as a routed expert. A learned per-token gate decides whether each token flows through the original MLP or through the neural ALU. Bilinear soft truth tables provide differentiable logic, tensor ops provide differentiable arithmetic, and gating is modulated by model confidence.
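One common way to make a truth table differentiable is to treat each input as the probability that a bit is 1 and take the expected table output, which is bilinear in the two inputs. The sketch below follows that reading; whether the repository uses exactly this form is an assumption, and `soft_gate` and `route` are hypothetical names. The `route` helper shows the per-token blend between the original MLP output and the ALU output.

```python
# Sketch only: one common way to relax a two-input truth table bilinearly.
# Function names are hypothetical; inputs are probabilities that a bit is 1.

TRUTH_TABLES = {
    "AND": [[0.0, 0.0], [0.0, 1.0]],
    "OR":  [[0.0, 1.0], [1.0, 1.0]],
    "XOR": [[0.0, 1.0], [1.0, 0.0]],
}


def soft_gate(table, pa, pb):
    """Expected table output when the inputs are independent Bernoulli bits."""
    dist_a = (1.0 - pa, pa)
    dist_b = (1.0 - pb, pb)
    return sum(dist_a[i] * dist_b[j] * table[i][j] for i in (0, 1) for j in (0, 1))


def route(gate, mlp_out, alu_out):
    """Per-token soft routing: blend the two expert outputs by the gate value."""
    return [gate * a + (1.0 - gate) * m for m, a in zip(mlp_out, alu_out)]


print(soft_gate(TRUTH_TABLES["XOR"], 0.5, 0.5))  # 0.5
print(route(1.0, [1.0, 2.0], [5.0, 6.0]))        # [5.0, 6.0]
```

At the corners (inputs exactly 0 or 1) the relaxed gate reproduces the discrete truth table, and in between it has well-defined gradients with respect to both inputs.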
Results from an 11-model sweep across the Qwen 2.5/3/3.5 families, on arithmetic tasks:
| Model | Arithmetic accuracy | Note |
|---|---|---|
| Qwen3.5-2B (instruct) | 14.5% → 71.0% (+56.5 pp) | best overall |
| Qwen3.5-2B (base) | 15.5% → 63.0% (+47.5 pp) | 100% on ADD/SUB/MUL/DIV |
| Qwen3.5-4B | +51.0 pp | largest base-model gain (tied) |
| Qwen3.5-9B | +51.0 pp | largest base-model gain (tied) |
Real-world transfer is measured, not extrapolated: on full HumanEval (164 problems; Qwen3.5-4B, A100), the score rose from 62.2% to 64.6%, which is four additional problems solved (102 → 106).
5. JEPA predictive machine dynamics
A predictive world model of the computer itself. Alongside exact execution, a JEPA-style network (Joint Embedding Predictive Architecture) learns to predict machine state transitions in a compressed latent space:
latent_state_t + instruction → predictor → latent_state_{t+1}
It runs at two levels. A Python demo (ncpu/jepa_neural_cpu/) executes real
programs next to the predictor, turning prediction error into a live anomaly
signal. A Rust Metal implementation (kernels/rust_metal/src/jepa/, 2,858
lines) observes deterministic GPU execution and actively steers scheduling
through learned bias overrides.
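The anomaly signal can be illustrated with a hand-written stand-in for the learned predictor. Everything here is hypothetical (`encode`, `predict`, `anomaly_score`, and the one-instruction toy machine); the real predictor is a trained network operating on a compressed latent. The point is only the shape of the loop: predict the next latent, execute exactly, and score the difference.

```python
# Sketch only: a hand-written stand-in for a learned latent predictor.
# `encode`, `predict`, and `anomaly_score` are hypothetical names.

def encode(state):
    """Toy encoder: keep the state as a float vector (a real one would compress it)."""
    return [float(v) for v in state]


def predict(latent, instruction):
    """Toy predictor for a machine whose only instruction adds to register 0."""
    op, arg = instruction
    nxt = list(latent)
    if op == "add":
        nxt[0] += arg
    return nxt


def anomaly_score(predicted, actual):
    """Squared distance between the predicted and the observed next latent."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))


predicted = predict(encode([1, 0]), ("add", 2))
print(anomaly_score(predicted, encode([3, 0])))  # 0.0, execution matched the prediction
print(anomaly_score(predicted, encode([3, 1])))  # 1.0, register 1 changed unexpectedly
```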
Because the substrate underneath is exact, this world model has two properties most lack: unlimited free ground truth (run more programs), and the ability to mix predicted and exact execution at will — cheap latent speculation when exploring, exact execution when it matters. The long-term direction is a hierarchy of predictors at the bit, instruction, program, and task levels.
python3 -m ncpu.jepa_neural_cpu.demo     # bottom-up JEPA neural computer demo
python -m ncpu.world_model.quickstart    # JEPA machine world model quickstart
Start in 60 seconds
pip install -e ".[demo,dev]"

# The headline demo: the GPU as a complete computer (macOS / Apple Silicon)
python -m ncpu gpu                    # boot it
python -m ncpu gpu --neural-alu       # with the neural ALU inside the Metal shader
python -m ncpu gpu debug              # 26-command deterministic debugger

# Cross-platform, no heavy dependencies
python -m ncpu discover               # program by examples, via differentiable synthesis
python -m ncpu text --interactive     # neural text / cipher machine

# Full neural pipeline (requires the model stack)
python -m ncpu full-neural            # bottom-up neural CPU + neural display
python -m ncpu meta-compare           # side-by-side comparison demo

# JEPA predictive layer
python3 -m ncpu.jepa_neural_cpu.demo
python -m ncpu.world_model.quickstart

# Rust-native, no Python required
cd kernels/rust_metal
cargo run --bin ncpu_run -- --elf ../../demos/gpu/busybox.elf --rootfs -- echo hello
Three execution modes
| Mode | What runs | Differentiable? | Speed |
|---|---|---|---|
| Neural | 13 trained .pt models | yes (full gradient flow) | ~5K IPS |
| Fast | native tensor ops | yes (standard autograd) | ~5K IPS |
| Compute | Rust + Metal shader | no (discrete hardware) | ~1.9M IPS |
All three execute the same programs and produce the same results. Neural mode sends every operation through trained networks. Fast mode uses native tensors with the same ISA and the same differentiability. Compute mode trades gradient flow for speed — it is where the UNIX OS boots, the compiler self-hosts, and BusyBox runs.
# Neural mode: every operation is a trained model
from ncpu.model import CPU

cpu = CPU(neural_execution=True)
cpu.load_program("MOV R0, 7\nMOV R1, 6\nMUL R2, R0, R1\nHALT")
cpu.run()
print(cpu.get_register("R2"))  # 42, computed by the neural byte-pair LUT

# Differentiable coprocessor: inject into any Hugging Face model
from ncpu.coprocessor import inject_ncpu_coprocessor, NCPUCoprocessorConfig

config = NCPUCoprocessorConfig(confidence_aware=True, deterministic_alu=True)
inject_ncpu_coprocessor(model, config)  # `model` is an already-loaded Hugging Face model

# Differentiable program synthesis
from ncpu.differentiable import ProgramSynthesizer, SynthesisSpec

spec = SynthesisSpec(examples=[
    ({0: 3.0, 1: 5.0}, {2: 8.0}),
    ({0: 7.0, 1: 2.0}, {2: 9.0}),
])
synth = ProgramSynthesizer(max_program_len=6)
result = synth.synthesize(spec, max_iters=2000)  # discovers: ADD R2, R0, R1; HALT

# JEPA world model: predict machine state transitions
from ncpu.world_model.je_world_model import JEWorldModel, JEWMConfig

model = JEWorldModel(JEWMConfig(state_dim=22, action_dim=8))
# `state` and `action` are supplied by the caller
pred = model.predict_next_latent(model.encode_state(state), model.encode_action(action))
The full stack
| Layer | Implementation | Result |
|---|---|---|
| ALU | 13 trained .pt models | Exact 32-bit integer arithmetic, exhaustively verified |
| OS | neurOS — 11 neural models, no fallbacks | Learned MMU, TLB, cache, scheduler, compiler |
| GPU compute | Rust Metal kernel, ~200 ARM64 instructions | Arbitrary programs at ~1.9M IPS |
| UNIX OS | Compiled C on Metal | fork/pipe/wait, 25-command shell, 28 syscalls |
| Compiler | cc.c, ~4,200 lines, self-hosting | Compiles itself, then compiles programs — on GPU |
| ELF loader | Real Linux binaries on GPU | BusyBox and Alpine Linux v3.20 on Metal |
| Coprocessor | Neural ALU in a transformer forward pass | Tokens routed through neural arithmetic, measured gains |
| JEPA | Predictive world model of machine dynamics | Latent speculation + anomaly detection over an exact substrate |
| Program synthesis | Backprop through execution | Programs discovered from I/O examples |
| Constant-time crypto | AES-128 ECB/CBC (ncpu/crypto/) | σ = 0.0 timing; FIPS 197 + NIST SP 800-38A vectors pass |
| Multi-GPU | Distributed cores with shared memory | fork/pipe/wait across GPUs; parallel and pipeline execution |
| SOME | Hidden controller with latent heads | Self-optimizing inference; HumanEval+ and BigCodeBench gains |
Timing side-channel immunity
GPU execution here produces zero cycle-count variance — σ = 0.0 across 270 runs, where the same code on native Apple Silicon shows 47–73% timing variance. With no data cache there are no cache lines and no cache-miss penalty, so AES T-table attacks have nothing to measure.
Built on that property, ncpu/crypto/ provides constant-time AES-128 (ECB and
CBC) from 19 constant-time primitives, passing all FIPS 197 and NIST SP
800-38A test vectors.
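The flavour of a constant-time primitive can be shown with branch-free byte operations. This is a sketch of the general technique, not the contents of ncpu/crypto/: the names `ct_select`, `ct_eq`, and `ct_lookup` are hypothetical, and Python itself gives no timing guarantees, so the code only illustrates the data-independent control flow and access pattern.

```python
# Sketch only: branch-free byte operations of the kind constant-time code is
# built from. Python itself gives no timing guarantees; names are hypothetical.

def ct_select(cond, a, b):
    """Return a if cond == 1 else b, for bytes, without branching on cond."""
    mask = (-cond) & 0xFF          # 0xFF when cond is 1, 0x00 when cond is 0
    return (a & mask) | (b & ~mask & 0xFF)


def ct_eq(a, b):
    """Return 1 if bytes a and b are equal, else 0, without branching."""
    diff = a ^ b                   # zero only when a == b
    return ((diff - 1) >> 8) & 1   # borrow bit is set only for diff == 0


def ct_lookup(table, index):
    """Read table[index] by touching every entry, whatever the index is."""
    result = 0
    for i, value in enumerate(table):
        result |= ct_select(ct_eq(i, index), value, 0)
    return result


table = [(i * 7) % 256 for i in range(256)]
print(ct_lookup(table, 37) == table[37])  # True
```

A table read written this way visits the same entries in the same order for every index, which is the software counterpart of the hardware property described above.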
Self-Optimizing Machine Engine (SOME)
A hidden controller that turns part of the neural machine into an internal coprocessor for code generation: a buffered think → write → verify → patch → commit loop, learned action/halt/descriptor/state-patch/memory heads, and task-local fast weights updated during inference. The learned memory head reduced validation MSE by 83.26% relative to the baseline.
Measured end to end: HumanEval+ for qwen3.5:4b improved 147 → 154 and for qwen3.5:9b 144 → 156; BigCodeBench-Hard for qwen3.5:9b improved 33 → 49.
MUXLEQ: Turing-complete in two instructions
SUBLEQ plus MUX, running in all three execution modes. In neural mode, SUB
goes through the Kogge-Stone carry-lookahead (~248 µs) and MUX through neural
AND/OR/NOT (~63 µs). It loads .dec images and boots eForth. The point: if
trained networks can exactly execute a Turing-complete two-instruction
machine, the construction extends in principle to any instruction set.
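For reference, the SUBLEQ half of the machine fits in a few lines. The sketch below is a plain SUBLEQ interpreter only: MUXLEQ's extra MUX instruction and its encoding are not modelled, and halting on a negative jump target is a common convention assumed here rather than taken from the repository. The sample program adds two numbers using nothing but subtraction.

```python
# Sketch only: a plain SUBLEQ interpreter. MUXLEQ's extra MUX instruction and
# its encoding are not modelled here; halting on a negative jump target is a
# common convention assumed for this example.

def run_subleq(mem, max_steps=10_000):
    """Execute SUBLEQ triples (a, b, c) in place and return the memory."""
    pc = 0
    for _ in range(max_steps):
        if pc < 0 or pc + 2 >= len(mem):
            break
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]                   # the only arithmetic: subtract
        pc = c if mem[b] <= 0 else pc + 3  # the only control flow: branch if <= 0
    return mem


# Y += X using a scratch cell Z; data lives at X=9, Y=10, Z=11.
program = [
    9, 11, 3,    # Z -= X      (Z becomes -X)
    11, 10, 6,   # Y -= Z      (Y becomes Y + X)
    11, 11, -1,  # Z -= Z, then jump to -1 to halt
    7, 35, 0,    # X, Y, Z
]
print(run_subleq(program)[10])  # 42
```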
Program synthesis from examples (nsynth_codegen)
cargo build --release --bin nsynth_codegen
./target/release/nsynth_codegen --lang python --examples '{
  "name":"square","signature":"fn square(x: i64) -> i64",
  "examples":[{"inputs":[0],"expected":0},{"inputs":[3],"expected":9}]
}'
# → def square(x: int): return (0 * x * x) + (1 * x * x) + 0
Project structure
ncpu/
differentiable/ # Differentiable execution, program synthesis, ISA discovery
coprocessor/ # Inject nCPU into transformer forward passes
 execution_training/ # Differentiable execution as training signal for code LMs
crypto/ # Constant-time crypto (AES-128)
distributed/ # Multi-GPU distributed execution
jepa_neural_cpu/ # Bottom-up JEPA neural computer demo
world_model/ # JEPA machine world model (predictive dynamics)
autoresearch/ # Automated research + compounding NPCoT loop
 os/
   neuros/ # Neural OS: 17 modules (MMU, TLB, cache, scheduler...)
   gpu/ # GPU UNIX OS: shell, filesystem, ELF loader, C source
self_optimizing/ # SOME: hidden controller, fast weights
neural/ # NeuralCPU: neural ALU bridge, weave pipeline
model/ # Model-based CPU (neural_ops, assembler)
tensor/ # Tensor-based ARM64 emulator (differentiable)
# Compiled / accelerated backends
kernels/ # rust_metal (Rust+Metal ARM64 kernel), mlx, npcot_wasm
nsynth/ # Rust program synthesizer (gradient + enumerative + search)
packages/ # Companion packages (metal_mlp)
# Models & synthesis corpus
models/ # Trained neural-component weights (see models/MODEL_INDEX.md)
programs/ # Synthesis benchmark corpus (arithmetic, bitwise, algorithms, ...)
# Evidence, paper, experiments
artifacts/ # Committed benchmark results cited by the paper + tests
paper/ # Research paper + modular sections
benchmarks/ # Benchmark driver scripts
experiments/ # Exploratory experiment runs
# Usage & ops
examples/ # Minimal runnable demos (one per execution path)
demos/ # Larger showcase walkthroughs (BusyBox, Alpine, compiler)
scripts/ # Entry points + maintainer automation
tools/ # Developer tooling
training/ # Training pipelines
packaging/ # Deployment scaffolding (Homebrew, Modal, DEPLOYMENT.md)
# Tests, docs, assets
tests/ # Test suite (see tests/README.md)
docs/ # Documentation
assets/ # Logos / static assets
# Build & runtime output (gitignored — regenerable, not committed)
checkpoints/ # Large .pt weight checkpoints
training_results/ # Coprocessor scaling sweeps, ablation studies
dist/ # Build distributions
logs/ outputs/ # Run logs and scratch outputs
Every top-level directory has its own README.md describing its purpose.
Tests
python -m ncpu doctor
pytest tests/ -q # 2,500+ tests across the stack

Coverage spans exhaustive formal verification of the ALU, neural ops, neurOS, compute mode, multi-process execution, MUXLEQ, BusyBox/Alpine, the GPU debugging toolkit, the coprocessor, Mog synthesis, differentiable execution, constant-time crypto, self-modifying programs, the diff compiler, multi-GPU distribution, SOME, and the JEPA predictive models.
Documentation
- Research paper — the full analysis and findings
- GPU debugging toolkit paper — the 26-command GPU-native debugger
- GPU debugging toolkit reference — command reference
- Rust Metal kernel — architecture, zero-copy design, build instructions
- Compilation pipeline — end-to-end C-to-GPU flow
- JEPA neural CPU — bottom-up neural computer architecture
- JEPA machine world model — predictive dynamics design
- Model index — complete trained-model inventory
- SOME complete guide — hidden controller and training pipeline
- Differentiable programs — program optimization, synthesis, ISA discovery
- Benchmark results — pass@1 numbers for every mode and model tier
License
MIT
