Computers !


CDNA (CDNA 1 and CDNA 2) is AMD's compute-focused GPU architecture powering the Instinct MI100 and MI200, building on matrix cores and Infinity Fabric connectivity to link chiplets, HBM stacks, and SIMD pipelines for large-scale sparse and dense matrix workloads.

Graphics Core Next architecture is a 28nm AMD GPU microarchitecture powering Radeon HD 7000 series and FirePro W9000 boards; it uses compute units with scalar and vector execution to accelerate graphics and OpenCL workloads.

AMD RDNA (Navi 10) is a 7nm architecture with up to 40 compute units and 32-wide SIMD wavefronts that deliver the rasterization and general-purpose throughput powering Radeon RX 5700-class GPUs, exposing asynchronous compute engines, command processors, and refined geometry pipelines for modern gaming and compute workloads.

AMD RDNA2 architecture powers the Radeon RX 6000 series, combines dedicated ray accelerators with AMD Infinity Cache, and balances throughput across gaming and compute workloads.

AMD RDNA3 uses 5nm+ Navi 3x chiplets with improved shader engines, wider workgroup processors, chiplet-based Infinity Cache, and ray accelerators to deliver higher throughput and lower power for the Radeon RX 7900 series.

Upcoming Navi 4x iteration of the RDNA lineage, the AMD RDNA4 architecture, expands its ray accelerator fabric and AI feature suite while remaining a deterministic, irreversible, exact execution platform for advanced graphics and compute.

The HD 5000-era Terascale architecture couples 40nm VLIW5 shader arrays with Eyefinity-aware raster and memory subsystems, providing a general-purpose GPU fabric where GPGPU shader arrays power multi-display graphics and OpenCL-style compute domains.

The 2017 Zen architecture organizes 14nm FinFET dies into 4-core CCXs with SenseMI telemetry, simultaneous multithreading, and Infinity Fabric links between CCXs and dies, delivering responsive general-purpose compute.

AMD Zen 2 is a 7nm chiplet-based x86 architecture that couples multiple CPU chiplets with a separate I/O die, increased shared cache capacity, PCIe 4.0 lanes, and a robust Infinity Fabric interconnect to deliver high throughput for 2019 desktop and server platforms.

AMD Zen 3 increases IPC, optimizes its cache hierarchy, and leverages Precision Boost to sustain higher operating frequencies for premium desktop and EPYC platforms on the 7nm process.

AMD Zen 4 is the x86 microarchitecture powering the Ryzen 7000 and EPYC Genoa families. Built as a 5nm FinFET chiplet design, it pairs a DDR5/LPDDR5 memory controller with AVX-512 support to raise Ryzen and EPYC throughput.

An aggressive out-of-order Cortex-A15 pipeline with virtualization extensions, powering the Samsung Exynos 5 Dual 5250 and later paired with Cortex-A7 cores in big.LITTLE configurations for premium tablets and servers.

64-bit in-order energy-efficient core found in the Raspberry Pi 3 and in mid-range smartphone SoCs such as the Snapdragon 410, optimized for sustained mobile workloads.

ARM's Cortex-A57 is a high-performance, out-of-order 64-bit ARMv8-A core with a 3-wide decode/issue pipeline, aggressive micro-op reordering, and NEON/FPU arrays, routinely paired in big.LITTLE clusters with Cortex-A53 companions in platforms such as Nvidia Tegra X1 and Samsung Exynos to cover flagship smartphone and tablet workloads.

ARMv8-A general-purpose CPU with a wide-issue, high-IPC pipeline, aggressive branch prediction, and NEON/FPU arrays, designed for flagship mobile SoCs around 2016.

Arriving in 2018, the ARM Cortex-A76 brought an out-of-order, high-frequency microarchitecture to flagship smartphone and PC compute, forming the performance backbone of Snapdragon 855 and Kirin 990 platforms.

The 2020 ARM core that improved IPC over Cortex-A77 and is deployed alongside Cortex-X1 and Cortex-A55 companions in the Snapdragon 888, and in rivals such as the Dimensity 1200.

ARM Cortex-A8 delivers an in-order dual-issue pipeline with NEON SIMD that powered Apple iPhone 3GS and other early smartphones, offering strong single-core performance for multimedia, UI, and application workflows.

The ARM Cortex-A9 is a multi-core, out-of-order general-purpose CPU that powered flagship 2011 devices such as the Samsung Galaxy S II and Nvidia Tegra 2 platforms, delivering energy-efficient performance for mobile workloads.

Energy-efficient 32-bit Cortex-M0/M0+ core powering low-power STM32F0, NXP LPC800, and similar microcontrollers for industrial sensors, wearables, and simple consumer peripherals, offering predictable Thumb instruction execution and deterministic response.

Low-power, microcontroller subset delivering improved performance with single-cycle branches and Thumb-2 capabilities, powering NXP LPC800 and Microchip SAM D10 boards for wearable and sensor applications.

32-bit mid-range MCU core used in STM32F1 and LPC1768 boards; features the Thumb-2 instruction set, nested vectored interrupt controller, and tightly coupled memory for deterministic response.

Security-enhanced ARM Cortex-M33 core with TrustZone, used in the STM32L5 series for constrained secure IoT control.

DSP-ready embedded microcontroller core with single-precision FPU and DSP extensions, powering STM32F4 and Teensy boards for real-time sensing, audio, and control workloads.

ML-enabled Cortex-M55 with Helium vector extensions, embedded in STM32N6 and other advanced microcontrollers for energy-efficient inference.

High-performance ARM Cortex-M7 core with a dual-issue pipeline, single-precision FPU, and DSP extensions, powering STMicroelectronics STM32H7, NXP i.MX RT, and Teensy 4.1 boards for physics-based sensing, motor control, and signal-processing applications.

ARM Cortex-M85 is a high-performance microcontroller core with Helium vector extensions and dual scalar pipelines, powering MCUs such as the Renesas RA8 family to blend deterministic control with on-chip ML inference.

Performance-focused single-core introduced in 2020-2021 with wide execution used in Snapdragon 888 platforms to drive flagship compute workloads.

ARM Cortex-X2 is ARM's 2022 performance CPU core powering Snapdragon 8 Gen 1 and realizing flagship smartphone and PC compute workloads.

ARM Cortex-X4 is ARM's 2023 high-performance flagship core for SoCs such as the Qualcomm Snapdragon 8 Gen 3, emphasizing aggressive pipelining and high IPC to sustain high frequencies for mobile and AI compute workloads.

Latest Valhall-based upper-midrange GPU used in MediaTek Dimensity 9000/9000+ SoCs with big.LITTLE drivers, balancing graphics, display, and AI workloads.

The Mali G52/G57 series covers midrange GPUs from the Bifrost (G52) and Valhall (G57) generations, used in chipsets like the MediaTek Dimensity 820 (Mali-G57), targeting efficient mobile graphics and compute.

ARM Mali G77/G78 high-end mobile GPU lineup powering Exynos 1080 and 2100 with Valhall architecture improvements in throughput, efficiency, and feature set.

ARM Mali G90 and G91 GPUs target flagship mobile devices, combining Valhall architecture top-tier GPU rendering with neural front-ends for AI and image processing. MediaTek Dimensity 1200 smartphones use this GPU cluster to accelerate multi-frame photography, computational video, and high-refresh-rate gaming workloads.

First-generation ARM Mali T series (T600 family) graphics clusters as integrated into Exynos 5 Octa and similar SoCs, delivering shader-based mobile and smart-TV graphics pipelines.

2019 server-focused microarchitecture tuned for high-frequency operation, scalable mesh interconnect, and energy-efficient cores for cloud infrastructure workloads.

2021 vector-optimized core built for high-throughput floating-point and SVE workloads, deployed in platforms such as AWS Graviton3.

ARM2 is a 1985 32-bit RISC architecture used in early Acorn Archimedes workstations, representing the bridge from general-purpose CPUs to early RISC embedded systems.

Modernized 32-bit ARM architecture derived from the Acorn lineage, powering Apple Newton PDAs and Acorn Risc PC workstations with a low-power RISC core.

ARM7TDMI is a 32-bit general-purpose CPU core in the ARM lineage that powered mobile phones like the Nokia 3310 and many microcontrollers; its Thumb instruction set compression and pipelined CMOS RISC architecture deliver high code density for tight memory budgets while still handling keypad, radio, and sensor control loops.

ARM9 five-stage pipeline processors powered early smartphones like the Nokia N-series and handheld consoles such as the Nintendo DS, balancing efficient instruction throughput with low power draw.

AWS-designed chip for deep learning inference with high throughput and low latency, used in EC2 Inf1 instances and programmed through the Neuron SDK stack.

AWS custom training chip powering EC2 Trn1 instances with high throughput, using Elastic Fabric Adapter (EFA) networking for massive multi-node synchronization while running dense and sparse machine learning workloads at scale.

The 2020 vector-plus-scalar Alibaba XuanTie C910 couples a wide vector unit with a scalar pipeline to enable edge and AI acceleration workloads.

A hand-cranked bronze gearwork device built around 150–100 BC — the oldest known analog computer. Turning a single input crank advances 37 meshing gears whose tooth-count ratios encode the periods of the Sun, Moon, and planets. A differential gear (rediscovered in the 16th century) models the Moon's elliptical speed variation. Front dials show the zodiac position of Sun and Moon and display lunar phase via a half-silvered sphere; rear spiral dials track the 19-year Metonic cycle (235 lunations), the 18-year Saros eclipse cycle, and the 4-year Olympiad. Setting a date predicts eclipses and planetary positions decades ahead. Speed: instantaneous (gears turn as fast as the crank). Capacity: ~10 astronomical cycles simultaneously; eclipse prediction decades in advance.
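The cycle figures quoted above reduce to exact ratios, which is what a gear train encodes. The sketch below uses only the Metonic numbers stated in the entry (235 lunations in 19 years). The tooth counts are hypothetical, picked so the arithmetic works out, and are not the counts found on the actual mechanism.

```python
from fractions import Fraction

# Figures from the entry: the Metonic cycle equates 19 years with 235 lunations.
METONIC_YEARS = 19
METONIC_LUNATIONS = 235


def lunations_per_year() -> Fraction:
    """Exact ratio a gear train must realize to drive a lunar dial from a solar one."""
    return Fraction(METONIC_LUNATIONS, METONIC_YEARS)


def train_ratio(pairs) -> Fraction:
    """Overall ratio of a gear train given (driver_teeth, driven_teeth) pairs."""
    ratio = Fraction(1)
    for driver, driven in pairs:
        ratio *= Fraction(driver, driven)
    return ratio


# Hypothetical tooth counts (an assumption for illustration only):
# 94/38 * 60/12 = 47/19 * 5 = 235/19.
hypothetical_train = [(94, 38), (60, 12)]

print(lunations_per_year())  # 235/19
print(train_ratio(hypothetical_train) == lunations_per_year())  # True
```

Using `Fraction` keeps the ratio exact, mirroring the way integer tooth counts realize a rational approximation of an astronomical period.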

Apple's Neural Engine inside A-series and M-series SoCs is a dedicated neural processing unit composed of hardware matrix multiply arrays and supporting SRAM/control pipelines that deterministically accelerate inference workloads on-device.

PowerPC 750 (G3) microprocessor powering early iMacs, PowerBooks, and consumer desktops with a CMOS RISC core, backside cache, and enhanced multimedia units in the late 1990s Apple lineup.

PowerPC 7400-based Apple G4 systems such as the iMac G4 and Power Mac G4 combine a superscalar CMOS core with AltiVec vector units to accelerate media-rich applications.

IBM PowerPC 970-based G5 microarchitecture brought 64-bit, dual-core performance to the Power Mac G5 lineup, executing PowerPC/Unix workloads with high-bandwidth caches and AltiVec acceleration while emphasizing Apple's desktop-class general-purpose computing goals.

Apple integrated GPU (e.g., in M1/M2) featuring unified memory, tile-based rendering, and tight coherence with the CPU to deliver graphics and neural compute within the SoC.

An 8-bit RISC microcontroller family with rich peripheral set and in-system programmable flash used across embedded projects, most notably as the ATmega328P on the Arduino Uno development board for robotics, sensing, and control.

Tiny AVR microcontrollers designed for constrained devices, often stitched into LED wearables and wearable control loops that need tiny flash footprints and low power draw.

Atmel/Microchip AVR32 is a 32-bit RISC microcontroller architecture with Harvard instruction/data pipelines, targeting deterministic audio signal processing and industrial control applications via the UC3 and AP7 families that integrate DMA, codecs, and peripheral controllers.

The BZ reaction is an oscillating chemical system that produces propagating excitation waves in a thin layer of reagent (typically ferroin or ruthenium catalyst in acidified bromate/malonate). Signals are encoded as wave fronts; the interaction of two colliding wave fragments implements logic at the collision site. Annihilation corresponds to AND; a wave passing through unimpeded corresponds to OR. Adamatzky demonstrated NOT, OR, AND gates in fixed channel geometries. A light-sensitive variant (with ruthenium catalyst) allows gates to be programmed by illumination patterns. A 2024 Nature Communications paper demonstrated a hybrid digital-chemical programmable array. Speed: ~1-10 mm/min wave propagation; seconds to minutes per gate. Capacity: small logic circuits; limited by wave-front geometry and reagent lifetime.

Proposed by Fredkin & Toffoli (1982). Balls travel on paths representing wires; presence/absence of a ball encodes a bit. Collisions at path intersections implement logic gates. Logically and thermodynamically reversible — no information is destroyed. Speed: nanoseconds to microseconds (ball velocity dependent). Capacity: arbitrary boolean circuits (theoretically universal).
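To make the reversibility claim concrete, here is a minimal sketch of a controlled-swap (Fredkin) gate, a standard conservative-logic primitive. The entry does not name a specific gate, so using it here is an illustrative choice, and the function names are hypothetical. The gate is its own inverse and never changes the number of 1s, which corresponds to the number of balls in the billiard picture.

```python
from itertools import product


def fredkin(c, a, b):
    """Controlled swap: if c is 1, swap a and b; otherwise pass them through."""
    return (c, b, a) if c else (c, a, b)


def and_via_fredkin(x, y):
    """AND recovered by feeding a constant 0 into the third input."""
    _, _, out = fredkin(x, y, 0)
    return out


for bits in product((0, 1), repeat=3):
    out = fredkin(*bits)
    # Reversible: applying the gate twice restores the input.
    assert fredkin(*out) == bits
    # Conservative: the count of 1s is preserved.
    assert sum(out) == sum(bits)
```

Because the extra outputs are kept rather than discarded, an irreversible function such as AND is computed without destroying information.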

The human brain contains ~86 billion neurons connected by ~10¹⁵ synapses. Each neuron integrates thousands of synaptic inputs and fires a spike when its membrane potential crosses threshold — a leaky integrate-and-fire operation. Computation is massively parallel, spike-coded, and energy-efficient at ~20 W total. Synaptic weights are plastic: Hebbian learning and spike-timing-dependent plasticity (STDP) modify connection strengths in response to activity, implementing online learning with no separate training phase. The brain solves tasks — scene understanding, language, planning — that remain beyond engineered systems at equivalent energy budgets. Unlike every other entry, the substrate is also the substrate of the observer. Speed: ~100 Hz spike rate per neuron; millisecond reaction times; years of learning. Capacity: ~86 billion neurons; ~10¹⁵ synapses; ~20 W; general-purpose cognition.
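The leaky integrate-and-fire operation mentioned above can be sketched in a few lines. This is a textbook abstraction, not a model of any particular neuron, and all parameter values (time constant, threshold, reset) are dimensionless assumptions chosen for illustration.

```python
def simulate_lif(input_current, dt=1e-3, tau=20e-3,
                 v_rest=0.0, v_thresh=1.0, v_reset=0.0):
    """Euler-integrate dv/dt = (-(v - v_rest) + I) / tau.

    Returns (membrane_trace, spike_times). A spike is recorded and the
    potential reset whenever v reaches v_thresh.
    """
    v = v_rest
    trace, spikes = [], []
    for step, i_in in enumerate(input_current):
        v += dt * (-(v - v_rest) + i_in) / tau  # leak plus input drive
        if v >= v_thresh:
            spikes.append(step * dt)
            v = v_reset
        trace.append(v)
    return trace, spikes


# A drive above threshold produces regular spiking; a weak drive never fires.
_, strong = simulate_lif([2.0] * 1000)
_, weak = simulate_lif([0.5] * 1000)
print(len(strong), len(weak))
```

With a constant input the steady-state potential equals the input value, so only inputs above the threshold can ever trigger a spike in this sketch.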

Identical single photons enter an m-mode linear optical network (beam splitters and phase shifters implementing a unitary U). Detectors at the outputs sample from a distribution whose probabilities are proportional to |Perm(U_S)|², the squared modulus of the permanent of submatrices of U, a quantity believed to be classically intractable to compute. Aaronson & Arkhipov (2011) proved that an efficient classical simulation would collapse the polynomial hierarchy. The device does not solve a user-defined optimization problem; rather, it demonstrates quantum advantage on a specific sampling task. Gaussian boson sampling (GBS) variants use squeezed-light inputs and have been demonstrated at scale (Jiuzhang, 2020). Speed: nanoseconds per sample (photon transit time through chip). Capacity: up to 76 detected photons (Jiuzhang); quantum advantage claimed for n≥50.
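For intuition about the permanent, the sketch below evaluates it straight from the definition (a sum over all permutations), which takes factorial time and is only usable for tiny matrices. The 50:50 beam-splitter matrix is a standard two-mode example and not something taken from the entry; its permanent vanishes, so two identical photons entering separate ports never leave in separate ports.

```python
from itertools import permutations
from math import sqrt


def permanent(matrix):
    """Permanent by direct summation over permutations (O(n! * n) time)."""
    n = len(matrix)
    total = 0
    for perm in permutations(range(n)):
        term = 1
        for row, col in enumerate(perm):
            term *= matrix[row][col]
        total += term
    return total


def coincidence_probability(unitary):
    """Probability of one photon per output mode, for one photon per input mode."""
    return abs(permanent(unitary)) ** 2


h = 1 / sqrt(2)
beam_splitter = [[h, h], [h, -h]]  # illustrative 50:50 splitter

print(permanent([[1, 2], [3, 4]]))  # 10
print(coincidence_probability(beam_splitter))
```

Unlike the determinant, the permanent has no sign alternation, so row reduction does not help, which is one way to see why it is hard to compute.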

BrainScaleS wafer-scale neuromorphic system blends analog wafer-scale integrators with digital control, forming the EBRAINS neuromorphic computing service for fast emulation of spiking neural networks.

The CDC 6600 (1964) paired scoreboard-driven pipelines with dedicated peripheral processors to keep its transistorized core aimed at float-heavy scientific targets, chasing megaflop-class workloads while isolating I/O on a peripheral backplane.

Programmable CEVA DSP cores such as the CEVA-TeakLite and CEVA-XC families power audio codecs, wireless basebands, and modem SoCs.

The Cell Broadband Engine pairs a Power Processing Element (PPE) with multiple Synergistic Processing Elements (SPEs) to support the PlayStation 3 and supercomputers such as IBM Roadrunner, delivering a heterogeneous multi-core fabric for demanding parallel computation.

The Cerebras Wafer-Scale Engine is a wafer-scale AI accelerator with hundreds of thousands of cores interconnected through a dense on-chip fabric, delivering massive compute for large-scale model training on systems like the CS-2.

A network of degenerate optical parametric oscillator (DOPO) pulses circulating in a fiber ring cavity. Each pulse can oscillate in one of two phase states (0 or π), encoding a spin. Measurement-feedback electronics couple the pulses according to the Ising coupling matrix programmed by the user. As the pump power increases past threshold, the network undergoes bifurcation and settles into a low-energy spin configuration. NTT's 2021 system used 100,000 DOPO pulses in a 5-km fiber loop. Unlike classical or quantum annealers, the CIM operates at room temperature and exploits optical coherence rather than thermal or quantum fluctuations. Speed: microseconds per Ising problem instance. Capacity: up to 100,000 spins (NTT 2021); competitive with quantum annealers on dense graphs.
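The quantity such a machine tries to minimize can be written down directly. The sketch below assumes the common sign convention H = -Σ J_ij s_i s_j (the entry does not state one) and finds the ground state by exhaustive search, which is feasible only for a handful of spins.

```python
from itertools import product


def ising_energy(spins, couplings):
    """H = -sum over pairs of J_ij * s_i * s_j, with couplings as {(i, j): J_ij}."""
    return -sum(j * spins[a] * spins[b] for (a, b), j in couplings.items())


def ground_state(n, couplings):
    """Exhaustively search all 2**n spin assignments for the lowest energy."""
    return min(product((-1, 1), repeat=n),
               key=lambda s: ising_energy(s, couplings))


# Hypothetical instance: an antiferromagnetic triangle, which is frustrated
# because no assignment can make all three pairs anti-aligned.
triangle = {(0, 1): -1, (1, 2): -1, (0, 2): -1}
best = ground_state(3, triangle)
print(best, ising_energy(best, triangle))
```

The exhaustive search doubles in cost with every added spin, which is the scaling that physical Ising solvers aim to sidestep heuristically.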

A network of identical oscillators — pendula, LC circuits, or CMOS ring oscillators — coupled to their neighbours by springs or resistive links. The Kuramoto model describes how each oscillator's phase evolves under the pull of its neighbours. When the coupling weights encode a graph's edge weights, the system's stable phase configuration minimizes the same energy function as MAX-CUT: oscillators partition into two phase-locked clusters (0° and 180°) that approximately bisect the graph. Implemented in silicon as oscillator-based Ising machines with up to 1440 CMOS nodes; reported within 99% of optimal MAX-CUT on tested benchmarks. Speed: microseconds to milliseconds (oscillator ring-down time). Capacity: graph problems with hundreds to thousands of nodes.
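A rough numerical sketch of the idea: phases follow gradient descent on E = Σ w_ij cos(θ_i − θ_j), so coupled oscillators repel toward anti-phase, and the final phases are read out as a two-way partition. The step size, initial phases, and the 4-cycle test graph are assumptions for illustration, and the sketch omits any phase-binarizing mechanism that real hardware may use.

```python
import math


def oscillator_maxcut(edges, phases, coupling=1.0, dt=0.05, steps=2000):
    """Euler-integrate d(theta_i)/dt = K * sum_j w_ij * sin(theta_i - theta_j).

    edges: list of (i, j, weight); phases: initial phase per node in radians.
    Returns (sides, cut_weight), with sides measured relative to node 0.
    """
    theta = list(phases)
    n = len(theta)
    for _ in range(steps):
        drive = [0.0] * n
        for i, j, w in edges:
            s = w * math.sin(theta[i] - theta[j])
            drive[i] += s  # pushes node i away from node j
            drive[j] -= s  # equal and opposite push on node j
        theta = [t + dt * coupling * d for t, d in zip(theta, drive)]
    sides = [1 if math.cos(t - theta[0]) >= 0 else -1 for t in theta]
    cut = sum(w for i, j, w in edges if sides[i] != sides[j])
    return sides, cut


# Hypothetical 4-node ring; its best cut separates alternating nodes.
ring = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0)]
sides, cut = oscillator_maxcut(ring, [0.3, 1.9, -0.4, 2.6])
print(sides, cut)
```

On graphs with odd cycles the phases cannot all settle at 0° and 180°, so the readout step becomes a genuine rounding decision rather than a formality.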

1992 64-bit RISC pipeline featuring dual integer issue stages, a four-stage fetch/decode/execute/commit path, and on-chip 8 KB instruction and data caches, positioned for DEC server racks and AlphaStation workstations.

The DEC PDP-11 married an unusually orthogonal instruction set with a shared UNIBUS backplane (and later Q-bus) and microprogrammed control, delivering interactive time-sharing systems that popularized UNIX and RSX operating systems, while its responsive DMA-friendly peripherals and clean design influenced a generation of general-purpose CPUs.

The DEC PDP-8 combines a 12-bit accumulator, low-cost rack packaging, and roots in real-time control applications, embodying the general-purpose minicomputer lineage.

DEC VAX introduced a 32-bit CISC ISA with virtual memory addressing, vectored interrupt handling, and multi-processor cache-coherent CPU/cache modules, enabling general-purpose server workloads across VMS and Unix environments.

Leonard Adleman's 1994 demonstration solved the directed Hamiltonian path problem using DNA strand hybridization. Cities encoded as DNA sequences, flight connections as complementary strands. Massively parallel biochemical search. Speed: hours to days (biochemical reactions). Capacity: combinatorial search problems (limited by DNA synthesis/sequencing).
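The search that the DNA soup performs in parallel can be imitated sequentially: generate every candidate ordering of the vertices, then keep only those whose consecutive steps are real edges. The small graph below is hypothetical and is not Adleman's original instance.

```python
from itertools import permutations


def hamiltonian_paths(vertices, edges, start, end):
    """Return all directed paths from start to end visiting every vertex once."""
    edge_set = set(edges)
    middle = [v for v in vertices if v not in (start, end)]
    found = []
    for order in permutations(middle):
        path = (start, *order, end)
        # Keep the candidate only if each consecutive pair is a directed edge,
        # the analogue of filtering out strands that did not hybridize.
        if all((a, b) in edge_set for a, b in zip(path, path[1:])):
            found.append(path)
    return found


# Hypothetical 4-city example with directed "flights".
cities = [0, 1, 2, 3]
flights = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)]
print(hamiltonian_paths(cities, flights, start=0, end=3))  # [(0, 1, 2, 3)]
```

The candidate count grows factorially with the number of cities, which the DNA approach trades for a correspondingly growing volume of material.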

Single-stranded DNA molecules in solution compute via toehold-mediated strand displacement: a short single-stranded 'toehold' on a partially double-stranded gate complex allows an input strand to invade, displace, and release an output strand. Presence/absence of a strand encodes a bit. Cascades of these reactions implement AND, OR, NOT, NAND, NOR, XOR, and threshold gates without enzymes or moving parts. Qian and Winfree (2011) demonstrated a four-bit square-root circuit from 130 DNA strands; a subsequent paper (Nature, 2011) realized a four-neuron Hopfield associative memory entirely in DNA solution. Speed: minutes to hours per logic operation (hybridization kinetics). Capacity: ~100-gate circuits demonstrated; massively parallel (each molecule is a gate).

Built by Vannevar Bush and Harold Hazen at MIT in 1928–1931, the differential analyzer is a general-purpose analog ODE solver. The core component is a wheel-and-disk integrator: a disk rotates at rate proportional to one variable; a wheel resting on the disk at a radial position proportional to a second variable rotates at their product — implementing ∫ y dx mechanically. Multiple integrators are chained via shafts and differential gears to represent higher-order ODEs. A torque amplifier (Bush's key innovation) prevents the tiny friction coupling from loading the computation. The MIT machine solved sixth-order ODEs; later machines solved 18th-order equations. The device is the missing link between the planimeter (single integral) and the fire-control computer (hardwired ODE). Speed: minutes per ODE solution (shaft rotation time). Capacity: up to 18th-order ODEs (later machines); ~3 significant figures.
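Chaining integrators can be mimicked numerically. The sketch below wires two software "integrators" in a loop to solve y'' = -y, an equation chosen purely for illustration, with each accumulation standing in for one wheel-and-disk unit. The step size and function name are assumptions.

```python
import math


def chained_integrators(y0, v0, dx, steps):
    """Solve y'' = -y by feeding each integrator's output into the other.

    Integrator 1 accumulates v = integral of (-y) dx.
    Integrator 2 accumulates y = integral of v dx.
    """
    y, v = y0, v0
    for _ in range(steps):
        v += -y * dx  # first integrator: input is -y
        y += v * dx   # second integrator: input is the updated v
    return y, v


dx = 1e-3
steps = round(math.pi / dx)  # integrate over roughly half a period
y_end, v_end = chained_integrators(1.0, 0.0, dx, steps)
print(y_end, math.cos(steps * dx))
```

A higher-order equation needs one integrator per order, which is why the order a machine could solve was tied to how many integrators it had.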

A stack of passive, 3D-printed diffraction layers implements a trained neural network entirely in the optical domain. Each layer is a mask with pixel-wise phase or amplitude modulation, trained offline with backpropagation through a differentiable wave-optics model. During inference, light propagates through the layers via diffraction — no active computation occurs. The network function is encoded in the geometry of the passive masks. Lin et al. (2018, Science) demonstrated handwritten-digit classification at terahertz frequencies with 91.75% accuracy. Inference runs at the speed of light with zero dynamic energy consumption beyond the input illumination. Speed: picoseconds (optical propagation through ~cm of layers). Capacity: image classification at THz; scales with aperture area and layer count.

~800,000 human iPSC-derived or mouse cortical neurons are plated onto a high-density multi-electrode array (HD-MEA). The DishBrain system (Kagan et al., 2022, Neuron) embeds the culture in a simulated game of Pong: electrode stimulation encodes ball position and side; the recorded neural firing pattern drives paddle movement. Motivated by the free-energy principle — cells prefer predictable stimulation over white noise — the culture learns to rally the ball within five minutes of real-time play. No explicit training algorithm runs; the biology self-organizes. The substrate is neurons-in-a-dish, making this the only entry where the substrate is alive and may be sentient. Speed: minutes to learn; milliseconds per action (neural firing rate). Capacity: closed-loop sensorimotor tasks; ~800,000 neurons, ~22,000 electrodes.

Standing dominoes propagate a falling signal. Fan-outs split signals, and careful geometry implements AND and OR gates. Signal is one-shot — must reset by standing dominoes again. Speed: ~1 domino per second propagation (~10-50 seconds total). Capacity: single boolean expression evaluation (one-shot).

Espressif's 32-bit RISC-V wireless MCU with integrated Wi-Fi and BLE connectivity, targeted at IoT deployments.

Espressif's more capable MCU pairing 40nm RISC-V vector extensions with Wi-Fi 6 and BLE 5.3 to harden compute at the edge for secure IoT and smart sensing.

Balls dropped through a triangular array of pegs deflect left or right at each level. The distribution of balls in the output bins converges to a Gaussian as N→∞. Each peg is an independent Bernoulli trial. Speed: minutes to hours (depending on ball count). Capacity: statistical sampling (scales with number of balls).
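A short simulation of the board: every ball makes one left/right choice per peg row, and its bin index is the number of rightward deflections. The fair 50/50 split, the seed, and the board size are assumptions made for the sketch.

```python
import random
from collections import Counter


def galton_board(levels, balls, seed=0):
    """Drop `balls` through `levels` rows of pegs; return counts per bin."""
    rng = random.Random(seed)
    bins = Counter()
    for _ in range(balls):
        # Each peg is an independent fair Bernoulli trial.
        rights = sum(rng.random() < 0.5 for _ in range(levels))
        bins[rights] += 1
    return [bins[k] for k in range(levels + 1)]


counts = galton_board(levels=10, balls=10_000)
for k, c in enumerate(counts):
    print(f"{k:2d} {'#' * (c // 50)}")
```

The counts follow a binomial distribution, which approaches the Gaussian bell shape as the number of rows grows.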

A register of qubits — typically superconducting transmons cooled to ~10 mK — whose state is manipulated by sequences of microwave pulses implementing one- and two-qubit unitary gates. Any computation is a product of these gates, forming a universal gate set. Superposition lets a qubit represent 0 and 1 simultaneously; entanglement correlates qubits non-classically; interference is used to amplify correct answers and cancel wrong ones. Shor's algorithm factors n-bit integers in O(n³) gate operations vs. exponential classically; Grover's algorithm searches an unsorted list in O(√N). Current NISQ (noisy intermediate-scale quantum) devices have 100–1000 physical qubits with limited coherence; fault-tolerant quantum computing requires ~1000 physical qubits per logical qubit. Google's 2019 Sycamore experiment claimed quantum supremacy on a sampling task in 200 seconds vs. ~10,000 years classically. Speed: nanosecond gate times; microseconds coherence (NISQ era). Capacity: 53–1121 physical qubits (current hardware); fault-tolerant QC requires orders of magnitude more.
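Superposition and entanglement can be shown with a tiny state-vector simulation. The sketch below applies a Hadamard gate and then a CNOT to two qubits that start in |00>, producing a Bell state. It is a plain-Python illustration with hypothetical function names and assumes qubit 0 is the least significant bit of the basis index.

```python
import math


def apply_hadamard(state, target):
    """Apply H to `target`: |0> -> (|0>+|1>)/sqrt(2), |1> -> (|0>-|1>)/sqrt(2)."""
    h = 1 / math.sqrt(2)
    new = [0j] * len(state)
    for idx, amp in enumerate(state):
        bit = (idx >> target) & 1
        base = idx & ~(1 << target)
        new[base] += h * amp
        new[base | (1 << target)] += (-h if bit else h) * amp
    return new


def apply_cnot(state, control, target):
    """Flip `target` on every basis state where `control` is 1."""
    new = [0j] * len(state)
    for idx, amp in enumerate(state):
        if (idx >> control) & 1:
            new[idx ^ (1 << target)] += amp
        else:
            new[idx] += amp
    return new


def bell_state():
    state = [1 + 0j, 0j, 0j, 0j]  # |00>
    state = apply_hadamard(state, target=0)
    return apply_cnot(state, control=0, target=1)


probabilities = [abs(a) ** 2 for a in bell_state()]
print(probabilities)
```

Only the outcomes 00 and 11 carry probability, so measuring one qubit fixes the other. The state vector has 2^n entries, which is why this kind of simulation stops scaling long before real hardware does.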

GigaDevice's GD32VF103 family (e.g., the GD32VF103C8T6) couples a 40nm RV32 core with DSP accelerators and single-cycle MAC units, delivering deterministic real-time motor, sensor, and industrial control loops with fast ADC-to-PWM pipelines.

A 53-qubit superconducting transmon processor built by Google AI Quantum that executed a random quantum circuit sampling task in 2019 to demonstrate quantum supremacy, providing empirical evidence of a computation outside the reach of classical supercomputers at the time.

Google's first TPU, announced in 2016, ties a large 256×256 systolic array built for dense matrix multiplies to local weight memory so inference workloads across Google data centers run deterministically with predictable throughput and latency.
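To show what a systolic array does in general terms, the toy model below steps through clock cycles of an output-stationary grid, where the processing element at (i, j) sees operand pair k on cycle i + j + k because inputs are skewed as they flow in. This is a simplified teaching model under those assumptions and does not reproduce the dataflow of the actual TPU.

```python
def systolic_matmul(a, b):
    """Multiply a (n x m) by b (m x p) with a cycle-by-cycle systolic schedule."""
    n, m, p = len(a), len(b), len(b[0])
    acc = [[0] * p for _ in range(n)]
    # The last operand pair reaches PE (n-1, p-1) on cycle n + p + m - 3.
    for cycle in range(n + p + m - 2):
        for i in range(n):
            for j in range(p):
                k = cycle - i - j  # which operand pair arrives this cycle
                if 0 <= k < m:
                    acc[i][j] += a[i][k] * b[k][j]  # one multiply-accumulate
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Every element does at most one multiply-accumulate per cycle and only talks to its neighbours, which is what makes the timing predictable.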

Google's second-generation TPU v2 is a datacenter-scale AI accelerator built around large systolic arrays, high-bandwidth memory, and bfloat16 matrix units, forming Cloud TPU v2 pods to deliver high-throughput training and inference for deep learning workloads.

Third-generation Google TPU pairs bfloat16 matrix multiply arrays (with float32 accumulation) with HBM2. Cloud TPU v3 pods contain up to 1,024 chips, four times as many as the previous generation, delivering massive training and inference acceleration.

Google TPU v4 is a pod-scale accelerator from Google that deterministically realizes dense linear algebra and transformer attention via custom systolic arrays. Each TPU v4 chip pairs stacked HBM2 with pod interconnect routers and liquid cooling to sustain the throughput demanded by Cloud TPU v4 pods, which stitch thousands of chips across the pod interconnect fabric for multi-petaflop training.

Google's fifth-generation TPU (v5) is a datacenter AI accelerator optimized for massive matrix multiplies; each chip exposes more matrix units than v4, and when assembled into TPU v5 pods it delivers higher TFLOPS along with pod-scale interconnects that sustain large language model training and inference.

Graphcore's Intelligence Processing Unit (IPU) is a massively parallel AI accelerator composed of thousands of SRAM-backed tile cores linked by an exchange-style interconnect, enabling sparse tensor graph processing workloads in IPU-POD16 and IPU-M2000 systems.

Groq Tensor Streaming Processor delivers deterministic single-cycle tensor execution within Groq hardware so ML inference workloads observe predictable latency in massive pipelined flows.

32/64-bit RISC with in-order pipeline and dual-issue execution, used in HP 9000 servers and workstations.

A chain suspended from two fixed points and left to hang under gravity settles into a curve that exactly realizes the hyperbolic cosine. Gaudí used physical catenaries (inverted) to design the arches of the Sagrada Família. Speed: instantaneous (static equilibrium). Capacity: single function evaluation (hyperbolic cosine).
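The curve in question is y = a·cosh(x/a), where a sets how slack the chain is. The sketch evaluates the height, the sag over a span, and the arc length a·sinh(x/a). The parameter values are arbitrary illustrations.

```python
import math


def catenary_height(x, a):
    """Height of an ideal hanging chain at horizontal offset x from its lowest point."""
    return a * math.cosh(x / a)


def catenary_arc_length(x, a):
    """Chain length from the lowest point out to horizontal offset x."""
    return a * math.sinh(x / a)


def sag(half_span, a):
    """Vertical drop between the supports and the lowest point."""
    return catenary_height(half_span, a) - catenary_height(0.0, a)


a = 2.0  # illustrative shape parameter
print(sag(1.5, a), 2 * catenary_arc_length(1.5, a))
```

Flipping the sign of the height gives the inverted form, in which an arch carries its own weight in pure compression.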

1,121-qubit superconducting processor from the IBM Quantum roadmap (unveiled in 2023), extending its gate-based quantum systems.

IBM Eagle superconducting quantum processor (127 transmon qubits) supports gate-based quantum circuits research and Qiskit experimentation toward fault-tolerant architectures.

IBM's Osprey is a 433-qubit superconducting heavy-hexagon processor in IBM Quantum System Two, engineered for Qiskit access and Qiskit Runtime workflows to run unitary circuits across hundreds of qubits via microwave control pulses.

1990s IBM POWER1 architecture: a general-purpose RISC design with superscalar execution and an expanded register file that powered RS/6000 systems.

The IBM POWER10 (2021) delivers high throughput compute with deep SMT and a focus on AI acceleration, optimizing matrix math and inference for demanding enterprise workloads.

IBM POWER7 (2010) introduces 4-way simultaneous multithreading across up to eight cores per chip, high-throughput virtualization, and energy-efficient design targeted to IBM Power Systems.

IBM System/360 unified IBM's commercial mainframe line with a single instruction set architecture, establishing upward compatibility and shaping enterprise computing for decades.

IBM TrueNorth is a 28 nm CMOS neurosynaptic chip with one million programmable spiking neurons and 256 million configurable synapses. It realizes massively parallel, event-driven computation with asynchronous low-power inter-core communication, enabling pattern recognition and sensor fusion workloads inspired by biology.

IBM Research analog AI chip uses memristive crossbar arrays with PCM elements to implement analog differential compute for neural inference, tightly integrating in-memory multiply-accumulate operations for ultra-low-power AI workloads.

Tile-based PowerVR SGX/Rogue GPUs deployed in Apple iPhone/iPad series, featuring deferred rendering pipelines for efficient mobile graphics and AR experiences.

The Intel 80286 advanced the x86 lineage with 16-bit protected mode, exposing segment-based memory protection and a 16 MB address space beyond the 8086/8088 era and pushing into early departmental servers; example: Compaq Deskpro 286 and IBM PC/AT machines running Novell NetWare relied on the new mode to host shared files and directories with segmentation-based protection.

Intel's 80386 microprocessor introduced 32-bit protected mode with paging and hardware multitasking support, forming the foundation for modern OS virtualization and advanced multitasking environments.

The Intel 80486 fused an on-chip floating-point unit, a five-stage pipelined datapath, and an 8 KB on-chip L1 cache into one scalar CMOS microprocessor, delivering deterministic x86 throughput for desktop applications while reducing bus contention compared to the 386.

8-bit microcontroller featuring Harvard architecture with separate code and data spaces, integrated timers, serial UART, and parallel I/O, widely deployed in embedded appliances for deterministic control loops.

Intel's first 16-bit CISC CPU, whose 8088 variant powered the early IBM PC and compatible machines, notable for its segmented memory model that bridged 16-bit processing with a 20-bit address space.
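The segmented model combines two 16-bit values into one 20-bit address: the segment is shifted left by four bits and the offset is added. The helper name below is hypothetical, and the mask models the wraparound at 1 MB on the original 20-bit bus.

```python
ADDRESS_MASK = 0xFFFFF  # 20 address lines, so 1 MB of address space


def physical_address(segment, offset):
    """Real-mode address formation: (segment * 16 + offset) modulo 2**20."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    return ((segment << 4) + offset) & ADDRESS_MASK


print(hex(physical_address(0xF000, 0xFFF0)))  # 0xffff0
print(hex(physical_address(0x1234, 0x0005)))  # 0x12345
print(hex(physical_address(0x1000, 0x2345)))  # 0x12345 (same byte, different pair)
print(hex(physical_address(0xFFFF, 0x0010)))  # 0x0 (wraps past 1 MB)
```

Because segments overlap every 16 bytes, many segment:offset pairs name the same physical byte, as the second and third calls show.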

Hybrid x86 microarchitecture pairing Golden Cove performance cores with Gracemont efficiency cores, guided by Thread Director for workload steering and offering DDR5 plus PCIe 5.0 support, as seen in systems like the Intel NUC 12 Extreme.

Intel Core (Yonah) dual-core mobile microarchitecture introduced for 2006 laptop platforms; features paired Yonah cores with shared L2 cache and power-efficiency enhancements. It remained a 32-bit design, with Intel 64 support arriving in the following Core 2 (Merom) generation.

Haswell's 2013 microarchitecture pairs aggressive out-of-order cores, AVX2 vector extensions, fine-grained power gating, and Gen7.5 integrated graphics to drive responsive performance in desktops and laptops.

Intel's Itanium (IA-64) combined a 64-bit EPIC/VLIW instruction set with compiler-managed parallelism, predication, and speculation to target enterprise and mission-critical workloads, primarily deployed in HP Integrity servers.

Intel Loihi 1 is an asynchronous digital neuromorphic research chip with 128 programmable cores connected by a packet-switched mesh, simulating roughly 130k neurons and 130M synapses per chip for robotics, vision, and sensor-fusion workloads with tens-of-microseconds spike latency and picojoule-scale synaptic updates.

Intel's second-generation neuromorphic research chip implements asynchronous event-driven spiking neural networks with tightly coupled memory and compute plus sparse programmable synapses for adaptive, low-power AI. Example: the Sandia National Laboratories Hala Point system deploys 1,152 Loihi 2 processors to model 1.15 billion neurons and 128 billion synapses while running continuous-learning workloads at better than 15 TOPS/W.

General-purpose x86 CPU microarchitecture that brings the memory controller on-die, couples cores with QuickPath, and uses Turbo Boost to lift throughput and power efficiency.

The Intel P6 (Pentium Pro) CPU introduced a deeply pipelined out-of-order superscalar core with an on-package L2 cache to accelerate enterprise workloads; example: Pentium Pro 200 MHz powering mid-1990s servers.

The Intel Pentium (P5) was Intel's first superscalar x86 CPU, adding dual integer pipelines and dynamic branch prediction over the 486 to deliver significantly higher general-purpose performance.

The Intel Pentium 4 (NetBurst) combines a very long pipeline with the first mainstream Hyper-Threading implementation to chase high clock rates across desktop and server markets; example: 3.06 GHz Northwood chips, the first desktop parts with Hyper-Threading enabled.

Intel's Sandy Bridge microarchitecture integrated its CPU cores and Intel HD Graphics on a single die, adding an improved branch predictor and AVX support to deliver x86-64 compute for mainstream PCs and laptops.

Intel Skylake is a 14nm FinFET microarchitecture featuring a micro-op cache, improved branch prediction, Gen9 graphics, and balanced desktop and laptop deployment.

Intel Xe Low Power (Xe-LP) GPU powers Tiger Lake integrated graphics and the discrete Iris Xe MAX (DG1), offering up to 96 execution units (768 vector ALUs) and dedicated media engines, including AV1 decode and HEVC encode and decode acceleration, for thin-and-light laptops.

Ponte Vecchio GPUs combine HBM2e stacks, Xe-HPC cores with wide vector and matrix engines, and a multi-tile package that mixes Intel 7 base tiles with TSMC-fabricated compute tiles, coordinating thousands of SIMT lanes per tile through a scalable fabric designed for large-scale scientific and AI workloads.

The Intel Xe-HPG family packages discrete Arc Alchemist GPUs to deliver hardware ray tracing, advanced media encode/decode, and AI acceleration for consumer gaming and creative workloads.

Upcoming Xe2 architecture is positioned as a next-gen tile-based GPU platform for discrete and data center workloads, extending Intel Xe with larger tiles and AI-ready matrix engines.

Trapped ion quantum computer hosted in vacuum chambers with photonic interconnects for modular entanglement, delivered over the cloud via IonQ Harmony and Aria.

Designed by Lord Kelvin (William Thomson) in 1872–73, this special-purpose mechanical analog computer performs real-time Fourier synthesis. Each tidal harmonic constituent (M2, S2, N2 …) is represented by a pulley on a crank whose radius sets the amplitude and whose rotation rate is geared to the constituent's period. A single wire threads over all pulleys in series; as a hand-crank advances time, the wire's endpoint traces the sum of all cosines, drawing the predicted tide curve on a paper roll. Kelvin's final version summed 24 harmonic components and could predict a full year of tides in about four hours. Variants were built for the US, India, and other nations and remained in operational use through World War II. Speed: a full year of tidal predictions in ~4 hours of cranking. Capacity: up to 40 harmonic components (later US machines); continuous output.
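
The summation the wire performs can be sketched in a few lines of Python. This is only an illustration: the two periods are the standard M2 and S2 values, while the amplitudes, phases, and the names `CONSTITUENTS` and `tide_height` are hypothetical.

```python
import math

# Hypothetical constituents: (amplitude in metres, period in hours, phase in radians).
# The periods are the usual M2 and S2 values; amplitudes and phases are made up.
CONSTITUENTS = [
    (1.20, 12.4206, 0.0),  # M2, principal lunar semidiurnal
    (0.40, 12.0000, 0.5),  # S2, principal solar semidiurnal
]

def tide_height(t_hours, constituents=CONSTITUENTS, mean_level=0.0):
    """Sum one cosine per constituent, as the wire sums the pulley displacements."""
    return mean_level + sum(
        amp * math.cos(2 * math.pi * t_hours / period - phase)
        for amp, period, phase in constituents
    )

# "Cranking" through two days at one-hour steps.
heights = [tide_height(t) for t in range(48)]
```

Adding a constituent is just another tuple in the list, the software counterpart of adding one more pulley to the machine.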

A fully mechanical computer built from LEGO Technic with no electronics. Binary memory is stored as lever positions on a rotating drum (rod logic); a read/write head flips levers to write bits and senses them pneumatically on readback. A joystick translates direction inputs into pneumatic signals that pass through a mechanical filter preventing illegal moves, then drive a 16×16 push-rod display. Demonstrated running the game Snake entirely in hardware. Speed: ~1 Hz game-tick (limited by pneumatic signal propagation through tubing). Capacity: 16×16 display state + snake tail buffer (tens of bits of working memory).

Lightmatter Passage is a photonic interconnect that routes data between chips over programmable waveguides; it complements Lightmatter's Envise accelerator, which performs light-based matrix multiplies for AI inference.

Liquid marbles are millimetre-scale droplets coated with hydrophobic powder that makes them roll freely without wetting surfaces. Computation is collision-based: two marbles directed at an intersection merge if their relative speed exceeds ~0.29 m/s (AND = 1, carry output) and rebound below that threshold (AND = 0, separate outputs). The three output trajectories encode AND and XOR simultaneously, forming a half-adder in a single interaction gate. By controlling routing channels and gate geometry, all classical gates (AND, OR, NOT, NAND, NOR, XOR) and the reversible Toffoli and Fredkin gates can be constructed. The Fredkin gate conserves marble count — no information is destroyed — making this a physical substrate for reversible and potentially thermodynamically efficient computing. Speed: ~0.1–1 s per gate (marble travel time at cm/s speeds). Capacity: gate-level; multi-cycle datapath demonstrated in simulation.
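
The conservation property of the Fredkin gate is easy to check in software. The sketch below models marbles as 0/1 values and ignores all collision physics; `fredkin` and `and_gate` are illustrative names, not part of any published model.

```python
from itertools import product

def fredkin(c, a, b):
    """Controlled swap: if control c is 1, outputs a and b trade places."""
    return (c, b, a) if c else (c, a, b)

def and_gate(x, y):
    # With the third input held empty (0), the third output is x AND y.
    return fredkin(x, y, 0)[2]

for bits in product((0, 1), repeat=3):
    out = fredkin(*bits)
    assert sum(out) == sum(bits)   # marble count is conserved
    assert fredkin(*out) == bits   # applying the gate twice undoes it
```

Because the gate is its own inverse, no input pattern is ever lost, which is the sense in which the marble version destroys no information.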

Luminous Computing centers on photonic logic for AI, building coherent-light neural accelerators orchestrated via optical dataflow.

A microfabricated proof mass (typically silicon, ~1 μg) suspended by folded-beam springs. Under acceleration, the mass displaces by x = ma/k (Hooke's law + Newton's second law in equilibrium). Displacement is read by capacitive sensing: the mass carries interdigitated comb fingers whose capacitance changes by ΔC ∝ x ∝ a. The device is a physical analog computer that continuously scales the inertial force by the inverse spring constant, realizing x = ma/k at the hardware level without arithmetic. MEMS gyroscopes extend this to Coriolis-effect angular-rate sensing, and IMUs combine three-axis accelerometers and gyroscopes to integrate trajectory in 3D. Found in every smartphone, airbag controller, and inertial navigation unit. Speed: continuous real-time output (bandwidth typically 1 Hz – 10 kHz). Capacity: single scalar (or 3-axis) acceleration; sub-micro-g resolution in precision variants.
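
A minimal numerical sketch of the readout chain, assuming a linear capacitance-to-displacement relation; the mass, spring constant, and sensitivity values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
G = 9.80665  # standard gravity, m/s^2

def displacement(accel, mass, k):
    """Static deflection of the proof mass: x = m * a / k."""
    return mass * accel / k

def accel_from_delta_c(delta_c, sensitivity, mass, k):
    """Invert the readout, assuming delta_c = sensitivity * x (linear, small deflection)."""
    x = delta_c / sensitivity
    return k * x / mass

# Hypothetical device: 1 microgram proof mass (1e-9 kg), 1 N/m spring.
x_at_1g = displacement(G, mass=1e-9, k=1.0)  # deflection in metres at 1 g
a_est = accel_from_delta_c(2e-15, sensitivity=1e-6, mass=1e-9, k=1.0)
```

With these numbers the deflection at 1 g is under 10 nm, which suggests why capacitive sensing with many comb fingers is needed to resolve it.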

Introduced in 1985 with a five-stage RISC pipeline, the MIPS R2000 leaned on single-cycle integer basics, a concise load/store ISA, and predictable control flow to keep each stage decoding, executing, and retiring in lockstep, which made it attractive to early SGI and DEC workstation vendors.

The MIPS R3000 builds on the 32-bit R2000 with higher clock rates, larger caches, and expanded coprocessor support, making it the go-to processor for workstations like the SGI Indigo and consoles such as the Sony PlayStation.

MIT Tagged-Token Dataflow Architecture pairs high-performance scheduling with tagged token contexts that encode activation frames, letting distributed execution units match tokens, dispatch operands, and fire instructions for Id programs.

Built by Bill Phillips (1949). Water flows through tanks and pipes representing economic sectors — income, consumption, taxation, investment. Flow rates encode economic quantities. The system settles into equilibrium representing GDP balance. 14 machines were built. Speed: minutes to hours (hydraulic equilibration). Capacity: ~10-20 economic variables (limited by physical plumbing).

The Manchester Dataflow Machine concept of the 1970s emphasized token-based dataflow execution with tokens flowing through FIFO routers and firing operations out of order as soon as operands arrived, exposing fine-grained dataflow computation across processors.

Gravity-fed marble runs with rocker/seesaw gates implement binary arithmetic and logic operations. One marble = 1 bit. The rocker flips state on each pass, implementing half-adders and logic gates. The Digi-Comp II (1965) is the canonical plastic educational design, while K'NEX construction sets allow modular prototyping of custom layouts. Speed: ~1-10 seconds per operation (marble transit time). Capacity: 3-8 bit operations (modular, expandable).

Electromechanical analog computers installed on WWII-era warships (e.g. the US Navy Mark 1) continuously computed the correct bearing and elevation for naval guns from up to 25 live inputs: target range, target bearing, own-ship speed and course, wind speed, shell muzzle velocity, and more. Seven classes of mechanism — shafts, gears, cams, differentials, component solvers, integrators, and multipliers — were combined to solve the fire-control problem in real time. Speed: continuous real-time (output updated as fast as inputs change). Capacity: ~25 input variables → 2 output variables (bearing, elevation).

A spinning rotor mounted in gimbals conserves angular momentum. Any external torque causes precession perpendicular to both the spin axis and the applied torque — rather than tilting directly. By reading gimbal angles, the device outputs the accumulated rotation of the platform relative to inertial space. It is a physical integrator: angular velocity in → angle out, with no arithmetic required. Inertial navigation systems chain three orthogonal gyroscopes with three accelerometers; double-integrating the accelerometer outputs (in the gyroscope-maintained inertial frame) gives position. Mechanical gyros guided Apollo missions and ICBM warheads; they have largely been replaced by MEMS and ring-laser gyroscopes but remain the conceptual anchor of inertial navigation. Speed: continuous real-time (spin-up time seconds to minutes). Capacity: 3-axis orientation; drift accumulates over time (arcseconds per hour in precision instruments).
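
The integration steps described above can be sketched numerically in one dimension. This is a simplified illustration with made-up constant inputs; a real inertial navigation system works in three axes and must rotate accelerations into the inertial frame first.

```python
def integrate(samples, dt, initial=0.0):
    """Cumulative trapezoidal integral of evenly spaced samples."""
    out = [initial]
    for prev, cur in zip(samples, samples[1:]):
        out.append(out[-1] + 0.5 * (prev + cur) * dt)
    return out

dt = 0.01                      # 100 Hz sampling, 1001 samples = 10 s

rate = [0.1] * 1001            # constant 0.1 rad/s
angle = integrate(rate, dt)    # angular velocity in, angle out

accel = [2.0] * 1001           # constant 2 m/s^2
velocity = integrate(accel, dt)
position = integrate(velocity, dt)  # double integration gives position
```

Any constant bias added to `rate` or `accel` grows linearly in angle and quadratically in position, which is the drift the paragraph mentions.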

Memristive circuits implementing Hopfield network topology where the intrinsic nonlinearity of memristors creates transient chaotic annealing processes. The chaotic dynamics enable escape from local minima for solving optimization problems like Max-Cut and continuous function optimization.

Crossbar arrays of memristors (memory resistors) perform matrix-vector operations in analog. Voltages are applied to rows and currents are collected from columns, so each column current is the conductance-weighted sum of the row voltages (Ohm's and Kirchhoff's laws). Conductance values encode matrix elements. Enables in-memory computing for neural network inference. Speed: nanoseconds (electrical propagation). Capacity: large matrix operations (scales with crossbar size).
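
As a rough sketch, the crossbar's multiply-accumulate can be written out explicitly by treating each cell's conductance (the reciprocal of its resistance) as a matrix element. The function name and the numeric values are illustrative, and the model ignores wire resistance and sneak paths.

```python
def crossbar_mvm(conductance, row_voltages):
    """conductance[i][j] (siemens) links row i to column j; returns column currents (amperes)."""
    n_rows = len(row_voltages)
    n_cols = len(conductance[0])
    return [
        sum(conductance[i][j] * row_voltages[i] for i in range(n_rows))
        for j in range(n_cols)
    ]

G = [[1e-3, 2e-3],
     [3e-3, 4e-3]]
V = [0.5, 1.0]
I = crossbar_mvm(G, V)  # I[j] = sum over i of G[i][j] * V[i]
```

In hardware every term of every sum is produced at once by the physics, whereas the loop here visits each cell in turn.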

Tiny 8-bit microcontroller (PIC10F series) for cost-sensitive control loops.

Microchip PIC12 (PIC12F series) 8-bit CMOS microcontrollers, using 12- or 14-bit instruction words, in tiny 8-pin packages, supporting direct LED and sensor control on compact embedded boards.

The 8-bit PIC16 family combines a Harvard architecture with a pipelined instruction path, making it a staple of hobbyist and professional controllers used in automation devices and embedded teaching rigs.

The PIC18 family pairs an 8-bit enhanced core, pipelined execution, and extended instruction set with rich peripherals, giving predictable instruction timing suited to robotics and instrumentation workflows requiring tight control loops.

Microchip's PIC24 line is a 16-bit mid-range microcontroller family used in motor control; it shares its core with the dsPIC digital signal controllers, which add DSP extensions alongside the PWM, ADC, and sensor-feedback peripherals.

Microchip PIC32 is a 32-bit MIPS-based microcontroller line that equips advanced embedded systems with DMA-driven peripherals, caches, and large flash to anchor automation, connectivity, and real-time instrumentation workloads.

Mill's architecture merges a belt machine register model with VLIW-style wide-issue and deeply pipelined stages to pursue high efficiency and sustained throughput, relying on a compiler-centric workflow to schedule operations on the belt.

NVIDIA's Ampere family (GA100, GA102) is built around third-generation tensor cores, with GA102 adding second-generation ray tracing cores; it powers the A100 data-center accelerator (GA100, TSMC 7 nm) and the GeForce RTX 3090 consumer card (GA102, Samsung 8 nm).

Upcoming NVIDIA Blackwell architecture is tuned for inference, pairing DPX and Tensor cores in the Grace Blackwell superchip line to accelerate dense matrix and sparse transformer workloads.

NVIDIA Fermi GF100 architecture introduced compute capability 2.x with ECC-protected GDDR5, hardware thread scheduling, and large register files; Tesla C2050 HPC accelerators and GeForce GTX 480 gaming cards both deploy this silicon to accelerate dense linear algebra, physics simulations, and shader pipelines.

The Hopper family (H100) is NVIDIA's GPU architecture for large-scale transformer training, pairing a new Transformer Engine with CUDA/SIMT cores and tensor cores on TSMC's 4N (5nm-class) FinFET node; HGX H100 systems tie multiple Hopper GPUs via NVLink to deliver high throughput for massive AI workloads.

Kepler GK110/GK210 derivatives underpin the GeForce GTX 780 (GK110) and Tesla K80 (GK210), delivering CUDA compute services with large SMX arrays tuned for both HPC and graphics tasks.

NVIDIA Maxwell (GM204/GM200) architecture drives energy-efficient graphics and compute, powering GeForce GTX 980 and Tesla M40 with improved power efficiency and mixed-precision throughput.

NVIDIA Pascal GPUs span the GP100 die, which pairs HBM2 high-bandwidth memory with compute-intensive blocks in Tesla P100 accelerators, and the GP104 die, which uses GDDR5X in GeForce GTX 1080 cards, covering HPC and graphics workloads.

NVIDIA Tesla GPU compute cards deliver massively parallel floating-point and tensor acceleration for HPC, AI training, and inference, leveraging NVLink, HBM memory, and CUDA programming to pack thousands of CUDA cores and Tensor cores behind blower-style cooling.

NVIDIA Turing is a GPU microarchitecture that uses dedicated real-time ray tracing RT cores and Tensor cores for deep learning while powering GeForce RTX 2080, Tesla T4, and similar boards.

NVIDIA Volta GPU architecture built on the GV100 die with tensor cores, HBM2 memory, and NVLink, powering Tesla V100 accelerators and DGX-1 systems.

Silicon chips that mimic neural computation using spiking neurons and synaptic connections. Intel Loihi and IBM TrueNorth implement event-driven, asynchronous processing, and Loihi adds on-chip learning capabilities. Speed: microseconds (spike propagation). Capacity: millions of neurons (parallel event-driven processing).

Normal Computing's stochastic processing units leverage probabilistic analog circuits with thermodynamic noise shaping and memristive elements to accelerate AI inference workloads while embracing physical stochasticity.

Operational amplifiers configured as integrators, adders, and multipliers solve differential equations in real-time. Voltages represent variables, circuit topology encodes the equation structure. Classical electronic analog computation. Speed: real-time (microseconds to seconds). Capacity: systems of ODEs (~10-100 variables typical).

A 4f lens system consists of two lenses separated by twice their focal length with a holographic or spatial-light-modulator (SLM) filter at the shared Fourier plane. The first lens computes the Fourier transform of the input image; the filter multiplies by the complex conjugate of the reference pattern's Fourier transform; the second lens inverse-Fourier-transforms the product, yielding the cross-correlation at the output plane. This implements a matched filter — the canonical operation for detecting a known pattern in a cluttered scene — in a single optical pass at the speed of light, regardless of image size. The system realizes the convolution theorem physically: FT(f⋆g) = F*·G. Used in optical character recognition, fingerprint identification, and radar pulse compression. Speed: picoseconds to nanoseconds (optical propagation through ~cm path). Capacity: full 2D cross-correlation of megapixel images in a single pass; filter change requires SLM reprogramming.
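
The three optical stages map directly onto a discrete Fourier transform pipeline. The following is a one-dimensional sketch with a naive DFT, assuming circular (wrap-around) correlation and a made-up eight-sample pattern; it illustrates the mathematics, not the optics.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out

def cross_correlate(reference, scene):
    F = dft(reference)                                   # Fourier transform of the reference
    G = dft(scene)                                       # first lens: transform of the scene
    product = [f.conjugate() * g for f, g in zip(F, G)]  # matched filter: multiply by F*
    return [v.real for v in dft(product, inverse=True)]  # second lens: back to the output plane

reference = [1, 2, 1, 0, 0, 0, 0, 0]
scene = [0, 0, 0, 1, 2, 1, 0, 0]   # the same pattern shifted right by 3 samples
corr = cross_correlate(reference, scene)
shift = max(range(len(corr)), key=lambda k: corr[k])  # location of the correlation peak
```

The peak index recovers where the pattern sits in the scene, which is the detection result the output plane of the optical system displays as a bright spot.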

Part of IBM's POWER lineage of general-purpose CPUs, POWER2 couples superscalar, multi-issue pipelines and large caches in a multi-chip module to deliver server-class compute for IBM RS/6000 deployments.

A 64-bit out-of-order PowerPC architecture with multiple integer and floating-point execution units and high floating-point throughput, deployed across IBM RS/6000 and pSeries servers.

The dual-core POWER4 (2001) is a high-throughput server processor that underpins IBM pSeries 690 and eServer p655 clusters, delivering 64-bit POWER general-purpose CPU throughput for enterprise compute workloads.

Multi-core POWER5 processor deployed in IBM eServer pSeries machines (e.g., p5 590) for virtualization workloads.

2007 IBM POWER6 is a high-frequency PowerPC/POWER chip with SMT and hardware virtualization that powered IBM Power 570 and Power 595 enterprise servers.

2014 IBM POWER8 multi-core processor with CAPI support, deployed in Power Systems E870 servers for enterprise workloads.

2017 IBM POWER9 processor featuring OpenCAPI, NVLink, and SMT, powering Summit and IBM Power9 servers.

Arrays of Mach-Zehnder interferometers (MZIs) and microring resonators on a silicon chip implement programmable unitary matrices in the optical domain. Light encodes values as amplitude or phase; passing through a mesh of beam-splitters (MZIs) with tunable phase shifters multiplies an optical input vector by the weight matrix in a single forward pass. Because photons travel at c and interference is intrinsically parallel, a single matrix-vector multiply completes in picoseconds with energy consumption set only by modulation and detection, not arithmetic logic. MIT demonstrated a photonic processor running all key deep-learning operations on-chip. Neuromorphic silicon photonics has achieved 50 GHz tiled matrix multiplication. Speed: picoseconds per matrix-vector multiply; 50 GHz demonstrated. Capacity: 64×64 to 512×512 unitary matrices on current chips; ~4-6 bit precision.
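
A single MZI can be modelled as a 2×2 unitary built from two 50:50 beam splitters and two phase shifters. The decomposition below is one common convention and is an assumption here, as are the function names; actual chips may order the phase shifters differently.

```python
import cmath
import math

def matmul(a, b):
    """Product of two 2x2 complex matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def mzi(theta, phi):
    """Assumed layout: outer phase phi, splitter, inner phase theta, splitter."""
    s = 1 / math.sqrt(2)
    splitter = [[s, 1j * s], [1j * s, s]]          # 50:50 beam splitter
    inner = [[cmath.exp(1j * theta), 0], [0, 1]]   # phase shifter between the splitters
    outer = [[cmath.exp(1j * phi), 0], [0, 1]]     # phase shifter on one input arm
    return matmul(splitter, matmul(inner, matmul(splitter, outer)))

def apply(u, vec):
    """Multiply a 2x2 matrix by a length-2 vector of optical amplitudes."""
    return [u[0][0] * vec[0] + u[0][1] * vec[1],
            u[1][0] * vec[0] + u[1][1] * vec[1]]

out = apply(mzi(0.7, 1.3), [1.0, 0.0])
power = sum(abs(v) ** 2 for v in out)   # a lossless device conserves total power
```

Sweeping `theta` from 0 to π moves the light from the cross port to the bar port, which is how a mesh of such cells is tuned to realize a chosen weight matrix.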

The plasmodial slime mold extends filaments toward nutrient sources and progressively reinforces paths that carry more flow, pruning inefficient routes. Toshiyuki Nakagaki showed it reproduces the Tokyo rail network topology. Speed: hours to days (biological growth/optimization). Capacity: network optimization problems with ~10-100 nodes.

A two-bar linkage with a tracing point at one end and a measuring wheel mounted on the tracer arm. When the operator traces the boundary of an arbitrary shape, the wheel rolls only in the direction perpendicular to the tracer arm — the component encoding the integrand of Green's theorem (∮ x dy). The total wheel rotation equals the enclosed area regardless of path geometry. The polar planimeter (Amsler, 1854) requires no straight guide rail and works anywhere on a flat surface. Precision versions routinely achieve 0.1% accuracy. Historically used in cartography, engineering drawing, and medical imaging to measure irregular areas from printed plans. Speed: seconds to minutes per area measurement (tracing speed). Capacity: single scalar output (area); arbitrary curve complexity.
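
The line integral the wheel accumulates has a direct discrete analogue. The sketch below approximates ∮ x dy around a polygon traced counter-clockwise; `enclosed_area` is an illustrative name, and the traced shape is a made-up rectangle.

```python
def enclosed_area(points):
    """Approximate the line integral of x dy around a closed polygon (counter-clockwise)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        # Midpoint x times the step in y: exact for straight edges.
        area += 0.5 * (x0 + x1) * (y1 - y0)
    return area

rectangle = [(0, 0), (4, 0), (4, 3), (0, 3)]
area = enclosed_area(rectangle)
```

Tracing the same outline clockwise flips the sign of the result, just as running the planimeter backwards unwinds the wheel.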

A jet of air entering a Y-shaped channel naturally attaches to one wall (the Coandă effect) and locks into that state by low-pressure recirculation. A small control jet on the opposite side provides enough momentum to switch the main jet to the other wall — bistable flip-flop behaviour with no moving parts. AND, OR, NOT, and NOR gates are realized by channel geometry; outputs fan out by splitting the attached jet. Developed in the early 1960s at the Harry Diamond Laboratories (Bowles, Gottron) and widely used in industrial control until PLCs displaced them. Inherently radiation-hardened (no electronics) and tolerant of dust and oil. MTBFs of 25,000–50,000 hours reported. Speed: milliseconds per gate switching (air transit time). Capacity: arbitrary boolean circuits; industrial systems ran thousands of gates.

Qualcomm Hexagon is a VLIW DSP inside Snapdragon SoCs that accelerates audio, vision, and machine learning workloads.

Qualcomm Hexagon NPU is the tensor accelerator embedded in Snapdragon platforms, combining Hexagon DSP cores and tensor accelerator fabric to deliver power-efficient on-device inference.

Quantum and quantum-inspired systems for solving combinatorial optimization problems through annealing processes. Includes true quantum annealers (D-Wave) using superconducting qubits and quantum-inspired CMOS implementations (Fujitsu, Toshiba, Hitachi) that simulate annealing dynamics. Speed: microseconds to seconds. Capacity: hundreds to thousands of variables.

Superconducting qubits manipulated by microwave pulses to perform unitary operations. Quantum gates like Hadamard, CNOT, and phase gates enable quantum algorithms such as Shor's factoring and Grover's search. Speed: nanoseconds to microseconds (gate operations). Capacity: exponential in qubit count (theoretical universal quantum computation).

Elowitz & Leibler (2000, Nature) constructed a synthetic oscillator in E. coli from three mutual repressor genes wired in a ring: LacI represses tetR; TetR represses cI; CI represses lacI. No gene product directly activates its own production, yet the circular negative feedback drives sustained oscillations in protein concentration with a period of ~150 minutes. The repressilator is a physical implementation of a relaxation oscillator: the mathematical operation is sustained limit-cycle dynamics, the same function realized by a CMOS ring oscillator or a Van der Pol circuit — but in living cells. Demonstrates that genetic regulatory networks can be designed as analog computing substrates, encoding functions (oscillation, bistability, logic) in DNA sequence. Speed: ~150 min period (transcription/translation kinetics). Capacity: single-frequency oscillator; frequency tunable by changing promoter strength or mRNA degradation rate.

Fixed nonlinear dynamical system (reservoir) coupled to a trained linear readout layer. Input drives the reservoir dynamics, output layer learns to extract desired computations. Echo state networks and liquid state machines are implementations. Speed: depends on reservoir substrate (microseconds to seconds). Capacity: temporal sequence processing (scales with reservoir size).

A sheet of Teledeltos — carbon-coated resistive paper with ~6 kΩ/square resistivity — conducts current that obeys the same Laplace equation as electrostatic potential, steady-state heat conduction, inviscid fluid flow, and Darcy groundwater seepage. Boundary conditions are imposed by painting silver-loaded conductive ink in the shape of conductors or flow boundaries; a voltage is applied across them. A probe voltmeter scanned over the sheet reads the potential field directly. Complex 2D geometries that would require days of PDE numerics can be mapped in hours. Widely used from the 1930s through the 1970s in capacitor design, transformer core analysis, dam seepage studies, and aircraft aerodynamics before finite-element codes displaced it. Speed: hours for full field map (manual probe scanning); boundary setup in minutes. Capacity: 2D scalar field on arbitrary domain geometry; ~1-2% accuracy.
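
What the paper does physically can be imitated by relaxing a grid until each point equals the average of its neighbours. This is a minimal sketch, assuming a rectangular sheet with electrodes painted along the left and right edges and bare cut edges at top and bottom; the grid size and sweep count are arbitrary.

```python
def solve_laplace(width=11, height=7, sweeps=2000):
    """Gauss-Seidel relaxation of Laplace's equation on a rectangular sheet."""
    v = [[0.0] * width for _ in range(height)]
    for row in v:
        row[0] = 1.0   # left electrode held at 1 V; right column stays at 0 V
    for _ in range(sweeps):
        for y in range(height):
            # Cut paper edge: mirror the neighbour so no current crosses the boundary.
            up = y - 1 if y > 0 else y + 1
            down = y + 1 if y < height - 1 else y - 1
            for x in range(1, width - 1):
                v[y][x] = 0.25 * (v[up][x] + v[down][x] + v[y][x - 1] + v[y][x + 1])
    return v

field = solve_laplace()
probe = field[3][5]   # "probe voltmeter" reading at the centre of the sheet
```

For this simple geometry the potential falls linearly between the electrodes; painting a differently shaped electrode corresponds to fixing a different set of grid cells.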

Elastic bands stretched between pins hammered into a board relax under tension to a state of minimum total length. Because each band pulls with a force proportional to its extension, the equilibrium configuration satisfies the equal-angles condition at every interior junction — the defining property of a Steiner tree. The result is the shortest network connecting all pins, approximating the solution to the NP-hard Euclidean Steiner tree problem. The mechanism is combinatorially distinct from the soap-film Steiner tree (Plateau's problem in 2-D) because the topology of junctions is fixed by the discrete wiring of the bands, not by a continuous surface. Speed: instantaneous (elastic equilibration). Capacity: Steiner tree for ~5-20 pins (limited by physical layout).

SPARC (Scalable Processor Architecture) is a VLSI RISC architecture from Sun Microsystems/Oracle featuring register windows that keep deep call stacks performant and powering Sun and Oracle workstations and enterprise servers.

Reconfigurable Dataflow Units implement granular dataflow graphs by combining configurable tiles with per-tile scheduling and streaming data paths. Each tile bundles compute arrays, SRAM buffers, and intra-tile routers, enabling DataScale to keep thousands of MACs busy while mapping compiled tensor graphs to the dataflow fabric.

Samsung HBM2 memory with Processing-in-Memory logic for AI, deployed in research prototypes to offload vector-heavy kernels and shorten data movement for next-generation accelerators.

The SiFive U54-MC cluster combines four RV64IMAFD general-purpose cores with an E51 monitor core and coherent cache fabric, delivering low-power Linux-capable RISC-V compute used on the HiFive Unleashed development board; the later HiFive Unmatched moved to the U74-MC with an S7 monitor core.

High-performance out-of-order RISC-V core family from the SiFive Performance Series, optimized for scale-up Linux server deployments.

The SiFive X280 is a vector extension core aimed at AI acceleration, realizing machine learning inference workloads in SiFive Lighthouse and other AI boards.

A physical system coupled to a heat bath at slowly decreasing temperature explores its energy landscape. At high temperature it escapes local minima; as T→0 it settles into a global minimum — if cooling is slow enough. Speed: minutes to hours (depends on cooling schedule). Capacity: global optimization problems (scales exponentially with problem size).
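
The software counterpart of this process is simulated annealing. The sketch below uses the Metropolis acceptance rule with a geometric cooling schedule; the schedule parameters, the toy energy function, and all names are assumptions chosen for illustration.

```python
import math
import random

def anneal(energy, neighbour, state, t_start=2.0, t_end=1e-3, cooling=0.995, seed=0):
    """Minimise energy(state) by random moves under a falling temperature."""
    rng = random.Random(seed)
    cur, cur_e = state, energy(state)
    best, best_e = cur, cur_e
    t = t_start
    while t > t_end:
        cand = neighbour(cur, rng)
        cand_e = energy(cand)
        # Metropolis rule: always accept downhill, accept uphill with probability exp(-dE/T).
        if cand_e <= cur_e or rng.random() < math.exp((cur_e - cand_e) / t):
            cur, cur_e = cand, cand_e
            if cur_e < best_e:
                best, best_e = cur, cur_e
        t *= cooling   # slow geometric cooling
    return best, best_e

def f(x):
    """Toy landscape with several local minima (hypothetical)."""
    return x * x + 3.0 * math.sin(5.0 * x)

def step(x, rng):
    """Propose a nearby state."""
    return x + rng.uniform(-0.5, 0.5)

x_best, e_best = anneal(f, step, state=3.0)
```

A `cooling` factor closer to 1 lowers the temperature more slowly, trading run time for a better chance of leaving local minima, which mirrors the "if cooling is slow enough" condition.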

Logarithmic scales engraved on sliding rules allow multiplication by physical addition of lengths (log a + log b = log ab). Precision is bounded by engraving quality and human reading resolution — typically 3 significant figures. Speed: seconds (human reading time). Capacity: single arithmetic operation (3 significant figures).
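
A small sketch of the same trick in code: add the logarithms, then round the result to mimic reading resolution. The function name and the three-figure default are illustrative.

```python
import math

def slide_rule_multiply(a, b, sig_figs=3):
    """Multiply two positive numbers by adding log-scale lengths, read to sig_figs figures."""
    length = math.log10(a) + math.log10(b)   # sliding the scales adds the lengths
    product = 10 ** length
    # Round to sig_figs significant figures to mimic human reading resolution.
    exponent = math.floor(math.log10(product))
    scale = 10 ** (exponent - sig_figs + 1)
    return round(product / scale) * scale

approx = slide_rule_multiply(3.14159, 2.71828)
```

As on a physical rule, the operator still has to track the decimal point; here the `exponent` variable plays that role.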

A soap film spanning a closed wire boundary settles into the surface of minimum area — the solution to Plateau's problem. For two parallel rings it realizes a catenoid. Can approximate Steiner trees for planar point sets. Speed: seconds to minutes (surface tension equilibration). Capacity: continuous optimization over infinite-dimensional space.

Cut n spaghetti strands to lengths proportional to the n values to be sorted. Gather them loosely in a fist and lower them vertically onto a flat table so all strands stand upright. Lower a flat hand from above: the first strand it touches is the maximum. Remove it, record the value, repeat — each contact extracts the next-largest in O(1) time. Preparing the rods is O(n); the n extractions are O(n); the whole sort is O(n) in physical time, exploiting the parallel nature of gravity and contact. Introduced by A. K. Dewdney in Scientific American. Illustrates how physical parallelism can circumvent the Ω(n log n) comparison-sort lower bound by using a non-comparison primitive (contact with a plane). Speed: O(n) physical steps; each step is constant time. Capacity: n positive real values; precision limited by ability to cut and measure strand lengths.
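
A software imitation makes the source of the speed-up visible. In the sketch below the `max` call stands in for the hand touching the tallest strand; on a sequential machine that call is itself an O(n) scan, so the program runs in O(n²) overall and only the physical version gets each extraction in constant time.

```python
def spaghetti_sort(values):
    """Return values in descending order by repeatedly removing the tallest rod."""
    rods = list(values)
    ordered = []
    while rods:
        tallest = max(rods)   # the hand lands on the tallest rod (an O(n) scan in software)
        rods.remove(tallest)
        ordered.append(tallest)
    return ordered

result = spaghetti_sort([3, 1, 4, 1, 5])
```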

SpiNNaker machines at the University of Manchester network over a million ARM968 cores via packet-switched triple-torus routers to run spiking neural networks with local plasticity and sensor-motor I/O in real time, letting cortical-scale models stay synchronized as spikes hop across the mesh.

Sun SPARC v8 is a 32-bit RISC ISA with register windows and clean encoding, used in Sun SPARCstation systems to accelerate interactive multi-threaded UNIX development tasks.

Sun SPARC v9 extends the SPARC ISA with full 64-bit registers and addressing, wider floating-point units, and server-class scaling (large caches and coherent SMP) to keep pace with Solaris enterprise services.

Texas Instruments' MSP430 family is an ultra-low-power 16-bit microcontroller platform widely used in energy-harvested sensor nodes and low-power embedded monitoring tasks, combining deep sleep modes with fast wake-up and analog integration for deterministic control loops.

The TI TMS320 C2000 family of 32-bit fixed-point DSPs optimizes deterministic motor control loops with on-chip ADCs, PWMs, comparators, and other peripherals for real-time sensing and actuation.

Low-power 16-bit fixed-point DSP for audio and voice processing widely used in digital hearing aids.

The TI TMS320 C6000 family comprises 32-bit VLIW (very long instruction word) DSPs engineered for high-throughput signal processing, often deployed in base station and other wireless infrastructure hardware.

Configurable VLIW/dual-issue DSP core used across Cadence HiFi audio DSP families such as HiFi 3 and Tensilica LX processors, enabling extensible ISA custom instructions for audio decoding and machine-learning inference.

Tenstorrent Grayskull is a tile-based architecture of compute tiles with matrix engines paired with on-tile SRAM to deliver massive data-parallel tensor math and throughput for large neural networks.

Tenstorrent Wormhole is a multi-chip module designed for large language models, providing a high-bandwidth interconnect and integration with the Tenstorrent software stack.

Uses thermal noise in analog circuits to sample from Boltzmann distributions. Thermal fluctuations provide natural randomness that follows statistical mechanics principles. The Normal Computing SDE (Stochastic Differential Equation) approach leverages this thermal noise for computation. Speed: microseconds to milliseconds (thermal equilibration). Capacity: probabilistic sampling problems (scales with circuit complexity).

Analog physics-based computers using thermodynamic principles for computation. Normal Computing's Stochastic Processing Unit (SPU) uses RLC circuits as unit cells with all-to-all coupling via switched capacitances, natively simulating Langevin/Ornstein-Uhlenbeck dynamics for probabilistic reasoning, generative design, and scientific computing.

Transmeta's Crusoe family paired a 128-bit VLIW core with Code Morphing Software that dynamically translated x86 binaries into native instructions, caching hot traces so that fanless ultra-portables could run at very low power; the successor Efficeon widened the core to 256 bits. The combination delivered x86 compatibility and long battery life for early Crusoe and Efficeon notebooks such as the NEC MobilePro and Sony VAIO U-series.

UPMEM Processing-In-Memory DIMMs combine DRAM banks with embedded RISC DPUs, enabling data-center scale parallel search and graph analytics without moving data back to host CPUs.

Ventana's high-performance multi-core RISC-V targeted at AI and HPC workloads within Ventana's own compute node and server fabric, optimizing for chiplet scalability and domain-specific acceleration.

Water levels in vessels encode binary digits; a siphon and slow drain combine to implement AND and XOR in a single cup-and-tube unit. A filled cup is a 1, an empty cup a 0. When two cups feed one container the siphon trips (AND = carry), while the remainder leaks out the XOR drain. These half-adder cells chain into a multi-bit ripple adder. No moving parts beyond the water itself. Speed: seconds to minutes per bit (gravity-driven flow). Capacity: 4-bit addition demonstrated; theoretically scalable.
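
How the half-adder cells chain into a ripple adder can be sketched with bits in place of cups. This is a logical model only: the way two carries are merged (an OR here) is an assumption about the plumbing, and the function names are illustrative.

```python
def half_adder(a, b):
    """One cup-and-tube cell: returns (carry via the siphon, sum via the drain)."""
    return a & b, a ^ b

def full_adder(a, b, carry_in):
    """Two half-adder cells; the two carry outlets are merged into one."""
    c1, s1 = half_adder(a, b)
    c2, s2 = half_adder(s1, carry_in)
    return c1 | c2, s2

def ripple_add(x, y, bits=4):
    """Add two unsigned integers bit by bit, carrying from low to high."""
    carry, result = 0, 0
    for i in range(bits):
        carry, s = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry

total, overflow = ripple_add(5, 6)
```

Each stage must wait for the carry from the stage below it, which is why the physical version's delay grows with the number of bits.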

Two steel balls are mounted on hinged arms linked to a rotating vertical shaft driven by the engine. As engine speed increases, centrifugal force swings the balls outward and upward; through a collar linkage this motion partially closes the steam throttle, reducing power and slowing the engine. As speed falls the balls drop, the throttle reopens, and the cycle repeats. The system finds equilibrium where centrifugal force exactly balances gravity — and that equilibrium corresponds to the desired set speed. James Watt adapted this in 1788 from a windmill governor; James Clerk Maxwell's 1868 paper 'On Governors' analysed it as the first mathematical treatment of feedback control. The device is a physical analog computer that continuously solves the equation: throttle = f(ω − ω_set). Speed: continuous real-time (mechanical response time ~0.1–1 s). Capacity: single-variable set-point control; extends to multi-variable with additional linkages.
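
The feedback loop can be imitated with a toy simulation. The sketch below assumes a first-order engine model and a purely proportional throttle law; the model, the gain `kp`, and the load values are hypothetical and not taken from Watt's design or Maxwell's analysis.

```python
def simulate_governor(omega_set=10.0, load=1.0, kp=2.0, dt=0.01, steps=5000):
    """Euler-integrate a toy engine: d(omega)/dt = throttle - load * omega."""
    omega = 0.0
    for _ in range(steps):
        # Flyball linkage: the throttle closes as speed rises above the set point.
        throttle = max(0.0, omega_set - kp * (omega - omega_set))
        omega += dt * (throttle - load * omega)
    return omega

nominal = simulate_governor()           # settles at the set speed under nominal load
loaded = simulate_governor(load=2.0)    # a heavier load settles at a lower speed
```

The second run shows the steady-state offset ("droop") characteristic of proportional control: the speed falls under load, but by less than it would with a fixed throttle.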