Orbital Data Centers Have a Silicon Problem Nobody Is Pricing

Last week, SpaceX unveiled AI1: a first-generation orbital data center satellite with a wingspan wider than a 747 and a 110-square-meter radiator, announced the same week as the company’s IPO. The FCC filing behind it contemplates a constellation of up to a million satellites. NVIDIA announced space-rated versions of its accelerated computing platforms at GTC in March. Elon Musk has predicted that within five years, more AI compute will be launched annually than the cumulative total operating on Earth. Take that forecast however you like; the capital behind it is real.

The public debate has settled into two camps. The bulls point at free solar power and an infinite heat sink. The bears point at launch costs and the impossibility of sending a technician to swap a failed board at 550 kilometers. Both camps are arguing about the right things: power, thermal, economics, serviceability.

But there’s a line item missing from almost every analysis I’ve read, and it happens to be the one I’ve spent the last several years pricing: what the space radiation environment does to commercial silicon, what it costs to find out, and where the uncertainty actually bites.

I lead product strategy for radiation-tolerant microelectronics: memory, processing, and storage flying on defense and commercial space programs across the major primes. My job, reduced to one sentence, is deciding which silicon can be trusted in orbit and what it takes to earn that trust. So I read the orbital compute roadmaps with a specific question in mind, and what I see is an industry pricing every constraint except the one my field spends entire careers managing.

Let me be precise about the environment, because overstating it is how credibility dies with the people who know better.

A 550-kilometer, mid-inclination orbit inside a spacecraft structure is the gentlest neighborhood in space. The dose environment is dominated by trapped protons and the South Atlantic Anomaly. Total ionizing dose over a five-year life, behind even incidental shielding, lands in low-kilorad territory that plenty of commercial silicon tolerates. If TID were the whole problem, this essay would not need to exist.

The residual problem, the one that does not shield away, is single-event effects. Galactic cosmic ray heavy ions arrive with energies that make shielding a losing proposition, and when one transits a transistor, things happen: a single-event upset flips a bit, a single-event functional interrupt hangs a controller until reset, and a single-event latch-up creates a parasitic short that destroys a device unless power is cycled fast enough. That last one matters because it is not graceful attrition. It is a permanent, sometimes cascading, hardware loss.

And modern advanced-node silicon is, by design, more exposed to this class of problem, not less. Every process shrink reduces the charge that defines a stored bit, which reduces the energy a particle needs to flip it. The highest-density memory, the HBM stacks and high-speed DRAM that make an AI accelerator an AI accelerator, is the most sensitive silicon on the manifest. Terrestrial operators already see the preview at sea level: large-scale DRAM field studies have long traced a fraction of memory errors to particle strikes, and the hyperscalers’ silent data corruption work shows what uncharacterized silicon failure modes do to a fleet at scale, all under a full atmosphere of shielding. On orbit, the particle flux driving the strike-induced share of those failures rises by orders of magnitude.

Here is where the launch revolution genuinely changes the calculus, and it deserves a fair hearing. A fully reusable heavy-lift vehicle delivering on the order of a hundred-plus tonnes per flight makes mass cheap in a way the space industry has never experienced. Cheap mass buys real radiation margin: thicker structure eats trapped-proton dose, generous TID margins become nearly free, redundant strings cost kilograms instead of programs, latch-up-tolerant power architectures can afford the extra circuitry, and the radiators can be as big as the thermal engineers want.

What cheap mass does not buy is protection from cosmic ray heavy ions. Stopping GCR takes not millimeters of aluminum but something closer to meters of material, and the shielding math gets perverse before it gets better: high-Z shielding struck by high-energy particles produces secondary particle showers that can raise the effective upset rate behind the shield. The materials science answer (graded-Z stacks, hydrogen-rich layers) helps at the margins and matters for crewed vehicles, but no commercially sane mass budget shields an AI accelerator out of the heavy-ion environment. Starship changes what mass can buy. It does not change what mass cannot buy.

Now the argument a sharp reader is already composing: AI accelerators are economically obsolete in two or three years regardless of radiation. If the fleet is being deorbited and replaced on a fast cadence for compute-economics reasons, radiation-driven attrition does not need to be precisely characterized. It just needs to stay below the obsolescence line. Crude bounds plus margin may suffice.

This is a genuinely strong argument and I will concede most of it for the workload it fits: stateless, retryable inference on independent nodes, where a crashed job restarts and a dead node is a line item. If that is the whole business, the disposability model covers a lot of sins.

It fails in four specific places. First, the design phase. Before the first thousand nodes fly, someone has to size the error correction, the checkpoint strategy, the power protection, and the spares line against numbers that do not exist yet, and a 5x miss in either direction means either a fleet that underdelivers or margin mass flown for nothing, multiplied by the constellation. Second, destructive failure modes. Latch-up is not attrition that amortizes; an unprotected part lost to SEL is gone, and whether your fleet loses 0.1 percent or 3 percent of nodes per year to permanent hardware kills is exactly the kind of number nobody can currently quote for current-production commercial parts. And before anyone objects that modern low-voltage processes have largely engineered latch-up away: susceptibility at advanced nodes is itself part-dependent, some die are close to immune and some are not, and knowing which bucket your part falls in is precisely the characterization that does not exist. Third, the electronics that are not the accelerator. The bus, power management, and flight computers are expected to outlive several payload generations, and they do not get the disposability excuse. Fourth, the workloads the gigawatt roadmaps actually advertise. Training and long-running distributed jobs span many nodes, and fault tolerance for those does not come free, which brings us to numbers.

A piece about pricing should contain a price, so here is the back-of-envelope, with every input explicitly illustrative.

For a long-running job checkpointed against failures, the classic approximation says overhead scales with the square root of (checkpoint cost divided by mean time between failures). Take a distributed job spanning 64 accelerator nodes with a 2-minute checkpoint-and-restore cost. If interrupt-class events (the upsets and functional interrupts that survive error correction and actually kill the job, since ECC scrubs the vast majority of raw upsets transparently) occur at 3 per device-year, the job sees one every couple of days and the math works out to roughly 4 percent overhead: tolerable. If the true rate is 15 per device-year, a 5x miss well inside the uncertainty band for an uncharacterized part, overhead roughly doubles to around 9 percent, and it scales up with both the rate and the number of nodes a job spans. Nine percent of every distributed job’s compute is not a rounding error; it is the difference between a business case and a press release for exactly the workloads the gigawatt roadmaps advertise. Layer on the destructive-loss question (a fleet that loses 2 percent of nodes per year to latch-up needs a permanently fatter replacement line than one that loses essentially none) and the point stands without any heroic assumptions: the cost of not knowing the failure rates is itself a quantifiable line item, and it compounds at constellation scale.
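For readers who want to rerun the envelope with their own inputs, here is a minimal sketch. It assumes the approximation in question is Young's first-order formula, in which the overhead fraction at the optimal checkpoint interval is the square root of (2 × checkpoint cost / mean time between failures); the factor of 2 and the function names are my assumptions for illustration, and the event rates are the same illustrative figures used above, not measured data.

```python
import math

MINUTES_PER_YEAR = 365.25 * 24 * 60


def job_mtbf_minutes(events_per_device_year: float, nodes: int) -> float:
    """Mean time between job-killing events for a job spanning `nodes` devices.

    Assumes events are independent across devices, so rates simply add.
    """
    return MINUTES_PER_YEAR / (events_per_device_year * nodes)


def checkpoint_overhead(checkpoint_minutes: float, mtbf_minutes: float) -> float:
    """First-order overhead fraction at the optimal interval: sqrt(2C / M).

    Assumption: Young's approximation, valid when C is much smaller than M.
    """
    return math.sqrt(2.0 * checkpoint_minutes / mtbf_minutes)


# Illustrative inputs from the text: 64 nodes, 2-minute checkpoint-and-restore.
for rate in (3, 15):
    mtbf = job_mtbf_minutes(rate, nodes=64)
    overhead = checkpoint_overhead(2.0, mtbf)
    print(f"{rate:>2} events/device-year: MTBF {mtbf / 60:.1f} h, overhead {overhead:.1%}")
```

Under these assumptions the two cases come out near 3.8 and 8.5 percent, consistent with the rounded figures above. Because overhead goes as the square root of the event rate times the node count, doubling the span of a job costs about as much as doubling the per-device rate.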

That is the precise sense in which I mean nobody is pricing the silicon. Not that radiation makes orbital compute impossible. That the uncertainty band around commercial-silicon behavior is wide enough to move the total cost of ownership math, and the industry is currently carrying that band at zero.

Here is an experiment anyone can run. Pick a mainstream, current-production commercial memory die: the kind of LPDDR or NAND that would actually fly inside an orbital compute node. Now go find its radiation characterization: upset cross-sections, latch-up thresholds, dose tolerance.

To be fair about what exists: decades of NASA, JPL, and ESA test reports cover thousands of parts, and the automotive industry characterizes terrestrial neutron soft-error rates under JEDEC standards. The archive is real. It is also sparse where it matters, stale against die revisions, and thinnest precisely at the current-production advanced nodes an AI payload would fly. Terrestrial neutron FIT data does not translate cleanly to heavy-ion response on orbit. And the structural problem compounds: independent beam campaigns cost hundreds of thousands of dollars per part and most of a year of scarce facility time, while commercial die steppings churn faster than campaigns run. By the time you have characterized a die, the vendor has shipped a new revision with different masks, different sensitivity, and the same part number.

So when a roadmap says commercial silicon with system-level mitigation, what it is actually saying is: we will size our error budgets, our protection circuits, and our replacement economics against failure rates that are unmeasured for the parts we will actually fly.

There is a self-correcting mechanism here, and intellectual honesty requires giving it full weight. The first operator to fly a thousand compute nodes will harvest better failure statistics in months than any beam campaign produces in a year, and the telemetry automatically tracks the die revision actually flying. At scale, operation is characterization. This is the strongest version of the disposability argument, and it genuinely narrows my claim.

The gap bites hardest in the design phase, before anyone’s telemetry exists, when the architecture decisions get sized against guesses. It bites every operator who is not first, and every part-selection decision for the next generation, because fleet telemetry tells you about the die you flew, not the one you are about to choose. And it bites the industry collectively, because telemetry without beam-anchored ground truth confounds effects that engineering needs separated.

Which brings us to the company everyone is thinking about. Design your own silicon, characterize your own die, hold the stepping stable, fly at scale, treat the constellation as a permanent radiation test campaign, iterate against real flight data. That closed loop is exactly the strategy SpaceX appears to be assembling, and I would argue it proves the thesis rather than refuting it: the fastest-moving operator in the field is already behaving as if qualification is the bottleneck. But two structural limits remain even there. Nobody fabs their own HBM; the most radiation-sensitive silicon on the manifest stays merchant silicon from a three-vendor oligopoly no matter how vertical the rest of the stack becomes. And a loop closed privately converts the industry’s data gap into one company’s moat.

That last point deserves honest framing. Private characterization is not irrational; it is a prisoner’s dilemma. Each operator is individually better off keeping its hard-won failure data proprietary, and the industry is collectively poorer for re-buying the same physics at full price, operator by operator, while the one player with fab control and fleet scale compounds its lead. Dilemmas of this kind get resolved by infrastructure: shared test consortia, simulation-anchored qualification standards, statistical lot-sampling regimes, telemetry data exchanges. The space industry has built exactly this kind of commons before, for debris tracking and conjunction data, and that precedent carries an instruction: it worked because an anchor institution carried it as a public good rather than waiting on voluntary pooling. A radiation data commons likely needs the same, a NEPP-scale program, a standards body, or a test consortium with a government tenant, because the prisoner’s dilemma will not resolve itself out of goodwill.

Traditional radiation-hardened components are not the answer for this mission; that supply chain was built for hundred-unit programs at five-figure unit prices, not million-satellite constellations. Lockstep redundancy is not the answer either, and not because nobody thought of it: triple modular redundancy assumes deterministic, bit-identical outputs to vote on, and accelerated inference breaks that assumption before a single particle arrives, through non-associative parallel arithmetic and run-to-run scheduling variance. Determinism can be forced at a real performance cost, which is the honest phrasing: the classic mitigation does not extend at acceptable cost. The realistic reliability stack gets adapted from what terrestrial hyperscalers built for silent data corruption (fleet-scale statistical monitoring, algorithm-level checks, retry architectures) and then hardened against an environment those tools never had to price.

The orbital data center race will not be won by whoever flies the biggest radiator. Radiators are a solved problem with a mass budget. It will be won by whoever closes the loop between commercial silicon economics and space reliability physics, because that loop decides what fraction of a million satellites is delivering compute in any given hour, and at what replacement cost. The open question for the next decade is whether that loop gets closed once, privately, as a moat, or becomes infrastructure the whole industry builds on.

The physics is patient. It will wait for the industry to start pricing it.

Vincent Pribble leads product strategy for radiation-tolerant microelectronics at Mercury Systems. He previously built launch vehicles at ULA and Blue Origin, and writes about the rad-hard supply chain, qualification economics, and what happens when commercial silicon meets orbital physics. Views are his own.
