In space, no one can hear you kernel panic (2020)

88 points by p0u4a 2 months ago · 26 comments

dfox 2 months ago

> running identical software on multiple computer systems is the name of the software-architecture game

In the railway signalling industry (which for historically obvious reasons is obsessed with reliability) there is even a pattern of running different software implementing the same specification, written by a different team, running on a different RTOS and a different CPU architecture.
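A rough sketch of that pattern, for anyone who hasn't seen it: two deliberately different implementations of one spec (integer square root here, purely as a stand-in), with a comparator that refuses to produce an answer when they disagree. All names below are hypothetical, and real systems also diversify the OS and hardware, which a single Python process obviously can't show.

```python
class ChannelDisagreement(Exception):
    """Raised when the diverse implementations do not agree."""


def isqrt_newton(n: int) -> int:
    # "Team A": Newton iteration on integers.
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x


def isqrt_bisect(n: int) -> int:
    # "Team B": binary search for the largest r with r*r <= n.
    if n < 0:
        raise ValueError("n must be non-negative")
    lo, hi = 0, n + 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo


def checked_isqrt(n: int, channels=(isqrt_newton, isqrt_bisect)) -> int:
    # Run every channel and only accept a unanimous result;
    # on disagreement the caller is expected to fall back to a safe state.
    results = [channel(n) for channel in channels]
    if len(set(results)) != 1:
        raise ChannelDisagreement(f"channels returned {results} for input {n}")
    return results[0]
```

With only two channels you can detect a disagreement but not tell which side is wrong, so the only sensible reaction is to stop or go to a safe state.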

superxpro12 2 months ago

This is also true of the Space Shuttle. The failover '5th' processor was running an implementation done by a completely different sandboxed team to hedge against institutional or systemic errors not caught by the first team. So much thought put into these systems.
Seen against that, the "safety" put into modern autonomous vehicle systems under 'modern vehicle safety standards' still makes me cringe.

somat 2 months ago

"From the dawn of the Space Age through the present, NASA has relied on resilient software running on redundant hardware to make up for physical defects, wear and tear, sudden failures, or even the effects of cosmic rays on equipment."

An interesting case study in this domain is to compare the Saturn V Launch Vehicle Digital Computer with the Apollo Guidance Computer.

Now the LVDC, that was a real flight computer: triply redundant, every stage in the processing pipeline had to be vote-confirmed, the works.

https://en.wikipedia.org/wiki/Launch_Vehicle_Digital_Compute...

Compare the AGC, with no redundancy: a toy by comparison. But the AGC was much faster and lighter, so they just shipped two of them (three if you count the one in the lunar module) and made sure it was really good at restarting fast.

There is a lesson to be learned here but I am not sure what it is. Worse is better? Can not fail vs fail gracefully?
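For what it's worth, the vote-at-every-stage idea can be sketched in a few lines. This is a toy model under my own assumptions (three channels, exact-match majority, hypothetical names like `run_voted_pipeline`), not a description of how the LVDC hardware actually did it.

```python
from collections import Counter


class NoMajority(Exception):
    """Raised when no two channels agree on a stage output."""


def vote(a, b, c):
    # 2-of-3 majority vote over the three channel outputs.
    value, count = Counter((a, b, c)).most_common(1)[0]
    if count < 2:
        raise NoMajority(f"outputs {a!r}, {b!r}, {c!r} all differ")
    return value


def run_voted_pipeline(stages, x, faults=None):
    # stages: list of pure functions applied in order.
    # faults: optional dict {(stage_index, channel): corrupting_function},
    #         used only to simulate a bad channel.
    faults = faults or {}
    for i, stage in enumerate(stages):
        outputs = []
        for channel in range(3):
            out = stage(x)
            corrupt = faults.get((i, channel))
            outputs.append(corrupt(out) if corrupt else out)
        # The voted value, not any single channel's value, feeds the next stage.
        x = vote(*outputs)
    return x
```

Because the vote happens after each stage, a single bad channel gets masked before its error can propagate, and a different channel can fail in a later stage without the faults adding up.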

anonymous_user9 2 months ago

The command module and lunar module each had one AGC. (The lunar module did include a simpler backup computer called the Abort Guidance System.)
I think this is because an AGC failure is recoverable in most phases of flight, while an LVDC failure is not.
baud147258 2 months ago

> Worse is better?
Maybe if you know what the tradeoffs are and are ready to deal with the deficiencies (by rebooting fast). And didn't they have issues with the lunar module guidance computer on the first Moon landing?
KurSix 2 months ago

I think the lesson is that redundancy can exist at different layers.
budman1 2 months ago

It all depends on the failure.
A transient bit failure in digital circuits? Then reboot and away you go.
A coding or algorithmic defect? Reboot and you are back in the same place.
Also, the AGC was directly interfaced to an astronaut. They could decide to ignore erroneous outputs from the AGC.
throwup238 2 months ago

> There is a lesson to be learned here but I am not sure what it is.
Restart your Claude Code sessions as often as possible

thomascountz 2 months ago

OT: I really enjoyed The Increment when it was first being released. It felt like the first software engineering practitioner's publication and introduced me to a lot of new people to follow.

KurSix 2 months ago

The contrast with modern software development is striking. Today we often rely on fast iteration and patching problems in production. Spacecraft software is the opposite.

wongarsu 2 months ago

On the other hand a lot of SpaceX's success can be attributed to applying modern software development methodology on spacecraft. They are very much doing agile development, betting on velocity enabling fast iteration.
That has led to some of the best rockets ever developed, and the largest satellite constellation by far. But part of the secret sauce is creating situations where you can take risks. Traditionally anything space-related deals in one-offs or tiny production volumes, so any risk is expensive. A lot of SpaceX's strategy is about changing this, whether that's by testing in flight phases the customer doesn't care about, being their own best customer to have lower-risk flights, or building constellations so big that certain failure scenarios aren't a big issue (while other scenarios still have to be treated as high-risk, high-impact).
- superxpro12 2 months ago
  
  I recall an early deep-dive into their safety architecture on the Falcon 9, which was basically "throw 3 COTS processors at it and reboot anything that doesn't work, and fail fast during development". I remember they explicitly avoided rad-hard processors as well.
  I would love to update myself if anyone has a good source.
  For better or worse, it's hard to argue with results.
  - budman1 2 months ago
    
    maybe they are in a 'sweet spot'. SpaceX is not on the bleeding edge of anything; rather they are optimizing existing solutions. incremental design changes, in a problem domain that has been studied for decades and is well known, will provide results. in the same way, "web dev" for an e-commerce platform will show great improvement with an agile, move-fast development process.
    change the fundamental nature of the propulsion, or make a step change in the technology, and it may be more effective to go with an engineered approach.
    'engineered approach' --> before the item is built, a very good idea of how it is going to work has been determined, using math and science.
  - whattheheckheck 2 months ago
    
    Imagine trying to explain to 1960s taxpayers that we're going to build and blow up multiple rockets for research velocity and dev feedback loops.

rkagerer 2 months ago

Today, that could also be a great title for a commentary about datacenters in space.

throwaradfy5745 2 months ago

How would these considerations affect Musk's space cloud?

gostsamo 2 months ago

The same way they will affect the upcoming mission to the center of the galaxy. The space cloud is much more related to the upcoming SpaceX IPO than to any phenomena of the physical or computing universes. Thermodynamics says "no".
rogerrogerr 2 months ago

Starlink very likely leans toward “many cheaper satellites that may fail” instead of “fewer expensive satellites that are less likely to fail”.
Their advantage in the satellite-internet industry is that they can launch stuff fast and cheap; very likely this drives different tradeoff decisions than the regime this article talks about.
- Panzerschrek 2 months ago
  
  Having thousands of satellites also allows finding more software bugs, so in reality they can be more reliable than NASA-style probes (where each one has its own unique software).
- phanarch 2 months ago
  
  The Starlink tangent misses something important about why software reliability in satellite systems is categorically different from hardware reliability.

gnabgib 2 months ago

(2020)

shadowbyte17 2 months ago

interesting point about patching in production – it's a totally different mindset. we had a similar issue with a legacy system at my old job, felt like a constant firefighting situation.

adampunk 2 months ago

Do not attempt to adjust your television. We control the horizontal. We control the vertical.

We know Glenn is loquacious.
