Reverse Engineer’s Perspective on the Boeing 787 ‘51 days’ Directive
One thing to consider when looking at such things is that commercial avionics software systems are full of known limitations. I do not know if this particular 51-day limitation was intentional or not, but in general:
Avionics software starts with writing comprehensive requirements. When the software is developed based on those requirements, it is then tested against them: sometimes in a real functioning airplane, but more often in smaller airplane-cockpit-like rigs and in purely simulated environments.
Nobody is going to write a requirement that says "this avionics subsystem will function without error forever". Even if you thought you could make it happen, you can't test it. So there are going to be boundaries. You might say that the subsystem will function for X days. What happens after that? It may well run just fine for X+1 days, or 2X days, or 100X days. But it's only required to run for X days, and it's only tested and certified for running for X days.
I could easily imagine that this particular subsystem was required and certified for some value of X <=51 days, and it just so happened that if the subsystem ran for over 51 days then it started to fail. Or, it could have been a genuine mistake.
But even if the intended X wasn't 51 days, there almost certainly was some intended, finite value for X. We might say, "well, my laptop has run for three years without needing a reboot". Great! Is that a guaranteed, repeatable state of operation that the FAA would certify? Probably not. And besides that, do we really want to have to endure a three-year verification test?
In most software, we are happy to say, "it should run indefinitely". For avionics software, that's insufficient. We instead say "it will run at least for some specific predetermined finite amount of time" and then back up that statement with certifiable evidence.
I work in a field that operates under similar development constraints (namely, a mature product in a mature field with well-defined requirements). Because of this, I regularly get calls from my customers wondering why their system can't do X or Y the B way instead of the A way, and I have a similar conversation, in which I have to explain: "no, that wasn't part of your requirements five years ago; if you want to change it, you'll need to pay us for more development." That normally eliminates the requirement for whatever it was they wanted pretty quickly.
Also, uptime is a factor. I've seen what Windows looks like when it runs out of GDI objects; it's strange. But once you see it, you can explain to the customer the importance of regular reboots/restarts.
I never understood why regular, scheduled reboots are considered to be a problem to begin with.
It can expose hidden costs. A PC which can only be assured to be correct by rebooting cannot continuously monitor a flow process that cannot be interrupted for that reboot window. It has to be designed to work with two machines, or some kind of data buffering has to be designed in, or the specification has to be changed to redefine "continuous" (*)
Which, btw, is what should be done, but it can cause rage.
[*] may not be continuous or complete in all circumstances
But a 787 works fine with those reboots being part of scheduled maintenance. So the issue is what exactly again?
Essentially, changes in typical operating procedures at airlines broke previous assumptions about regular full aircraft power-downs, which triggered both the 248-day bug and the current 51-day bug.
It used to be that an aircraft would get a full power-down as often as daily, but as individual components became more reliable and external power became readily available, it became common for aircraft not to be fully shut down between flight days.
Nothing. I was responding to a question asking why it might be a problem. It didn't say "in a 787"; it said "in general". I'm suggesting a class of problem in which it might surface: the wider question.
All aircraft have maintenance schedules, and a requirement to reboot a computer periodically isn't onerous. What is onerous is the insane cost of recertification: fixing this problem so no reboot is required would be very expensive, not just in FAA process burdens but in wider costs. The 787 battery problems probably wrecked the entire profit of the model for years.
The MAX flight-safety issue on another Boeing aircraft may mean it's never profitable. The industry is weird.
I worked in healthcare, where our EMR went into downtime for two hours on daylight-saving transition days. It was extremely disruptive, as we had to switch to a paper process for that period, which then had to be reconciled with the EMR at the end of the shift.
Unless you have a dedicated team doing that, preventative reboots and various “workarounds” sound great on paper for administrators but make for a shitty experience for people doing the actual work.
Difference between your example and the reboot requirements of various aircraft: aircraft reboots happen in controlled environments, on the ground when the aircraft is out of operations, and are done by dedicated, trained and certified maintenance staff. Those reboots, while funny at first glance, do not interfere at all with aircraft operations.
Sounds about right. But it’s still a critical failure for a fault of any kind to ever display incorrect information to the pilot.
And in this case it seems one function of the software is interfering with another, which causes the incorrect display.
I highly doubt it was intentional. Boeing's already had to issue an AD for similar behavior on the 787: https://www.engadget.com/2015-05-01-boeing-787-dreamliner-so...
If they knew about it there'd be no need for an AD. Boeing tried to become the aviation equivalent of a fabless chip designer with the 787 and it didn't go well at all. Turns out they had little-to-no experience managing external development and manufacturing teams. I don't know anything about the 51-day bug, but the 248-day bug caused critical failures that you really wouldn't want happening in flight.
> Nobody is going to write a requirement that says "this avionics subsystem will function without error forever".
These time limits could at least be pegged to real-life intervals at which the system is going to be shut down anyway. If the system continues to be operated past that point, skipped maintenance intervals could be flagged as the cause.
It is in fact possible to write provably correct software for safety-critical applications.
Not by testing, but by using formal methods.
That's nice for the software. Now how about the hardware? How about the electronic hardware's not-exposed firmware, does that count? Did the subcontractor test it for three years at 10,000 feet for radiation-induced bit-flips? With or without lightning strikes?
> How about the electronic hardware's not-exposed firmware, does that count? Did the subcontractor test it for three years at 10,000 feet for radiation-induced bit-flips? With or without lightning strikes?
Blast the module in a radiation chamber. It can be done; it's just extremely expensive. The military has the budget (which makes sense, given that a fighter jet or a bomber should be able to power through nuclear-bomb fallout), but civilian airliners are all about cost efficiency.
Needing to include a system reboot, on the ground, as part of your ongoing maintenance activities means there is a fault, or incorrect software.
Is a roof you have to redo every 20 years, or a paint that only lasts 10 years faulty? Is a car that needs brakes replaced every X thousand kilometers faulty?
It is only faulty if it does not run according to spec, or if you run it outside the spec.
Exactly. If the manual says "reboot every 51 hours", you do just that and all is fine. If you have to reboot every, say, 25 hours, something is broken.
For example, let’s imagine that the timestamp set by the transmitting ES is close to its wrap-around value. After performing the required calculation, the receiving ES obtains a timestamp that has already wrapped-around, so it would look like the message had been received before it was actually sent.
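The scenario in that quote is easy to reproduce. A minimal sketch, assuming an arbitrary 16-bit counter width purely for illustration (the real counter's width and rate are not stated here):

```python
# Sketch of the wrap-around scenario described above.
BITS = 16          # illustrative width, not the actual avionics counter
MOD = 1 << BITS

sent = MOD - 10                  # transmit timestamp just before wrap-around
received = (sent + 100) % MOD    # 100 ticks later, the counter has wrapped

naive_latency = received - sent         # negative: "received before it was sent"
safe_latency = (received - sent) % MOD  # modulo arithmetic recovers the 100 ticks
```

Naive subtraction yields a large negative number, which is exactly the "message received before it was sent" symptom; reducing the difference modulo the counter width gives the true elapsed ticks.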
Isn't it surprising that modulo arithmetic, as already employed successfully in TCP sequence numbers and the like, still seems to be incorrectly implemented today? What's more disappointing is seeing all the other incredible systemic complexity they've added, and yet the plane appears to have no mechanical backup instruments?
To address the second part:
> and yet the plane appears to have no mechanical backup instruments[?]
This is unlikely in a modern aircraft because mechanical instruments to back up e.g., the artificial horizon / attitude indicator or directional gyro (DG) / heading indicator are:
(1) Mechanically complex: the attitude indicator and DG use gyroscopes spinning at up to 24,000 RPM, along with other mechanisms. They are typically driven by vacuum or by electric motors, which draw comparatively more power (or require vacuum lines and a vacuum pump).
(2) Expensive to maintain (see (1)): they need to be serviced fairly regularly.
(3) Heavier than their solid-state counterparts.
(4) Have [dramatically] different failure modes: instead of a display going dark, a DG will slowly drift as the gyroscope precesses, giving erroneous values. Same with the artificial horizon. This can lead to catastrophic results under instrument meteorological conditions (IMC), where the pilots rely solely on instruments to maintain essentials such as heading and level flight.
(5) Because of (4), they require additional redundancy so that instruments can be cross-checked against one another. This compounds (2) and (3).
I think you are overstating the impracticality of mechanical standby instruments. Even glass-cockpit GA aircraft typically came with fully mechanical backups until fairly recently; check out this SR22 cockpit as an example: https://commons.wikimedia.org/wiki/File:SR22TN_Perspective_C...
"Glass" standby instruments come with significant upside and not much downside, which is why they've been preferred in larger/more expensive aircraft for a while. There is nothing inherently more or less reliable about them, being fully isolated and redundant just as old-timey mechanical backups are, and they offer a much richer presentation (typically like a small PFD). However, new things are usually more expensive, which IIUC is why they were adopted first in larger, more expensive aircraft. They were considered a luxury in GA until fairly recently.
Plus the pilot stress of having to adjust to using dramatically different instruments when already in a difficult situation.
It's just not a workable idea in general. There are checklists for stuff like instrument failure which can probably recover from a software bug like this.
It's absolutely a workable idea. Standby instruments are typically a requirement for glass cockpit aircraft, and before electronic standby instruments came onto the scene mechanical instruments were used in the standby role in (AFAIK) all sectors of aviation.
"Fly the airplane" is the highest priority in any aviation emergency, and in many emergencies you will need backup instruments to do so. I don't mean to be mean, but tbh it is a little absurd to suggest that, e.g., a pilot who loses her PFD in IMC is better off running checklists than using backup instruments to establish control of the aircraft and situational awareness, and bailing out asap. Sure, it's stressful, but it's also something pilots need to (and do) train for.
Once the aircraft is under control, you can run your checklists, or if you have a co-pilot you may be able to work in parallel. Maybe you will be able to fix the issue, and maybe you won't, but backups give you a shot at landing safely either way.
I think backups for the electronic systems would not need the same level of redundancy as the primary systems (which presumably already have backups).
It's sort of like how you don't need RAID for your offsite backup disks, just some parity for bit-rot.
The mechanical instruments would be the (additional) redundancy. The additional weight/lines/service is indeed burdensome even without redundant mechanical systems.
> I think backups for the electronic systems would not need the same level of redundancy as the primary systems (which presumably already have backups).
If your backup is failing more often than your primary system then it's not much use as a backup.
Also, there ARE backups. There's fallback artificial horizon boxes that work independently of the rest of the system, for example.
> Isn't it surprising that modulo arithmetic, as already employed successfully in TCP sequence numbers and the like, still seems to be incorrectly implemented today
Even in TCP sequence numbers, it can be implemented incorrectly.
https://engineering.skroutz.gr/blog/uncovering-a-24-year-old...
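For reference, the usual wraparound-tolerant comparison, in the spirit of RFC 1982 serial number arithmetic, can be sketched in a few lines (32-bit width assumed here, as in TCP sequence numbers):

```python
MOD = 1 << 32    # 32-bit sequence space
HALF = 1 << 31   # half the space: the "window" in which a is considered earlier

def seq_before(a, b):
    """True if sequence number a logically precedes b, tolerating wrap-around."""
    return 0 < (b - a) % MOD < HALF
```

With this, `seq_before(0xFFFFFFF0, 5)` is True even though 5 is numerically smaller, because 5 is reached a few increments after the counter wraps. The subtlety the linked bug illustrates is that every comparison site must use this form; one stray plain `<` reintroduces the bug.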
Fascinating analysis. I know planes get used a lot, but I'm surprised that they go for such a long time without ever being powered down.
51 days seems to be approximately how often my Mac dies in a kernel panic or starts getting bugged by persistent software problems that go away with a restart.
I’m at 356 days of uptime on my MacBook Pro. ¯\_(ツ)_/¯
So you never install security updates? Because all Apple updates require a reboot due to their SIP "update the frozen image offline" nonsense.
IIRC security updates always required a reboot even before SIP existed
Awww ya jinxed it. And only nine days from retirement.
Sadly I'm at 2 days, 12:41 myself. I don't get many kernel panics, but this most recent reboot was in fact a panic, coincidentally. Googled the error and it came up as something that happens with M1 Mac Minis while they're sleeping. But while my machine has a M1, it is an MBP and not a Mini. And it was not sleeping. Ah well.
I just rebooted my EeePC. It had an uptime of 5.8 years. I only rebooted it to upgrade from Debian 9 to Debian 10, and I'll bump it up to Debian 11 later on. It has a broken screen, so it just sits on top of a cupboard with a couple of 4TB USB hard drives plugged into it, storing all my backups.
Last week I had to shut down my linux box for a move: up 3457 days. One day too late I guess :-)
But you typed this message 7 days ago.
I remember articles of the Airbus A350 requiring reboots every N days (150ish or so?). I remember the Patriot missile system required a reboot every 24 hours or so until they fixed the software defect which caused the time counting to drift. And I'm pretty sure there are many more such cases where devices fail if kept on for too long, even in spaces where you are supposed to fill out a lot of "paper"work + jump through a lot of defined processes like in avionics, medical, or automotive field, among a good few others (safety and all that).
We had a bug years ago that after 50 days of uptime all network sessions dropped on our devices. Apparently it was a session timer overflow in a variable. I think it was unsigned int and time was in milliseconds.
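That lines up with an unsigned 32-bit counter ticking in milliseconds, which wraps just short of 50 days. A quick check (a sketch of the arithmetic, not the device's actual code):

```python
# Rollover interval of an unsigned 32-bit millisecond counter.
wrap_ms = 1 << 32
wrap_days = wrap_ms / 1000 / 86_400   # 86,400 seconds per day
# wrap_days is about 49.71, close to the "50 days of uptime" observed
```

Once the counter wraps, naive `now - session_start` subtraction goes negative, which is consistent with every session suddenly looking invalid at once.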
tl;dr: 51 days is the wraparound point of a signed 6-byte counter running at 33 MHz, used to invalidate stale data from instruments.
When I saw 51 days, my first thought was that it had to be a time rollover, mainly because of this bug from long ago and how close the time spans are.
https://www.cnet.com/culture/windows-may-crash-after-49-7-da...
This assumes there is no margin of error baked into the 51 day rule, which surprises me.
This communication is not Boeing communicating a required maintenance interval; they're communicating a problem. It wouldn't seem natural to me for Boeing to add a random hidden margin in a problem description. When it comes to the maintenance remedy, I don't know whether Boeing, the airlines, or the FAA would set that. Presumably the mandatory maintenance reboot interval will be much shorter than 51 days.
2^47/(32MHz) ~= 50.9 days
Not much of a margin there.
2^47/33e6 = 49.36 days. The value is far enough off that it makes me suspect this is not the correct analysis, or at least that there are additional factors at play.
exactly. you could find a plausible clock rate that rolls over in ~51 days with any number of assumed counter widths. this is not indicative of anything but guessing.
and there is no margin at all or negative margin, so clearly this is wrong. is the rollover period 52 days? 60? 90?
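The point about guessing can be made concrete: sweep a handful of common counter widths and clock rates and see how many unrelated combinations land near 51 days (the widths and rates below are arbitrary illustrative choices, not values from the directive):

```python
# Which (counter width, clock rate) pairs roll over in roughly 49-53 days?
rates_hz = [1, 1_000, 32_000_000, 33_000_000, 66_000_000, 100_000_000]

candidates = []
for bits in range(32, 65):
    for hz in rates_hz:
        days = (1 << bits) / hz / 86_400
        if 49 <= days <= 53:
            candidates.append((bits, hz, round(days, 2)))
```

Several pairs fit the same coarse number, e.g. a 32-bit millisecond counter (49.71 days) and a 47-bit counter at 32 MHz (50.9 days) as well as at 33 MHz (49.36 days), which is why the 2^47 / 33 MHz story alone proves little.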
the whole article is a bunch of unsubstantiated speculation dressed up with lots of facts and details to distract one's attention.
I feel like even 51 minutes might be too long to wait before invalidating stale instrument data on an aeroplane...
All I have to say is that if my firmware barfs after being up for 8.919 million years, I won't care.
Please add (2020) to the title.
This is a really good analysis of the issue from just the verbiage of the FAA directive. Well done.
Reminds me of the LAX Air Traffic Control Shutdown of 2004: https://m.slashdot.org/story/49885
Why?
Was it a cost issue?
Or was there an expectation that a regular maintenance check would occur within this time frame that involved a reboot as part of the maintenance check for diagnostics?
51 days is slightly more than 2^32 milliseconds?
If that were the issue, then they'd have to reboot it every 49.7 days, no? Waiting 51 days would trigger the problem they're trying to avoid.