NATS report into air traffic control incident details root cause and solution

22 points by bigjump 3 years ago · 20 comments

bigjumpOP 3 years ago

Full report link: https://publicapps.caa.co.uk/docs/33/NERL%20Major%20Incident...

spuz 3 years ago

Thanks. One thing that I don't quite understand is how do new waypoints get added as part of the conversion of a flight plan from ICAO4444 to ADEXP format? Does it do some kind of interpolation?
Also, it appears the error was caused by the algorithm selecting two waypoints with the same identifier as the entry and exit points into UK airspace for this particular flight plan. But it also says non-unique waypoints should be at least 4000nm apart from each other so they can be disambiguated. Since UK airspace isn't that big, shouldn't the algorithm have chosen entry and exit waypoints closer to the borders?
Edit: actually it looks like UK airspace extends a few thousand km to the west of the coastline, which makes it more plausible that it covers duplicate waypoints.
- macguillicuddy 3 years ago
  
  I'm just speculating as to what points are actually added but typically the flight plan includes 'airways' in addition to waypoints. Here's an example of a plan for London Gatwick (EGKK) to Edinburgh (EGPH):
```
EGKK/08R LAM1Z LAM L10 BPK UN601 INPIP INPI1E EGPH/24
```
And to take a section out:
```
BPK UN601 INPIP
```
  In this case, BPK and INPIP are two waypoints (one in the south of England and one in the north). UN601 is an airway that connects these two waypoints. The airway represents a ton of other waypoints between BPK and INPIP but that you don't need to specify manually in the flight plan. I suspect the additional waypoints added are the splitting of the airway into the underlying waypoints - but as I say it's speculation :-)
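A minimal sketch of that speculated airway expansion, assuming a lookup table from airway name to its ordered fixes. The `AIRWAYS` table and the `WPT01`..`WPT03` names are invented placeholders, not the real UN601 fixes, and `expand_route` is a hypothetical helper rather than anything from the report:

```python
# Hypothetical airway table: the intermediate names are placeholders,
# not the real fixes along UN601.
AIRWAYS = {
    "UN601": ["BPK", "WPT01", "WPT02", "WPT03", "INPIP"],
}

def expand_route(tokens, airways=AIRWAYS):
    """Replace each 'FIX AIRWAY FIX' triple with every fix along the airway."""
    expanded = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in airways and expanded and i + 1 < len(tokens):
            fixes = airways[tok]
            start = fixes.index(expanded[-1])   # fix where we join the airway
            end = fixes.index(tokens[i + 1])    # fix where we leave it
            if start <= end:
                segment = fixes[start + 1 : end + 1]
            else:
                # Airway flown in the opposite direction.
                segment = fixes[end:start][::-1]
            expanded.extend(segment)            # includes the leaving fix
            i += 2                              # skip airway and leaving fix
        else:
            expanded.append(tok)
            i += 1
    return expanded

print(expand_route(["BPK", "UN601", "INPIP"]))
```

The point is only that a short filed route can grow into a much longer list of waypoints once airways are resolved.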
Rochus 3 years ago

"An FPRSA sub-system has existed in NATS for many years and in 2018 the previous FPRSA subsystem was replaced with new hardware and software manufactured by Frequentis AG, one of the leading global ATC System providers."
"At this point with both the primary and backup FPRSA-R sub-systems having failed safely the FPRSA-R was no longer able to automatically process flight plans. It required restoration to normal service through manual intervention."
How can a primary AND its backup system fail safely??? Who specified this?
"The actions already undertaken or in progress are as follows: 3) A permanent software change by the manufacturer within the FPRSA-R sub-system which will prevent the critical exception from recurring for any flight plan that triggers the conditions that led to the incident."
Means: now they catch the (Java) exception. Great.
- jonp888 3 years ago
  
  > How can a primary AND its backup system fail safely??? Who specified this?
  All safety critical systems are specified to halt instead of performing undefined behavior, if they encounter something that cannot be processed. An unsafe failure would be entering undefined behaviour. What would you have specified differently, that would be safer?
  A backup is primarily there in case of hardware failures or for maintenance. If it behaves differently to the primary then something is wrong. Can you explain how and why you would expect a backup system running identical software to behave differently?
  - Rochus 3 years ago
    
    I worked in safety-critical ATC projects in engineering and management positions (systems, quality and compliance engineering) for a decade. ATC systems are supposed to not fail, even under adverse conditions. Where high availability is required for safety reasons, redundant architectures are one of the options. Apparently the "backup system" was conceived for this purpose. According to the report (page 17) the responsible subsystem suffered from a "critical exception [..] that triggers the conditions that led to the incident", which caused both the primary and backup system to fail, and has now apparently been fixed. So obviously the system was not supposed to fail on receiving wrong or suspicious flight plan data, and it was apparently pure luck that no such data arrived for five years. To claim that the subsystem (consisting of the primary and backup system) "safely failed" indicates significant gaps in safety management (either faulty safety analyses, faulty specifications, or faulty configuration or software). The report suggests that critical omissions occurred at several levels.
    
    macguillicuddy 3 years ago
    
    For me it's important to consider the 'ATC system' as the whole. The system as a whole did not fail - no planes crashed, flights still flew - but it was in a degraded state with lower than usual throughput. One component of the system did fail (the FPRSA subsystem) and it seems reasonable to me that given layers of the system lean towards unavailability rather than trying to continue to operate in unforeseen circumstances.
    The purpose of a backup system is not to prevent failure - it's to improve resiliency of the system as a whole across a set of foreseen and unforeseen faults. Backup systems failing to handle any specific fault is an expected and predicted behavior. Thankfully in this case there was a backup system that prevented a complete shutdown (and, thankfully, any accident) - the manual processing of flight plans.
    
    Rochus 3 years ago
    
    Missing the availability requirements is a failure.
    Safety is not only about human lives, but also about health and property (also e.g. critical financial and other losses, or reputational damage). The present incident has obviously caused considerable damage. We can only hope that the rest of the system does not suffer from similar omissions and that it is not pure coincidence that even worse events have not occurred.
    
    macguillicuddy 3 years ago
    
    Yeah of course, but success/failure is also not binary. There are degrees of failure, including low-consequence availability issues, high-consequence availability issues, loss of operational safety, 'never events' (e.g. significant loss of life). In this case the system suffered the second of those options. It seems reasonable that design choices may prioritise that type of failure over the later ones in the list.
    The first part of this argument is semantics - how do we define failure. The second part is IMHO more important - what decisions are taken with regard to the behavior of subsystems and how they influence overall system degradation. In this case the overall design prevented any loss of operational safety which, to me, is a success.
    
    Rochus 3 years ago
    
    One can talk things up or fix them. As the report (and some comments) suggests, the former is given high priority.

Pages 8 & 9 of the full report have the details of what happened.

> … it was found to have encountered an extremely rare set of circumstances presented by a flight plan that included two identically named, but separate waypoint markers outside of UK airspace.

> This led to a ‘critical exception’ whereby both the primary system and its backup entered a fail-safe mode. The report details how, in these circumstances, the system could not reject the flight plan without a clear understanding of what possible impact it may have had. Nor could it be allowed through and risk presenting air traffic controllers with incorrect safety critical information.

A flight plan came in that had a duplicated waypoint ID at either end of the route. The flight-plan software, when trying to extract the UK portion of an overflight (origin & destination outside UK airspace), ended up focusing on both of those (identically-named but geographically-distinct) waypoints. Software thought they were duplicates, couldn't figure out what the UK portion of the flight plan was, and intentionally crashed. It did so, rather than reject the flight plan for an aircraft that may already be in the air.
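To make the failure mode concrete, here is a rough sketch of how a lookup by identifier alone can go wrong. This is not the actual FPRSA-R logic; `extract_uk_portion`, `CriticalException` and the waypoint names are all made up for illustration:

```python
class CriticalException(Exception):
    """Raised when no safe interpretation of the plan can be found."""

def extract_uk_portion(route, entry, exit_):
    """Return the slice of `route` from the entry fix to the exit fix."""
    # Naive lookup by identifier alone: list.index returns the FIRST match,
    # so a duplicated name earlier in the route shadows the intended one.
    entry_idx = route.index(entry)
    exit_idx = route.index(exit_)
    if exit_idx <= entry_idx:
        raise CriticalException(
            f"exit {exit_!r} found at or before entry {entry!r}"
        )
    return route[entry_idx : exit_idx + 1]

# Unique names: extraction works.
print(extract_uk_portion(["FIRST", "UKIN1", "UKMID", "UKOUT", "LAST"], "UKIN1", "UKOUT"))

# "DUPLI" appears twice; the earlier one is matched and extraction fails.
try:
    extract_uk_portion(["DUPLI", "AAAAA", "UKIN1", "UKMID", "DUPLI"], "UKIN1", "DUPLI")
except CriticalException as exc:
    print("critical exception:", exc)
```

Matching on name plus position, or searching only forward from the entry point, would avoid this particular trap in the sketch.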

Rochus 3 years ago

And they shut down the entire system because of one incoherent plan? What a great example of an ingenious high-availability architecture.
- gchadwick 3 years ago
  
  > rather than reject the flight plan for an aircraft that may already be in the air
  I wonder why they can't reject the flight plan for an aircraft that's already in the air? Presumably they have any number of reasons to reject a perfectly valid flight plan that's been submitted, let alone invalid ones, so there must be a rejection mechanism?
  The explanation offered by the report (from a quick skim found this on page 9) is:
  > Having found an entry and exit point, with the latter being the duplicate and therefore geographically incorrect, the software could not extract a valid UK portion of flight plan between these two points.
  > ...
  > In this case the software within the FPRSA-R subsystem was unable to establish a reasonable course of action that would preserve safety and so raised a critical exception
  The failure is portrayed as a reasonable thing to do and yes it's good the system failed safe rather than continued with a bunch of corrupt data no-one knew about, but it seems bizarre that a single dodgy flight plan forcing the whole system to shut down was an intentional part of the system design. It does sound like they don't have strong isolation around individual flight plan processing, so an exception thrown there just propagated up to bring the whole thing down.
  More damningly, duplicated waypoint names with different positions are a known issue, with work ongoing to produce a globally unique set of names (from what the report says), so this is hardly unexpected. Surely any decent test plan would have included this scenario?
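For what it's worth, the kind of per-plan isolation meant here could look roughly like the sketch below. `process_all` and `process_one` are hypothetical names, and whether setting a plan aside for manual handling is acceptable for a safety-critical system is exactly the open question, so treat it as an illustration of the pattern only:

```python
def process_all(plans, process_one):
    """Process each plan independently; a failure affects only that plan."""
    processed, quarantined = [], []
    for plan in plans:
        try:
            processed.append(process_one(plan))
        except Exception as exc:
            # Set the plan aside for manual handling instead of
            # letting the exception stop the whole loop.
            quarantined.append((plan, exc))
    return processed, quarantined

def demo_process_one(plan):
    # Stand-in processor: rejects one plan to show the isolation.
    if plan == "bad":
        raise ValueError("cannot extract UK portion")
    return plan.upper()

done, held = process_all(["a", "bad", "b"], demo_process_one)
print(done)
print([plan for plan, _ in held])
```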
  - longwave 3 years ago
    
    > I wonder why they can't reject the flight plan for an aircraft that's already in the air?
    You need to know everything that may be in the air - if you skip the details of a flight that may be in the air, you risk routing another flight through the same space and the possibility of collision? So if you can't do that safely, the only option is to shut down; existing flights can continue but no new flights can be routed until the anomaly is resolved.
  - Rochus 3 years ago
    
    > and yes it's good the system failed safe rather than continued with a bunch of corrupt data
    The authors of the report obviously made an effort to suggest this; but then on page 18 they nevertheless admit that there will be "A permanent software change by the manufacturer within the FPRSA-R sub-system which will prevent the critical exception from recurring for any flight plan that triggers the conditions that led to the incident."

CaliforniaKarl 3 years ago

As an example of duplicate airspace waypoints ("fixes"): Head over to https://opennav.com/, and search "PINTO". You'll find the identifier being used for a waypoint in the United States, in Colombia, and in Chile.

In general: waypoints are five letters, VORs & similar are three letters, and NDBs are two or three letters.

This is an example of how older forms of identification come under stress in a modern world. It never mattered if you had a duplicate-named waypoint many countries away; waypoints were defined by intersecting lines (typically relative to two VORs), or by a set distance from a reference point (such as a VOR/DME). Plotting a route would make it obvious how the different waypoints fit in, relative to the start/end and intermediate navigational aids (VORs etc.).

But then waypoints started getting GPS coördinates, and were collected into large databases. It's a problem that has been known since it became a problem, but it still causes issues (like leap seconds!).
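One common way to pick between same-named entries in such a database is to take the candidate nearest the previous point on the route. A small sketch, where the coordinates are invented (not the real PINTO positions) and `resolve` is a hypothetical helper:

```python
import math

EARTH_RADIUS_NM = 3440.065  # mean Earth radius in nautical miles

def haversine_nm(a, b):
    """Great-circle distance in nautical miles between (lat, lon) in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(h))

def resolve(name, previous_pos, database):
    """Pick the entry named `name` that is closest to the previous position."""
    return min(database[name], key=lambda pos: haversine_nm(previous_pos, pos))

# Invented positions for two waypoints sharing one identifier.
database = {"PINTO": [(40.0, -100.0), (-30.0, -70.0)]}
print(resolve("PINTO", (41.0, -99.0), database))
```

This only works when duplicates are far apart relative to the distance between consecutive route points, which is the assumption behind keeping duplicate names widely separated.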

darkclouds 3 years ago

> This is the root cause of the incident. We can therefore rule out any cyber related contribution to this incident.

They were hacked. Obviously this is the first time this flight path has been filed, otherwise it would have crashed earlier.

Phone Phreaking, whilst not technically a cyber attack, is still a form of hacking of phone systems.

And email bombs exist which take out email servers and readers, and zip bombs https://en.wikipedia.org/wiki/Zip_bomb

So was this the first instance of a flight path being used as a denial of service, and of course the "Blitish" playing down its significance because it doesn't want to offend anyone due to its current isolated precarious state?

pixelpanic360 3 years ago

Why these baseless conspiracy stances?
- pja 3 years ago
  
  Because this post has fallen off the front page, so almost no one is reading it to downvote nonsense like this.
