What These Ten Stories Share
The patterns behind industrial disasters, the humans inside the systems — and a note from the author
By VastBlue Editorial · 2026-03-26 · 19 min read
Series: What Really Happened · Episode 11
Ten Stories, One Argument
Over ten episodes, we investigated industrial disasters, near-misses, and operational saves that span five decades, four continents, and nearly every domain of critical infrastructure. A fuel depot in Hertfordshire that exploded because a gauge stuck and a backup switch had failed. A power grid across the northeastern United States and Ontario that collapsed because a software bug silenced the alarms. A nuclear plant on the Pacific coast of Japan whose catastrophic flooding was a direct consequence of a cost-saving excavation decision made forty years earlier. An airliner over the Atlantic that fell from the sky because its pilots did not understand what their automation was doing. A reactor in Soviet Ukraine that was destroyed by the very safety test designed to prove it was safe. A radiation therapy machine that killed patients because its software had never been tested against the hardware it controlled. A flood barrier on the Thames that represented the opposite of every other story in this series — a forty-year decision that worked. A Boeing 767 over Manitoba that ran out of fuel at 41,000 feet because of a unit conversion error and was saved by a pilot who happened to know how to glide.
These stories differ in almost every particular. They involve petrochemicals, electricity, nuclear fission, aviation, software, civil engineering, and medical devices. They took place in England, the United States, Japan, the mid-Atlantic, Ukraine, Canada, and on the Thames. Some killed nobody. Some killed thousands. One saved eight million people by working exactly as designed. But strip away the surface differences — the specific industries, the specific technologies, the specific dates and geographies — and a set of structural patterns emerges that is remarkably consistent. These patterns are not new observations. They have been documented extensively in the safety science literature, from Charles Perrow's Normal Accidents to James Reason's Swiss cheese model to Sidney Dekker's work on drift into failure. What this series has attempted to do is not discover these patterns but demonstrate them — to show, through forensic narrative, what the theoretical frameworks describe in abstract terms.
What follows is an attempt to state those shared patterns plainly, to explain why they recur with such stubborn consistency, and to draw out the argument that ten episodes collectively make about how complex systems fail, how they are investigated, and how — sometimes — they are made safer.
The Drift
The single most consistent pattern across every disaster in this series is what the sociologist Diane Vaughan, studying the Challenger explosion, called the normalisation of deviance. In every case, the conditions that produced the catastrophe did not appear suddenly. They developed over months, years, or decades, through a process of gradual drift in which small deviations from designed operating parameters became accepted as normal, then routine, then invisible.
At Buncefield, the independent high-level switch on Tank 912 — the last line of automated defence against overflow — had been in a degraded state for an extended period before the explosion. Maintenance records showed a pattern of deferred repairs and recurrent faults that had been normalised into the operational background. The gauge that stuck was not an unprecedented event. It was the final manifestation of a maintenance culture that had been drifting for years. At Fukushima Daiichi, the decision to excavate the coastal bluff and lower the plant site by twenty-five metres — a cost-saving measure that placed the reactor buildings closer to sea level and significantly reduced the margin of protection against tsunami — was made during the original construction planning in the 1960s. In the decades that followed, multiple internal and external reviews identified the tsunami risk, including a 2008 internal study that estimated a possible fifteen-metre wave. Each time, the risk was assessed, discussed, and ultimately deferred. The drift was not a single decision. It was forty years of decisions not to act on accumulating evidence.
Chernobyl followed the same pattern at an institutional level. The RBMK reactor design had a known positive void coefficient at low power — a characteristic that made the reactor inherently unstable under exactly the conditions that the safety test required. This was not a secret. It was a design feature that had been documented, discussed, and effectively normalised within the Soviet nuclear establishment. The operators who ran the test on the night of 25–26 April 1986 did not know about the void coefficient in the way that mattered — they did not understand that it could destroy the reactor — because the institutional culture had normalised the gap between what the design documentation said and what operators were told. The drift was not in the reactor physics. It was in the information architecture of the organisation that operated it.
The Therac-25 demonstrated the same normalisation process in software. The race condition between the machine's software and its hardware — the bug that allowed the electron beam to fire at full intensity without the beam-spreading scanning magnets in place — had been present since the machine's software was written. It had not been caught during testing because the testing process was inadequate. More precisely, there had been no integrated testing of the software against the actual hardware behaviour under rapid operator input. The absence of such testing was not a conscious risk acceptance. It was a drift — from the Therac-20, which had hardware interlocks that made the software's behaviour irrelevant, to the Therac-25, which removed those interlocks and trusted the software to be correct. Nobody made a deliberate decision to remove safety-critical hardware interlocks and replace them with untested software. The change happened incrementally, across product generations, and the risk it created was never visible because it had never been tested for.
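How a fault like this hides is easier to see in miniature. The sketch below is illustrative Python, not the Therac-25's actual code (which ran as custom assembly on a PDP-11): it assumes a shared "setup complete" flag that a fast treatment routine checks while a slower configuration routine is still rewriting the beam parameters, which is broadly the shape of shared-variable race the Leveson and Turner analysis describes.

```python
import threading
import time

# Illustrative sketch only, not the Therac-25's actual code. A "setup complete"
# flag is shared between a slow configuration routine and a fast treatment
# routine; if the operator's "proceed" arrives before the reconfiguration has
# cleared the flag, the beam can fire with stale, full-intensity settings.

beam = {"mode": "xray", "intensity": "full"}   # parameters left over from the previous entry
setup_complete = True                          # flag still set from the previous entry

def apply_operator_edit():
    """Reconfigure for a low-intensity electron treatment; slow relative to keystrokes."""
    global setup_complete
    setup_complete = False
    time.sleep(0.05)                           # the hardware needs time to reposition
    beam["mode"] = "electron"
    beam["intensity"] = "low"
    setup_complete = True

def fire_if_ready():
    """Treatment routine: trusts the flag rather than the hardware's actual state."""
    if setup_complete:
        print("firing with", beam)             # may fire with the stale parameters

edit = threading.Thread(target=apply_operator_edit)
edit.start()
fire_if_ready()     # a rapid operator edit-and-proceed can win the race
edit.join()
```

The point of the sketch is not the timing but the structural choice: on the Therac-20, hardware interlocks made the outcome of this race irrelevant; on the Therac-25, the flag was all there was.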
The Northeast Blackout of 2003 was perhaps the purest illustration of drift at the systems level. The software bug in FirstEnergy's XA/21 energy management system — a race condition in the alarm and logging subsystem — had been introduced during a software update and had lain dormant for weeks before August 14. When it activated, it silenced the alarms that would have told control room operators that transmission lines were tripping. But the software bug was only the proximate cause. The deeper drift was in the reliability standards governing the interconnected grid. Compliance with NERC reliability standards was voluntary. FirstEnergy's vegetation management programme — the routine tree-trimming that keeps sagging transmission lines clear of overgrown branches — had been allowed to lapse. The trees that caused the initial line faults had been growing towards the conductors for years. The drift was botanical, organisational, regulatory, and digital, all at the same time.
The conditions that produce catastrophe do not appear suddenly. They develop over months, years, or decades, through a process of gradual drift in which small deviations from designed operating parameters become accepted as normal, then routine, then invisible.
After Diane Vaughan, The Challenger Launch Decision
The Gap Between Design and Operation
A second pattern recurs with striking consistency: in every disaster, there was a gap between how the system was designed to work and how it was actually operated, and that gap was where the failure lived. This is not the same as operator error, though it is often mischaracterised as such. The gap is structural. It exists because systems are designed under one set of assumptions and operated under another, and the distance between the two sets of assumptions grows over time as the system, the organisation, and the operating environment evolve while the design documentation does not.
Air France 447 is the most vivid illustration. The Airbus A330's flight control law — the software that mediates between pilot inputs and control surface movements — was designed with a set of protections that prevented the aircraft from exceeding its aerodynamic envelope under normal conditions. When the pitot tubes iced over and the airspeed data became unreliable, the flight control law reverted from Normal Law to Alternate Law, which removed several of these protections, including the angle-of-attack protection that would have prevented the aircraft from stalling. The pilots — specifically the co-pilot flying the aircraft — did not fully understand what Alternate Law meant, what protections had been removed, or why the aircraft was responding differently to his inputs. He pulled back on the sidestick, raising the nose and increasing the angle of attack, because in Normal Law, the flight control system would have prevented the aircraft from stalling regardless of his inputs. In Alternate Law, it did not. The gap between the designed behaviour and the pilot's mental model of that behaviour was the space in which 228 people died.
The Gimli Glider revealed the same gap from a different angle. Air Canada's fleet was in transition from imperial to metric units. The Boeing 767 was the airline's first metric aircraft. The fuelling procedures — designed for imperial measurement — had not been fully updated for the metric system. The ground crew calculated the fuel load using the wrong conversion factor (pounds per litre instead of kilograms per litre), a mistake compounded by the fact that the aircraft's fuel quantity indication system was inoperative, a defect deferred under a Minimum Equipment List provision that permitted dispatch provided the fuel load was confirmed by manual dripstick measurement. The system was designed to work with functional fuel gauges. It was operated without them. The gap between the design assumption and the operational reality was filled with a manual calculation, and the manual calculation contained a unit conversion error that left the aircraft with less than half the fuel it needed.
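The arithmetic of that error is worth laying out. The sketch below uses the figures commonly reported for the flight (a dripstick reading of 7,682 litres, a flight-plan requirement of 22,300 kilograms, and a fuel density of 0.803 kilograms per litre versus the 1.77 pounds per litre actually applied); treat the numbers as approximate illustrations rather than quotations from the inquiry report.

```python
# Worked sketch of the Gimli conversion error, using commonly reported figures.

litres_on_board = 7_682       # dripstick measurement of fuel already in the tanks
fuel_required_kg = 22_300     # flight-plan requirement, in kilograms

kg_per_litre = 0.803          # correct factor for this fuel's density
lb_per_litre = 1.77           # factor actually used: pounds per litre, not kilograms

# What the crew calculated: "kilograms" on board, and litres still to add.
assumed_kg_on_board = litres_on_board * lb_per_litre                    # really pounds
litres_to_add = (fuel_required_kg - assumed_kg_on_board) / lb_per_litre

# What was actually in the tanks after fuelling.
actual_kg = (litres_on_board + litres_to_add) * kg_per_litre

print(f"litres added:      {litres_to_add:,.0f}")    # about 4,900 L
print(f"believed on board: {fuel_required_kg:,} kg")
print(f"actually on board: {actual_kg:,.0f} kg")     # roughly 10,100 kg
```

On those figures the crew believed they were departing with 22,300 kilograms while the tanks actually held roughly 10,100, which is the "less than half" described above.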
At Chernobyl, the gap was between the reactor's designed operating envelope and the test protocol's requirements. The safety test required the reactor to operate at a power level — roughly 200 megawatts thermal — that placed it in a region of the power curve where the positive void coefficient was at its most dangerous. The test protocol had been written by electrical engineers interested in the turbine's coastdown characteristics, not by reactor physicists who understood the implications of low-power operation for an RBMK reactor. The test was designed for the electrical system. It was executed on a nuclear reactor. The gap between those two frames of reference was the space in which the explosion occurred.
The Thames Barrier is instructive precisely because it was designed with this gap in mind. The engineers who designed the barrier in the 1970s knew that the Thames tidal levels were rising and that climate projections were uncertain. They designed the barrier not for the conditions of 1982, when it was completed, but for the conditions of 2030 — and then added margin beyond that. The design explicitly accounted for the gap between present knowledge and future reality. The barrier has been closed over two hundred times. It has never failed. The difference between the Thames Barrier and Buncefield, or Fukushima, or Chernobyl, is not that the barrier was built by better engineers. It is that the barrier was designed by engineers who assumed the gap would exist and built the margin to survive it.
The Human in the Loop
Every story in this series is, at some level, a story about humans inside systems. Not humans versus systems. Not humans failing systems. Humans inside systems — affected by the system's design, constrained by the system's information architecture, responding to the system's incentives, and making decisions with the information the system provides, which is never the information they need.
The Air France 447 co-pilot pulled back on the sidestick because nothing in his training, his experience, or the aircraft's feedback systems told him clearly that the aircraft was stalling. The angle of attack itself was not directly displayed to the crew, and the stall warning — which should have been the most salient cue — was behaving paradoxically: it sounded when the nose was lowered (because the airspeed increased enough for the system to consider the data valid) and stopped when the nose was raised (because the airspeed dropped below the system's validity threshold, causing the warning to be suppressed). The pilot was receiving an inverted signal. The system was telling him the opposite of the truth. His response — pulling back — was wrong in aerodynamic terms, but entirely rational given the information the system was giving him.
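The paradox is easier to see as logic than as prose. The sketch below is a simplified illustration, not the A330's actual warning code: it assumes a single validity threshold on measured airspeed below which the angle-of-attack data is discarded and the warning suppressed, and the specific numbers are placeholders rather than the aircraft's certified values.

```python
# Simplified illustration of a stall warning suppressed by a data-validity rule.
# Threshold and stall angle are placeholder values, not the A330's.

VALIDITY_THRESHOLD_KT = 60   # below this, angle-of-attack data is treated as invalid
STALL_AOA_DEG = 10           # illustrative stall angle of attack

def stall_warning(airspeed_kt: float, aoa_deg: float) -> bool:
    if airspeed_kt < VALIDITY_THRESHOLD_KT:
        return False         # data rejected as invalid, so the warning falls silent
    return aoa_deg > STALL_AOA_DEG

# Deep in the stall: enormous angle of attack, very low measured airspeed.
print(stall_warning(airspeed_kt=45, aoa_deg=40))   # False: silence while stalled
# Nose lowered: airspeed recovers past the threshold and the data is valid again.
print(stall_warning(airspeed_kt=90, aoa_deg=15))   # True: warning sounds during recovery
```

With that logic the warning goes quiet when the pilot pulls up and returns when he pushes forward, which is precisely the inverted signal described above.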
The Chernobyl operators disabled safety systems during the test because the test protocol told them to. The protocol was authorised. It had been reviewed. The operators were following instructions. The instructions were wrong — not because they contained a typographical error, but because they were written by people who did not fully understand the system they were testing. The operators' compliance with the protocol was not recklessness. It was obedience to an authorised procedure that happened to be lethal, in a culture where questioning authorised procedures carried professional and potentially political consequences.
The Gimli Glider offers the mirror image: a case where human skill, experience, and improvisation saved a situation that the system had made unrecoverable. Captain Robert Pearson had extensive gliding experience — an unusual skill for a commercial airline pilot. When both engines flamed out at 41,000 feet, he knew how to fly without power. He knew the aircraft's glide ratio. He knew how to manage energy. He improvised a forward slip to lose altitude rapidly on final approach — a technique from small-aircraft flying that had never been performed in a Boeing 767. Sixty-nine people survived because one pilot had a weekend hobby that turned out to be the most relevant skill in an emergency that the entire aviation industry had declared impossible.
Every disaster in this series is a story about humans inside systems — affected by the system's design, constrained by its information architecture, responding to its incentives, and making decisions with the information the system provides, which is never the information they need.
The Therac-25 case reveals a particularly disturbing dimension of the human-system interaction. When patients reported burning sensations during treatment, the machine's operators checked the console displays, which showed no errors. The machine said nothing was wrong. The operators trusted the machine over the patients. This was not incompetence. It was the predictable consequence of a system design that provided no diagnostic information about the beam's actual state, combined with a training culture that emphasised the machine's reliability. The operators had been taught, implicitly and explicitly, that the machine was safe. When the machine and the patient disagreed, the operators believed the machine. Three patients died before anyone believed the patients.
The Northeast Blackout shows humans removed from the loop entirely — not by choice, but by system failure. When the XA/21 alarm system crashed, the control room operators at FirstEnergy lost their primary source of real-time grid state information. They could not see the transmission lines tripping. They could not see the cascading overloads. They were, in Sidney Dekker's phrase, "looking at the wrong display" — not because they chose the wrong display, but because the right display had silently stopped updating. The system failed without announcing its failure. The humans in the loop had no way to know they were no longer in the loop.
Defence in Depth — and Its Limits
The concept of defence in depth — multiple independent layers of protection, each capable of preventing an accident independently — is the foundational principle of industrial safety engineering. Every system described in this series was designed with defence in depth. Buncefield had a primary level gauge, an independent high-level switch, bund walls, and fire suppression systems. Fukushima had seawalls, backup diesel generators, battery systems, and multiple cooling pathways. The Airbus A330 had redundant pitot tubes, redundant flight computers, redundant hydraulic systems, and multiple flight control laws. The Therac-25 had software interlocks designed to prevent beam misalignment. The power grid had redundant transmission paths, automatic protection relays, and real-time monitoring systems.
In every disaster, the defences failed. Not one defence — all of them, or enough of them to make the remainder irrelevant. This is the core insight of James Reason's Swiss cheese model: each layer of defence has holes, and accidents occur when the holes align. But what this series demonstrates, through case after case, is something the Swiss cheese model implies but does not always make explicit: the holes are not random. They are correlated. They are correlated because the same organisational culture, the same maintenance practices, the same regulatory regime, and the same economic pressures act on all the defences simultaneously.
At Buncefield, the gauge that stuck and the switch that failed were maintained by the same maintenance programme, funded by the same budget, overseen by the same management. If the organisational culture was one in which deferred maintenance was normalised, that normalisation applied to both the primary gauge and the independent switch. The defences were designed to be independent. They were not independent in practice, because they shared a common cause: the organisation that maintained them.
At Fukushima, the seawall, the diesel generators, and the battery systems were all designed and sited to the same design-basis tsunami height. When the actual wave exceeded the design basis, all three defences failed simultaneously — not because of three separate failures, but because of a single shared assumption about maximum wave height that was embedded in the design of every layer. The defences were not independent. They were correlated by their shared design assumption. When the assumption proved wrong, all the layers failed at once.
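A rough calculation shows why the correlation matters so much. The numbers below are illustrative assumptions, not estimates for any real plant: three protective layers that each fail on one demand in a thousand, and a shared design assumption that turns out to be wrong one time in a hundred.

```python
# Illustrative comparison of independent versus common-cause failure of three
# defensive layers. Every probability here is an assumption made for the sketch.

p_layer = 1e-3              # each layer fails on 1 demand in 1,000
p_assumption_wrong = 1e-2   # shared design assumption wrong 1 time in 100

# Truly independent layers: all three must fail on their own.
p_independent = p_layer ** 3

# Correlated layers: if the shared assumption is wrong, every layer built on it
# fails together; otherwise each layer still has its own independent chance.
p_correlated = p_assumption_wrong + (1 - p_assumption_wrong) * p_layer ** 3

print(f"independent layers fail together: {p_independent:.1e}")   # 1.0e-09
print(f"correlated layers fail together:  {p_correlated:.1e}")    # about 1.0e-02
```

On these assumed numbers, one shared assumption makes simultaneous failure millions of times more likely than the independence calculation suggests. That is the arithmetic behind the observation that the holes in the Swiss cheese are not random.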
The Thames Barrier again provides the counter-example. Its designers did not assume they knew the maximum tidal level. They designed for a level significantly above the best estimates available at the time, and they built the structure to be adaptable — capable of being raised further if future conditions demanded it. The barrier's defence in depth was not just physical (the gates, the sills, the operating mechanisms) but epistemic: the design acknowledged the limits of current knowledge and incorporated margin for what was not yet known. This is the difference between defence in depth as a checklist of independent barriers and defence in depth as an engineering philosophy that accounts for common-cause failures and epistemic uncertainty.
What Changed — and What Did Not
Every disaster in this series produced an investigation, and every investigation produced recommendations, and most of those recommendations produced regulatory or procedural changes. Buncefield led to the Buncefield Standards Task Group and significant revisions to COMAH site safety practices across the UK. The Northeast Blackout led to the Energy Policy Act of 2005, which made compliance with NERC reliability standards mandatory rather than voluntary — a single-word change (from "voluntary" to "mandatory") that restructured the regulatory architecture of the North American power grid. Fukushima led to fundamental reassessments of nuclear safety worldwide, including the EU's stress test programme and, in some countries, decisions to phase out nuclear power entirely. Air France 447 led to changes in pilot training for manual flying skills, high-altitude stall recovery procedures, and airspeed indication system design. Chernobyl led to the creation of the World Association of Nuclear Operators and contributed to the political pressures that accelerated the end of the Soviet Union. The Therac-25 led to the FDA's first serious engagement with software as a safety-critical component in medical devices.
These changes are real and significant. Lives have been saved because of them. The question this series forces us to confront is whether the changes address the patterns or merely the specifics. Buncefield led to better gauge maintenance and testing protocols at fuel depots. But did it address the normalisation of deferred maintenance across all industries? The Northeast Blackout made grid reliability standards mandatory. But did it address the systemic tendency to allow software-dependent monitoring systems to be deployed without adequate failure-mode testing? Fukushima led to higher seawalls and relocated backup generators. But did it address the institutional dynamics that allowed forty years of accumulating evidence to be deferred, discussed, and ultimately ignored?
The honest answer is: sometimes. The regulatory changes following these disasters have been, in most cases, technically sound and operationally effective within their specific domain. Fuel depots are safer after Buncefield. The North American grid is more resilient after the 2003 blackout. Pilot training is better after Air France 447. But the structural patterns — the drift, the gap between design and operation, the correlated failures of nominally independent defences, the humans trapped inside information architectures that do not serve them — these patterns recur because they are features of complex sociotechnical systems, not bugs in specific industries.
The Thames Barrier remains the most instructive case because it was designed before a disaster, not after one. Its engineers did not need a catastrophic flood to motivate good design. They looked at the evidence — the rising tidal levels, the storm surge data, the precedent of the 1953 North Sea flood that killed over 300 people in England and more than 1,800 in the Netherlands — and they designed a system with sufficient margin and adaptability to handle conditions they could not precisely predict. The barrier works not because its engineers were smarter than the engineers at Buncefield or Fukushima, but because the institutional context — the political will following the 1953 flood, the long-term funding commitment, the engineering culture of the project — supported the kind of conservative, margin-rich design that drift and normalisation tend to erode.
The lesson is not that disasters are inevitable. It is that the patterns that produce disasters are persistent, and that addressing them requires not just better equipment, better regulations, or better training, but a sustained institutional commitment to seeing the drift before it reaches the cliff edge. That commitment is expensive, unglamorous, and politically difficult to maintain. It requires organisations to invest in safety margins they hope will never be used, to test failure modes they hope will never occur, and to maintain vigilance against normalisation in the absence of any visible threat. It requires, in short, the engineering culture that built the Thames Barrier — not as a one-time achievement, but as a continuous practice.
A Note from the Author
We are VastBlue Innovations, based in Funchal, Madeira, building AI systems for the industries this series explored — energy, utilities, manufacturing, critical infrastructure. The kind of industries where complex systems interact with physical reality and where the consequences of failure are measured not in lost revenue but in lost lives.
We wrote this series because we believe that anyone who builds systems for critical industries has an obligation to understand how those systems fail. Not in the abstract — not as a chapter in a textbook or a slide in a safety presentation — but forensically, in specific detail, with the evidence laid out and the chain of causation traced from the organisational drift to the operational failure to the investigation that followed. The stories in this series are not historical curiosities. They are case law. They are the accumulated evidence base that tells us what goes wrong, why it goes wrong, and what the conditions look like before it goes wrong.
At VastBlue, we build agentic AI systems that integrate with industrial data — sensor networks, SCADA systems, operational databases, maintenance records. The patterns we documented in this series — the normalisation of deviance, the gap between design and operation, the correlated failures of nominally independent defences, the humans trapped inside inadequate information architectures — are not just historical observations. They are design requirements. Every system we build must be designed with the assumption that drift will occur, that the gap between intended and actual operation will grow, and that the information architecture must serve the human operator, not the other way around.
Anyone who builds systems for critical industries has an obligation to understand how those systems fail. Not in the abstract. Forensically. The stories in this series are not historical curiosities. They are case law.
VastBlue Editorial
The editorial methodology behind What Really Happened reflects the same conviction that drives our engineering work. We read the investigation reports — the full reports, not the executive summaries. We studied the MIIB's three-volume Buncefield report, the BEA's Air France 447 final report, the NAIIC's Fukushima investigation, the INSAG reports on Chernobyl, the Leveson and Turner analyses of the Therac-25, the U.S.-Canada Power System Outage Task Force report on the 2003 blackout. These documents are dense, technical, and often hundreds of pages long. They contain information that secondary sources systematically lose. The difference between reading an investigation report and reading a news article about an investigation report is the difference between understanding a failure and having an opinion about it.
This series has been ten episodes on what goes wrong. It has also been, implicitly, ten episodes on what right looks like — because every investigation, every finding, every recommendation describes not just the failure but the standard against which the failure is measured. The Thames Barrier exists in this series not as a feel-good story but as evidence that complex systems can be designed well, operated well, and maintained well over decades. The Gimli Glider exists not just as a disaster story but as evidence that human skill, judgement, and improvisation remain essential even — especially — in highly automated systems. The stories of failure and the stories of success make the same argument: that the quality of engineering decisions determines whether complex systems protect people or harm them, and that those decisions are made not in moments of crisis but in the years and decades of design, maintenance, and organisational culture that precede them.
What Comes Next: Made in Europe
This was Series 5 — What Really Happened. Ten episodes on industrial disasters, near-misses, and operational saves. Engineering post-mortems. What failed, why it failed, what the investigation found, and what changed because of it.
Next, we begin Series 6 — Made in Europe. A different kind of series entirely. Where What Really Happened was about systems that failed, Made in Europe is about systems that succeeded — European companies and products that compete at the highest level on the global stage, built by engineers whose stories have rarely been told. From ASML's lithography machines to Spotify's audio infrastructure, from Airbus's fly-by-wire systems to the CERN accelerator complex, Made in Europe will profile the technology, the engineering decisions, and the people behind European innovations that the world depends on but rarely credits to Europe.
The two series are connected. What Really Happened demonstrated what happens when engineering culture fails. Made in Europe will demonstrate what happens when it succeeds. Both series ask the same underlying question: what does it take to build complex systems that work? The answer, as ten episodes of forensic investigation have shown, is not genius or heroism. It is discipline, margin, institutional commitment, and the willingness to take the long view. That is what the best European engineering has always offered. Made in Europe will show what it looks like when those qualities are applied not to prevent disaster but to build something extraordinary.
Sources
- Perrow, C. — Normal Accidents: Living with High-Risk Technologies (1984) — https://press.princeton.edu/books/paperback/9780691004129/normal-accidents
- Reason, J. — Managing the Risks of Organizational Accidents (1997) — https://www.routledge.com/Managing-the-Risks-of-Organizational-Accidents/Reason/p/book/9781840141047
- Vaughan, D. — The Challenger Launch Decision (1996) — https://press.uchicago.edu/ucp/books/book/chicago/C/bo22781921.html
- Dekker, S. — Drift into Failure: From Hunting Broken Components to Understanding Complex Systems (2011) — https://www.routledge.com/Drift-into-Failure/Dekker/p/book/9781409422211
- Leveson, N. — Engineering a Safer World: Systems Thinking Applied to Safety (2011) — https://mitpress.mit.edu/9780262016629/engineering-a-safer-world/
- Hollnagel, E. — Safety-I and Safety-II: The Past and Future of Safety Management (2014) — https://www.routledge.com/Safety-I-and-Safety-II/Hollnagel/p/book/9781472423085
- MIIB — Buncefield Major Incident Investigation Board Final Report (2008) — https://www.hse.gov.uk/comah/buncefield/miib-final-report.pdf
- BEA — Final Report on the Accident on 1st June 2009 to the Airbus A330-203, Flight AF 447 (2012) — https://www.bea.aero/docspa/2009/f-cp090601.en/pdf/f-cp090601.en.pdf
- NAIIC — The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission (2012) — https://warp.da.ndl.go.jp/info:ndljp/pid/3856371/naiic.go.jp/en/report/
- INSAG-7 — The Chernobyl Accident: Updating of INSAG-1 (1992) — https://www.iaea.org/publications/3786/the-chernobyl-accident-updating-of-insag-1
- U.S.-Canada Power System Outage Task Force — Final Report on the August 14, 2003 Blackout (2004) — https://www.energy.gov/oe/downloads/blackout-2003-final-report-august-14-2003-blackout-united-states-and-canada-causes-and
- Leveson, N.G. and Turner, C.S. — An Investigation of the Therac-25 Accidents (1993) — https://ieeexplore.ieee.org/document/274940
- Transportation Safety Board of Canada — Final Report on Air Canada Flight 143 (1985) — https://www.tsb.gc.ca/eng/rapports-reports/aviation/1983/a83h0003/a83h0003.html
- Gilbert, S. and Horner, R. — The Thames Barrier (1984) — https://www.icevirtuallibrary.com/doi/book/10.1680/ttb.01111