The Therac-25
When software killed patients
By VastBlue Editorial · 2026-03-26 · 20 min read
Series: What Really Happened · Episode 7
The Machine That Trusted Its Own Code
The Therac-25 was a medical linear accelerator, a machine that generates beams of high-energy radiation to destroy cancerous tumours. It was built by Atomic Energy of Canada Limited (AECL); its predecessors had been developed jointly with the French company CGR, a partnership that ended before the Therac-25 was completed. It entered clinical service in 1982. By the standards of the medical device industry, it was a modern and sophisticated machine. It could operate in two modes: a low-energy electron beam mode, used for shallow tumours near the skin surface, and a high-energy X-ray mode, used for deep-seated tumours that required greater penetrating power. The ability to switch between these modes, treating both shallow and deep cancers with a single machine, was a significant clinical advantage.
The distinction between the two modes matters enormously. In electron mode, the machine fired electrons directly at the patient at relatively low energy. In X-ray mode, it fired electrons at far higher energy into a tungsten target, which converted the kinetic energy into X-rays. The raw beam emerging from this target was dangerously intense and narrow. Before it reached the patient, it had to pass through a flattening filter and an ion chamber that measured dose delivery. If the machine fired its full-energy electron beam without the target and filter in place, the patient would receive not a therapeutic dose of X-rays, but a concentrated, unfiltered blast of electrons at full accelerator power: on the order of a hundred times the intended dose, delivered in a fraction of a second.
The Therac-25 had two predecessors: the Therac-6 and the Therac-20. Both machines solved this safety problem the way engineers had done for decades: with hardware interlocks, physical electromechanical devices that independently prevented the machine from firing if the beam path was not correctly configured. These interlocks operated independently of any software. They could not be bypassed by a programming error or a timing glitch. If the target was not in position, the interlock physically blocked the beam.
The Therac-25 removed them. AECL's designers decided that the hardware interlocks were no longer necessary because the software controlling the machine would ensure safe operation. The PDP-11 minicomputer running the Therac-25's control software would check the beam configuration, verify the turntable position, monitor dose delivery, and prevent any unsafe combination of settings. The software would be the interlock. This decision — to replace proven, independent, hardware-based safety mechanisms with software — was the foundational design choice from which everything else followed.
Kennestone, Hamilton, Yakima, Tyler
The first known accident occurred on 3 June 1985 at the Kennestone Regional Oncology Center in Marietta, Georgia. A sixty-one-year-old woman was being treated for breast cancer. During treatment, the machine shut down with a "Malfunction 54" error message displayed on the operator's console. The operator — as she had been trained to do when encountering these relatively common malfunction codes — pressed the "P" key to proceed, restarting the treatment. The machine shut down again. She pressed proceed again. The patient reported feeling a burning sensation and described it as an "electric shock." She had been massively overdosed. Over the following months she developed radiation burns to her left breast, arm, and shoulder. She required a mastectomy and lost the use of her shoulder. She survived, but with permanent injury.
AECL investigated. They could not reproduce the problem. They concluded that the malfunction was likely caused by a transient hardware fault — perhaps electrical noise or a micro-switch failure. They made minor modifications and returned the machine to service. No formal report was filed with any regulatory body. No software review was conducted. The error code "Malfunction 54" was, in the machine's documentation, described as a "dose input 2" error, which conveyed almost nothing useful to the operator about what had actually gone wrong or how dangerous the situation might be.
On 26 July 1985, a patient at a treatment centre in Hamilton, Ontario, was undergoing radiation treatment for cervical cancer. During treatment, the machine again malfunctioned, pausing with cryptic error messages, and the operator pressed proceed several times. The patient received a massive overdose to her hip and developed severe radiation injuries. She died on 3 November 1985. An autopsy attributed her death to the cancer itself, which was extremely aggressive, but noted that had she survived, the radiation overexposure would have required a total hip replacement.
In December 1985, a patient at the Yakima Valley Memorial Hospital in Yakima, Washington, developed a striking pattern of reddened skin on her hip after treatment: parallel stripes matching the open slots of the machine's blocking tray. The hospital staff suspected the Therac-25 and wrote to AECL. The company replied that neither a machine malfunction nor an operator error could have produced the injury, and the matter was dropped. The patient survived, with scarring and some lasting disability. Again, the software was not seriously examined.
The error code "Malfunction 54" conveyed almost nothing useful to the operator about what had actually gone wrong or how dangerous the situation might be. The operator did exactly what she had been trained to do: she pressed proceed.
Based on Leveson & Turner, IEEE Computer, 1993
On 21 March 1986, a patient at the East Texas Cancer Center in Tyler, Texas, was scheduled for electron beam treatment. The operator, an experienced technician, set up the machine, mistakenly entering "x" for X-ray mode. When she noticed the error, she used the cursor keys to move back up the screen, changed the mode to "e" (electron), corrected the energy level, and moved the cursor back down. The entire editing sequence took about eight seconds. She pressed the key to turn the beam on. The machine displayed "Treatment Paused" and then "Malfunction 54." The dose monitor showed a very low reading — essentially zero. The operator assumed the machine had delivered little or no dose. She pressed "P" to proceed. The machine fired again. Again, "Malfunction 54." Again, a near-zero dose reading.
The patient, a man named Ray Cox, felt as though he had been struck in the back. He described it later as feeling like someone had poured hot coffee on him. He tried to get up from the treatment table. The operator went into the treatment room. Cox told her the machine had burned him. But the machine showed no indication of any overdose — the dose monitor indicated he had received almost no radiation at all. The hospital contacted AECL. The company's response was consistent with its previous investigations: it was not possible for the Therac-25 to overdose a patient. The software would not allow it.
Ray Cox died five months later, on 20 September 1986. His death certificate listed the cause of death as an overdose of radiation. He was thirty-three years old.
Three weeks after Cox was overdosed, on 11 April 1986, another patient at the same Tyler facility was overdosed in exactly the same way. Verdon Kidd, a sixty-six-year-old man being treated for skin cancer, was struck by the beam while on the treatment table. He told the operator he could "see light" and "feel heat." He died on 1 May 1986. His death certificate also listed acute radiation overdose as the cause.
The sixth accident occurred on 17 January 1987, back at the Yakima Valley Memorial Hospital. A patient received a massive overdose during treatment and died three months later from complications consistent with radiation overexposure. In total, between June 1985 and January 1987, six patients were known to have been massively overdosed by the Therac-25. Three of them died as a result. The survivors were left with serious, and in some cases permanent, injuries.
The Race Condition
The Therac-25 control software ran on a DEC PDP-11/23 minicomputer. The software was written in PDP-11 assembly language by a single programmer who had also written the software for the Therac-6. Portions of the Therac-6 code were reused in the Therac-25, but the context in which they operated was fundamentally different. On the Therac-6, the software operated alongside hardware interlocks. If the software failed to detect a dangerous condition, the hardware interlocks would prevent harm. On the Therac-25, the software was the only protection. Code that had been adequate as one layer of defence in a multi-layered system was now the sole layer. It had been written for a context that no longer existed.
The software operated as a set of concurrent tasks managed by a simple real-time executive — essentially a minimal operating system. Different tasks handled different aspects of machine operation: reading the operator's console input, positioning the turntable and other beam-shaping components, monitoring dose delivery, and managing the treatment sequence. These tasks ran concurrently, sharing access to the same global variables, using flags and shared memory locations to coordinate their activities.
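The hazard in this arrangement is easiest to see in miniature. The sketch below is illustrative only, not AECL's code, and every name in it is hypothetical. It compresses the pattern Leveson and Turner describe: one task latches a shared variable, runs a slow multi-step sequence, and never notices that another task has changed the variable partway through.

```c
#include <stdio.h>

/* Illustrative sketch only, not AECL's code. Two "tasks" coordinate
 * through an unprotected global, and a slow setup sequence derives two
 * outputs from values read at different times. */

enum mode { MODE_XRAY, MODE_ELECTRON };

static enum mode requested_mode = MODE_XRAY; /* written by console task */
static enum mode beam_setting;               /* magnets / beam current  */
static enum mode turntable_pos;              /* target in or out        */

static void setup_task(void)
{
    /* Step 1: set the bending magnets from the shared prescription.
     * On the real machine this step alone took about eight seconds. */
    beam_setting = requested_mode;

    /* During that window the console task processes an operator edit.
     * Simulated inline here; on the Therac-25 it ran concurrently. */
    requested_mode = MODE_ELECTRON;

    /* Step 2: position the turntable from the SAME shared variable,
     * which now holds a different value than step 1 saw. */
    turntable_pos = requested_mode;
}

int main(void)
{
    setup_task();
    if (beam_setting != turntable_pos)
        puts("HAZARD: X-ray beam current with electron turntable position");
    return 0;
}
```

Nothing here is exotic. The defect is the absence of any rule that a configuration sequence must work from one consistent snapshot of its inputs.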
The Tyler accidents revealed what was happening. The prescription the operator typed was held in shared variables, and the task that configured the machine read those variables on its own schedule. When the operator completed the initial setup, the software began configuring the machine for X-ray mode, a sequence that included setting the bending magnets and took roughly eight seconds. If the operator cursored back up and changed the mode from "x" to "e" inside that window, the keyboard handler updated the shared variables and the turntable was repositioned for electron mode. But the configuration sequence already in flight never re-read the edited values: a shared flag that should have signalled "editing still in progress" was checked only once, at the start. The machine ended up in an inconsistent state. The turntable sat in the electron position, with no tungsten target and no flattening filter in the beam path, while the accelerator was still set to deliver the far higher beam current used in X-ray mode, where the target normally absorbs most of the beam's energy. The raw, unscanned beam struck the patient directly.
A second, independent flaw was implicated in the final accident at Yakima. A one-byte shared variable told the software whether the set-up checks, including the turntable position check, still needed to run. Instead of assigning it a fixed non-zero value, the code incremented it on every pass through the set-up loop, so it rolled over to zero once every 256 passes. If the operator happened to press the "set" key at the instant the variable read zero, the software concluded that no checks were pending and fired with the turntable out of position. In both failure modes the console reported almost no dose delivered, because the ion chambers either saturated under the enormous pulse or were out of the beam path entirely, even though the patient had received thousands of rads in a fraction of a second.
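The rollover flaw is small enough to show directly. What follows is a minimal sketch with hypothetical names (the variable in Leveson and Turner's account is called Class3): a one-byte counter is used where a boolean flag was meant, so on every 256th pass through the loop it reads as "no checks pending."

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the rollover flaw; names hypothetical. A one-byte counter,
 * incremented on every pass through the set-up loop, doubles as a
 * "checks still pending" flag. Every 256th pass it wraps to zero, and
 * zero is read as "nothing to check". */

static uint8_t checks_pending;  /* stands in for the Class3 variable */

static int turntable_in_position(void) { return 0; }  /* simulate: NOT safe */

static void setup_loop_pass(void)
{
    checks_pending++;           /* wraps 255 -> 0 on the 256th pass */
}

static int operator_presses_set(void)
{
    if (checks_pending != 0 && !turntable_in_position())
        return 0;               /* check runs, unsafe firing blocked */
    return 1;                   /* checks_pending == 0: check skipped */
}

int main(void)
{
    for (int pass = 1; pass <= 256; pass++) {
        setup_loop_pass();
        if (operator_presses_set())
            printf("pass %d: beam fires with turntable out of position\n",
                   pass);
    }
    return 0;
}
```

A fixed assignment (`checks_pending = 1`) or a wider type would have masked the symptom; the deeper lesson is that a safety check should never be gated on state that can silently wrap.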
The Tyler flaw was a race condition in the strict sense. It required a specific sequence of actions performed within a specific time window. It required the operator to be fast: to enter the initial settings, notice an error, and complete the edit within about eight seconds. Experienced operators, the kind who had been using the machine daily for months and had become quick and efficient with the console interface, were more likely to trigger it. Slower, less experienced operators were less likely to type fast enough. The machine punished expertise.
The race condition required the operator to be fast. Experienced operators — those who had used the machine daily for months — were more likely to trigger it. The machine punished expertise.
Based on Leveson & Turner analysis, 1993
What the Investigation Found
The definitive analysis of the Therac-25 accidents was conducted by Nancy Leveson, then a professor of computer science at the University of California, Irvine (later at MIT), and Clark Turner, then a graduate researcher at Irvine who later became a professor at California Polytechnic State University. Their 1993 paper, published in IEEE Computer, remains one of the most cited papers in software engineering history. It is also one of the most disturbing. What they found was not a single bug in an otherwise well-engineered system. They found a pattern of systemic failures in design, development, testing, management, and regulation that was far more alarming than any individual software defect.
The software had never been subjected to systematic testing against safety requirements. There were no formal specifications defining safe and unsafe states. Software hazard analysis had not been applied. No recognised standard for safety-critical medical device software existed at the time — and AECL had not attempted to create internal equivalents.
- The software was written by a single programmer in assembly language with no independent code review.
- No formal specification existed that defined the safety requirements the software had to satisfy.
- The software reused code from the Therac-6, where that code operated alongside hardware interlocks — those interlocks no longer existed.
- Concurrent tasks shared global variables with no synchronisation primitives: no mutexes, no semaphores, no critical sections (a minimal corrective sketch follows this list).
- Error messages displayed to operators were cryptic numeric codes that conveyed no information about severity or appropriate response.
- The software allowed operators to override error conditions and restart treatment with a single keystroke.
- AECL's testing consisted primarily of running the machine and observing whether it appeared to work — functional testing at the system level, with no unit testing, no boundary testing, and no stress testing of the software.
- The company's quality assurance process for the software was, by later accounts, essentially nonexistent.
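To make the synchronisation point concrete: the sketch below is a minimal illustration, not a reconstruction of AECL's code, and all names are hypothetical. It shows the discipline whose absence Leveson and Turner documented. Every access to the shared prescription goes through one lock, and the slow configuration sequence works from a single snapshot, then verifies that nothing changed before enabling the beam.

```c
#include <pthread.h>

/* Minimal sketch, not AECL's code: shared state behind one mutex, and a
 * setup sequence driven by a consistent snapshot that is re-verified
 * before the beam is enabled. */

struct prescription {
    int mode;    /* 0 = X-ray, 1 = electron */
    int energy;  /* MeV */
};

static struct prescription shared_rx;
static pthread_mutex_t rx_lock = PTHREAD_MUTEX_INITIALIZER;

/* Console task: operator edits are applied atomically. */
static void apply_edit(int mode, int energy)
{
    pthread_mutex_lock(&rx_lock);
    shared_rx.mode = mode;
    shared_rx.energy = energy;
    pthread_mutex_unlock(&rx_lock);
}

/* Setup task: snapshot, configure, then confirm the prescription was
 * not edited while the slow hardware sequence ran. */
static int configure_and_verify(void)
{
    pthread_mutex_lock(&rx_lock);
    struct prescription snap = shared_rx;  /* one consistent read */
    pthread_mutex_unlock(&rx_lock);

    /* ... slow magnet and turntable configuration, driven only by
     * `snap`, never by the live shared variables ... */

    pthread_mutex_lock(&rx_lock);
    int unchanged = (shared_rx.mode == snap.mode &&
                     shared_rx.energy == snap.energy);
    pthread_mutex_unlock(&rx_lock);

    return unchanged;  /* 0 means: edited mid-setup, run setup again */
}

int main(void)
{
    apply_edit(1, 9);                       /* operator selects electrons */
    return configure_and_verify() ? 0 : 1;  /* proceed only if still valid */
}
```

Even this would not by itself have made the Therac-25 safe; it addresses only the data race, not the missing independent interlock.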
Leveson and Turner identified a further, deeply significant problem: AECL's response to the early accidents. After each reported incident, the company investigated, failed to reproduce the problem, attributed it to transient hardware faults, and returned the machine to service. This pattern — investigate, fail to reproduce, blame hardware, clear the machine — repeated multiple times. The company's institutional assumption was that the software was reliable. This assumption was not based on evidence of testing or verification. It was based on the fact that the software had been running for some time without detected problems. The absence of evidence was treated as evidence of absence.
When the Tyler physicist, Fritz Hager, attempted to investigate the overdose of Ray Cox, AECL initially resisted the idea that the machine could have been at fault. Hager persisted. Working with the machine's operator, he attempted to reproduce the accident by deliberately entering settings quickly and editing them in the same sequence she had used. On the sixth attempt, they succeeded. The machine fired. The dose monitor showed near-zero. They had reproduced the race condition. Hager reported his findings to the FDA. It was the physicist at a hospital in Texas, not the manufacturer in Canada, who identified the failure mechanism.
It was the physicist at a hospital in Texas, not the manufacturer in Canada, who identified the failure mechanism. Fritz Hager reproduced the race condition on his sixth attempt.
Leveson & Turner, IEEE Computer, July 1993
The FDA's involvement exposed further problems. At the time, the FDA's regulatory framework for medical devices was designed primarily around hardware. Software was not well understood by the regulatory apparatus, and the reporting requirements for device malfunctions were inconsistent. AECL had not reported the early accidents to the FDA in a timely manner. When reports were filed, they often minimised the severity of the incidents. The regulatory system, like the engineering process, was not equipped to deal with the failure modes that software introduced.
The Deeper Lesson: Systems, Not Bugs
The temptation, when reading about the Therac-25, is to focus on the race condition. It is a satisfying narrative: here is the bug, here is how it worked, here is how the operator triggered it. But Leveson's deeper point — the reason the Therac-25 paper has endured for over three decades as a teaching text — is that the race condition was a symptom. The disease was the system within which it existed.
Consider what had to be true for the race condition to kill a patient. The hardware interlocks had to have been removed. The software had to have a concurrency error. The error messages had to be uninformative. The operator interface had to allow treatment to resume after an error with a single keystroke. The dose monitoring had to fail when the turntable was in the wrong position. AECL had to fail to identify the problem after multiple reported incidents. The regulatory framework had to be inadequate for catching software defects in medical devices. Every one of these conditions was necessary. None of them, alone, would have been sufficient.
The removal of hardware interlocks was perhaps the most consequential decision. In safety engineering, defence in depth — multiple independent layers of protection — is fundamental. The Therac-6 and Therac-20 had it: software monitored the machine, and hardware interlocks independently prevented dangerous configurations. The Therac-25 collapsed these layers into one. When the software failed, there was nothing else. The assumption was that software, once written, is perfectly reliable. This assumption was wrong, and it was known to be wrong at the time. But it was convenient, and it was not challenged by a regulatory system that did not yet know how to evaluate software.
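The structural point is simple enough to express in a few lines. The sketch below is schematic, not any real device's logic, and all names are hypothetical. It shows what defence in depth means at the fire-enable decision: an independent physical signal and the software's own model must each agree, and either can veto the other.

```c
#include <stdbool.h>
#include <stdio.h>

/* Schematic sketch of defence in depth; names hypothetical. The beam is
 * enabled only when the software's model AND an independent hardware
 * signal both say the beam path is safe. */

typedef enum { MODE_ELECTRON, MODE_XRAY } beam_mode;

/* Independent input: a microswitch that closes only when the tungsten
 * target is physically in the beam path. Simulated here as open,
 * i.e. the target is NOT in place. */
static bool target_switch_closed(void) { return false; }

/* The software's belief about the machine, which may be wrong. */
static bool software_model_safe(beam_mode m) { (void)m; return true; }

static bool beam_enable(beam_mode m)
{
    /* X-ray mode demands the target physically present; electron mode
     * demands it absent. Neither layer can override the other. */
    bool hardware_safe = (m == MODE_XRAY) ?  target_switch_closed()
                                          : !target_switch_closed();
    return software_model_safe(m) && hardware_safe;
}

int main(void)
{
    if (!beam_enable(MODE_XRAY))
        puts("interlock veto: target not in place, beam stays off");
    return 0;
}
```

On the Therac-20 the equivalent veto was a physical circuit, so no software defect could argue with it; the Therac-25 reduced the conjunction to its software half.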
The operator interface compounded the problem. The Therac-25 displayed cryptic error codes that conveyed no information about severity. Operators encountered them frequently; most of the time, they were trivial. Operators developed the habit of pressing "P" to proceed automatically. When "Malfunction 54" appeared because of a race condition that had just delivered twenty-five thousand rads to a patient, the operator saw the same message she had seen dozens of times before for harmless reasons. Each press of "P" fired the beam again. The interface was training operators to ignore the very signals that were supposed to protect patients. This phenomenon — now well-documented as "alarm fatigue" — remains one of the most persistent failure modes in safety-critical human-machine interaction.
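The interface lesson can also be made concrete. The sketch below is a hypothetical design, not the Therac-25's actual fault tables: each malfunction carries a plain-language explanation and a severity, and dose-related faults cannot be cleared with the same single keystroke as routine ones.

```c
#include <stdio.h>

/* Hypothetical fault table, not AECL's: codes are paired with an
 * explanation and a severity, and the response demanded of the
 * operator scales with the severity. */

enum severity { INFO, RECOVERABLE, DOSE_CRITICAL };

struct fault {
    int           code;
    enum severity sev;
    const char   *explanation;
};

static const struct fault faults[] = {
    { 12, RECOVERABLE,   "couch position outside tolerance" },
    { 54, DOSE_CRITICAL, "measured dose disagrees with commanded dose" },
};

static void report(const struct fault *f)
{
    printf("MALFUNCTION %d: %s\n", f->code, f->explanation);
    if (f->sev == DOSE_CRITICAL)
        puts("treatment locked out; physicist sign-off required to resume");
    else
        puts("press P to proceed");
}

int main(void)
{
    report(&faults[1]); /* the fault class a single 'P' used to clear */
    return 0;
}
```

A design like this does not eliminate alarm fatigue, but it stops the interface from pricing a lethal fault and a trivial one at the same keystroke.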
What Changed Because of It
The Therac-25 accidents did not merely prompt a recall of a defective machine. They catalysed the creation of an entirely new discipline. Before the Therac-25, safety engineering was primarily concerned with hardware. Software was treated as something different — logical, either correct or incorrect. The Therac-25 demonstrated that this distinction was false. Software could fail. Software could kill. And its failure modes were different from hardware in ways that demanded new methods of analysis.
Nancy Leveson's 1995 book, "Safeware: System Safety and Computers," became the foundational text of the software safety field. Her later work on STAMP — Systems-Theoretic Accident Model and Processes — developed a framework for understanding accidents in complex sociotechnical systems that moved beyond linear cause-and-effect models. The FDA fundamentally reformed its approach to medical device software, beginning in 1987 with new guidance and creating a dedicated software review unit. Medical devices containing software were now subject to scrutiny that, before the Therac-25, simply did not exist.
- IEC 62304, the international standard for medical device software lifecycle processes, was developed in direct response to the need identified by incidents like the Therac-25.
- The FDA's General Principles of Software Validation (2002) established formal requirements for verifying that medical device software performs as intended and does not introduce unacceptable risks.
- DO-178B (later DO-178C) for avionics software, while developed independently, was influenced by the same recognition that software in safety-critical systems required formal, evidence-based assurance methods.
- IEC 61508, the general standard for functional safety of electrical, electronic, and programmable electronic safety-related systems, incorporated principles of software safety integrity levels (SILs) that were informed by the failures the Therac-25 exemplified.
- The emerging field of formal methods — mathematical techniques for specifying, developing, and verifying software — received significant impetus from the recognition that testing alone was insufficient for safety-critical software.
The principle that hardware interlocks should not be removed simply because software performs the same function became a near-absolute rule in safety-critical design. Defence in depth was codified across industries. Software could monitor, control, and optimise — but it could not be the sole barrier against catastrophe. The American Association of Physicists in Medicine updated its protocols, and hospitals began treating medical device software as a component requiring independent verification.
The Weight of What Happened
It is important, at the end of an analysis like this, to remember what the Therac-25 story is actually about. It is about patients who went to a hospital to be treated for cancer and were instead injured or killed by the machine that was supposed to help them. Katie Yarborough, the woman injured at Kennestone, lived with permanent disability. The patients in Hamilton suffered grievously. Ray Cox and Verdon Kidd, in Tyler, died within weeks of each other. These were not abstract failures. They were people, in treatment rooms, who felt something go terribly wrong and were told — by operators, by technicians, by the manufacturer — that the machine could not have done what it had done.
AECL's response raises questions beyond engineering. The company's repeated insistence that the software could not be at fault, the delays in reporting to regulators, the attribution of patient injuries to hardware transients without evidence — these are failures of responsibility, not merely of engineering process.
The Therac-25 is now among the most widely taught case studies in software engineering ethics. More than three decades after the last patient was overdosed, the lessons have not been fully learned. Systems are still deployed with insufficient testing. Software is still trusted beyond what the evidence supports. Warning systems still train operators to ignore them. The race condition in the Therac-25 was fixed. The deeper conditions that allowed it to kill patients (overconfidence in software, inadequate regulation, the removal of independent safety layers, the normalisation of deviance) remain present in systems being designed and deployed today.
The race condition in the Therac-25 was fixed. The deeper conditions that allowed it to kill patients — overconfidence in software, inadequate regulation, the removal of independent safety layers — remain present in systems being designed and deployed today.
After Leveson, Safeware, 1995
Sources
- Leveson, N. & Turner, C. — An Investigation of the Therac-25 Accidents (1993) — https://ieeexplore.ieee.org/document/274940
- Leveson, N. — Safeware: System Safety and Computers (1995) — https://mitpress.mit.edu/9780201119725/safeware/
- FDA — General Principles of Software Validation (2002) — https://www.fda.gov/regulatory-information/search-fda-guidance-documents/general-principles-software-validation
- IEC 62304:2006 — Medical Device Software Lifecycle Processes — https://www.iso.org/standard/38421.html
- Leveson, N. — Engineering a Safer World (2011) — https://mitpress.mit.edu/9780262016629/engineering-a-safer-world/
- Casey, S. — Set Phasers on Stun: And Other True Tales of Design, Technology, and Human Error (1993) — https://www.amazon.com/Set-Phasers-Stun-Design-Technology/dp/0963617885
- AECL — Therac-25 Incident Reports and Correspondence (1985-1987) — https://courses.cs.washington.edu/courses/cse403/16au/lectures/L14_Therac.pdf