What is DHCW's data centre? It is the physical infrastructure layer, operated by a third-party supplier, on which national NHS Wales clinical and administrative systems run. It failed in July 2024 (false fire alarm → cooling failback failure → 32 services offline for approximately six hours, three SLA breaches), and again in a near-identical way in June 2025. The CEO called the 2025 recurrence 'a never event'; the Executive Director of Operations admitted to the board that 'we did have another incident like this last year'. Both statements were erased from the published minutes. In March 2026, nine months after the recurrence, the PSBA network layer failed across all of NHS Wales. That makes three consecutive years of major infrastructure failure in the layers beneath the national systems.
What this is
DHCW’s data centre is the physical infrastructure layer on which national NHS Wales clinical and administrative systems run. The facility is operated by a third-party supplier on contract to DHCW. When the data centre is unavailable, the national systems that depend on it are unavailable.
This page covers the documented incidents at the DHCW data centre layer. For the separate March 2026 network-layer outage at the PSBA tier — different supplier, different mechanism, same downstream effect — see PSBA.
July 2024 — first cooling failover failure
A false fire alarm triggered the data centre’s automatic fire-suppression sequence: cooling was switched from primary to backup so that the room could be sealed and fire-suppression gas discharged. On-site staff identified that there was no actual fire and prevented the gas discharge. The system then attempted to fail back from backup cooling to primary cooling. The failback did not work. Equipment in the facility powered itself off to prevent burnout from overheating.
- 32 services affected.
- Approximately 6 hours of outage.
- Three SLA breaches.
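The sequence described above (alarm, switch to backup cooling, prevented gas discharge, failed failback, protective shutdown) can be sketched as a small decision flow. This is an illustrative model only, not the supplier's control logic: the function name `run_cooling_sequence`, its two parameters, and the event strings are all hypothetical, and the sketch does not model temperatures or timings.

```python
def run_cooling_sequence(fire_confirmed: bool, failback_ok: bool) -> list[str]:
    """Return the ordered events for one alarm, under the assumptions above.

    fire_confirmed: whether on-site staff find a real fire (hypothetical input).
    failback_ok: whether the switch from backup back to primary cooling works.
    """
    # Per the account, the alarm first moves cooling from primary to backup.
    events = ["fire alarm raised", "cooling switched from primary to backup"]

    if fire_confirmed:
        # Intended path: seal the room and discharge suppression gas.
        events.append("room sealed and suppression gas discharged")
        return events

    # False alarm path: staff stop the discharge, then failback is attempted.
    events.append("staff confirm no fire and prevent gas discharge")
    events.append("failback to primary cooling attempted")

    if failback_ok:
        events.append("primary cooling restored")
    else:
        # The path reported for July 2024: failback fails, equipment shuts down.
        events.append("failback failed")
        events.append("equipment powers off to prevent overheating")
    return events


if __name__ == "__main__":
    for step in run_cooling_sequence(fire_confirmed=False, failback_ok=False):
        print(step)
```

Calling the function with `fire_confirmed=False, failback_ok=False` walks the path reported for July 2024. The only branch that ends in a shutdown is the one where the failback step fails, which is why that step is the natural focus for any review of the supplier's maintenance regime.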
The incident was reported to the DHCW board in July 2024 as an amber KPI dip. No board member asked which services were affected, whether any were clinical, or whether patients were harmed. The recorded action was a future independent review of the data centre supplier’s maintenance regime.
June 2025 — near-identical recurrence
The same failure pattern recurred twelve months later. Same data centre. Same cooling-failover mechanism. Same downstream effect.
At the July 2025 board meeting, the Executive Director of Operations, Sam Lloyd, told the board: “we did have another incident like this last year.” The CEO, Helen Thomas, said:
“this should really be a never event in terms of the level of data centres that we commission. So there is that — there’s a lot of work for us to do with the data centre providers to ensure that they can, you know, give us reassurance so this is a never event and it will never happen again.”
Both statements were erased from the published minutes: the prior-incident acknowledgement, and the CEO’s “never event” framing, which is the strongest patient-safety language available to a chief executive. The published version of the meeting therefore omits both the fact that this had happened before and the CEO’s own characterisation of the recurrence.
What survived in the minutes: a generic note that “an independent review of the data centre will be commissioned.” What was stripped: the causal chain (false fire alarm → cooling failback failure), the contractual-scrutiny dimension flagged by the Executive Director of Operations, and the fact that this was the second occurrence.
March 2026 — PSBA layer, same downstream signature
In March 2026, a different infrastructure layer — the PSBA network, contracted by Welsh Government, not DHCW — failed across all of NHS Wales. O365, EPMA, RISP, and radiology went offline simultaneously across every health board for several hours. See PSBA for the detail.
This was the third consecutive year of major infrastructure failure affecting NHS Wales digital services. Different layer. Different supplier contract. Same downstream signature: multiple national systems offline simultaneously, clinical services degraded across the country, no fallback architecture engaged.
Why this matters structurally
Three observations:
- The 2024 and 2025 incidents were near-identical. The corrective action between them was visibly inadequate: the same root-cause sequence produced the same failure twelve months apart. No remediation specific to the data centre supplier’s maintenance regime appears in any published assurance output between the two events.
- The “never event” erasure is the cleanest sanitisation example on record: a CEO used the strongest patient-safety language available, in the room, and that language was then stripped from the published account. See L6: The Manufactured Narrative.
- The Performance and Delivery Committee (PDC) went through an eighteen-month window with zero corrective actions, and that window spanned the gap between the two incidents. The PDC generated no corrective action specific to the data centre supplier between July 2024 and June 2025, despite the operational risk being live and known. See L11: Captured Governance.
Where this is discussed in the diagnosis
- Drift to Low Performance — the 2024 and 2025 cooling failovers as the canonical drift example.
- L6: The Manufactured Narrative — the “never event” and prior-incident erasures.
- L11: Captured Governance — PDC failure to act between the two incidents.
- L5: The Vendor Dependency Spiral — supplier-dependency context.
- Intervention 1: Competent Leadership — patient-safety triage of live national infrastructure.
- PSBA — the March 2026 network-layer outage.