The "Coarse Metric" Fallacy
Relying on a single MTTR number is like judging a factory's efficiency by its "average time to ship" while ignoring the separate stages of order processing, manufacturing, quality assurance, and logistics. With only that number, we have been flying blind, unable to answer the questions that matter most:
Was the delay caused by slow detection?
Did we struggle to identify the blast radius?
Did it take hours to find a clean, uncompromised backup?
Was the application team simply not ready to validate the restore?
Real-world disasters illustrate this problem. A 2024 incident impacting healthcare operations in the United States is a stark example. The public-facing metric was a "month-long outage," but this coarse number provides no actionable insight. The reality was a breakdown across multiple, unmeasured phases. The company took systems offline "as a precaution," which suggests it could not quickly determine the scope of the breach. Even after data was technically restored, providers hesitated to reconnect because they did not trust the restored systems, which points to a failure in validating the recovery and restoring business confidence.
This problem is not unique to healthcare. Consider the 2019 ransomware attack on a major global manufacturing firm. The public metrics were "a $70M+ financial loss" and a "month-long disruption." Again, these are coarse metrics. The reality was a multi-stage failure: a forced shutdown of the entire global network (a failure in Scoping) and tens of thousands of employees forced to use pen and paper for weeks (a failure in Validation).
These "dark matter" metrics—the time spent in non-restore phases—are where true resilience is won or lost.
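To make the idea of phase-level measurement concrete, here is a minimal sketch of how a single recovery time could be split into the phases behind the questions above. The milestone names, phase names, and timestamps are hypothetical and chosen only for illustration; they are not taken from either incident described here, and a real incident timeline would likely need more (or different) milestones.

```python
from datetime import datetime, timedelta

# Hypothetical milestones for one incident, in chronological order.
# Names and timestamps are illustrative assumptions, not real incident data.
MILESTONES = [
    ("incident_start", datetime(2024, 3, 1, 2, 0)),
    ("detected", datetime(2024, 3, 1, 8, 0)),
    ("scope_confirmed", datetime(2024, 3, 2, 2, 0)),
    ("clean_backup_identified", datetime(2024, 3, 2, 14, 0)),
    ("restore_complete", datetime(2024, 3, 2, 18, 0)),
    ("validated_by_app_team", datetime(2024, 3, 3, 2, 0)),
]

# Each phase is the interval between two consecutive milestones.
PHASES = ["detection", "scoping", "backup_selection", "restore", "validation"]


def phase_durations(milestones):
    """Return a dict mapping each phase name to the time spent in it."""
    if len(milestones) != len(PHASES) + 1:
        raise ValueError("expected one more milestone than phases")
    durations = {}
    for phase, (_, start), (_, end) in zip(PHASES, milestones, milestones[1:]):
        if end < start:
            raise ValueError(f"milestones out of order in phase {phase!r}")
        durations[phase] = end - start
    return durations


durations = phase_durations(MILESTONES)

# The coarse metric: one number for the whole incident.
total = sum(durations.values(), timedelta())

# The "dark matter": everything that is not the restore itself.
non_restore = total - durations["restore"]

for phase, spent in durations.items():
    share = spent / total  # timedelta / timedelta yields a float
    print(f"{phase:<18} {spent}  ({share:.0%} of total)")
print(f"total              {total}")
print(f"non-restore        {non_restore}")
```

In this made-up timeline the restore itself accounts for 4 of the 48 hours; the remaining 44 hours sit in detection, scoping, backup selection, and validation, which a single MTTR figure would hide.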