When a Sepsis Algorithm Scaled to Hundreds of Hospitals and Missed 67% of Cases
Major EHR Vendor
Internal AUC: 0.76-0.83
External AUC: 0.63
Sensitivity: 33%
False alert rate: 18%
The Challenge
Internal validation showed an AUC of 0.76-0.83, suggesting strong performance. However, external validation by independent researchers found an AUC of only 0.63 with just 33% sensitivity in real clinical settings.
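The gap between those figures is easy to reproduce once the evaluation separates the internal test set from truly external data. Below is a minimal sketch, using hypothetical label/score arrays (y_internal, p_internal, y_external, p_external are placeholders, not the vendor's data or pipeline), of how AUC and sensitivity at a fixed alert threshold are typically computed for each setting:

```python
# Minimal sketch: comparing internal vs. external discrimination and sensitivity.
# All data here is synthetic and stands in for the two evaluation settings.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, scores, threshold):
    """Report AUC plus sensitivity and alert burden at a fixed alerting threshold."""
    auc = roc_auc_score(y_true, scores)
    alerts = scores >= threshold
    sensitivity = alerts[y_true == 1].mean()   # share of true sepsis cases flagged
    alert_rate = alerts.mean()                 # share of all patients flagged
    return auc, sensitivity, alert_rate

# Placeholder data: the external scores carry less signal than the internal ones.
rng = np.random.default_rng(0)
y_internal = rng.integers(0, 2, 5000)
p_internal = np.clip(y_internal * 0.3 + rng.normal(0.4, 0.2, 5000), 0, 1)
y_external = rng.integers(0, 2, 5000)
p_external = np.clip(y_external * 0.1 + rng.normal(0.4, 0.25, 5000), 0, 1)

print("internal:", evaluate(y_internal, p_internal, threshold=0.6))
print("external:", evaluate(y_external, p_external, threshold=0.6))
```

The point of the sketch is that a model can look strong on data drawn from its own training environment and still discriminate poorly at sites with different patient mixes, documentation practices, and base rates.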
The Approach
The algorithm was deployed at scale across hundreds of hospitals based on internal validation metrics. No site-specific calibration was performed. The assumption was that performance would generalize across diverse hospital environments.
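The missing step, site-specific calibration, is small relative to the cost of deployment. A hedged sketch of one common approach is shown here: Platt scaling fit on a local validation sample, with each site then choosing its own alert threshold. The names local_scores and local_labels are assumed placeholders, and this is not the vendor's actual procedure.

```python
# Sketch of per-site recalibration via Platt scaling (logistic regression on raw model scores).
# local_scores / local_labels are hypothetical data from one hospital's own validation sample.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_site_calibrator(local_scores, local_labels):
    """Map the vendor model's raw scores to probabilities calibrated for this site."""
    calibrator = LogisticRegression()
    calibrator.fit(local_scores.reshape(-1, 1), local_labels)
    return calibrator

def calibrated_risk(calibrator, raw_scores):
    """Return site-calibrated sepsis risk estimates for new patients."""
    return calibrator.predict_proba(raw_scores.reshape(-1, 1))[:, 1]

# Usage idea: before go-live, each site fits its own calibrator, then picks an alert
# threshold that meets a locally acceptable sensitivity / alert-burden trade-off.
```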
The Results
Physicians had to evaluate 109 flagged patients to find just one who required intervention, and 67% of actual sepsis cases were missed entirely. Research showed that debiasing techniques only worked within the same hospital system where the model was trained.
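The two headline numbers follow directly from the reported sensitivity and alert yield. A short worked check: the 1-in-109 figure implies a positive predictive value of roughly 0.9%.

```python
# Back-of-the-envelope check on the reported figures.
sensitivity = 0.33                 # share of true sepsis cases the model flags
patients_per_true_alert = 109      # flagged patients reviewed per actionable case

miss_rate = 1 - sensitivity                  # 0.67 -> "67% of cases missed"
ppv = 1 / patients_per_true_alert            # ~0.009 -> under 1% of alerts are actionable

print(f"miss rate: {miss_rate:.0%}, positive predictive value: {ppv:.1%}")
# -> miss rate: 67%, positive predictive value: 0.9%
```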
Seven Pillar Insights
Scaling an algorithm across hundreds of hospitals without site-specific calibration turned a potentially useful tool into a dangerous source of alert fatigue and missed diagnoses.
In healthcare, false negatives (here, 67% of sepsis cases missed) are life-threatening. Scale amplifies model errors, not just model benefits.
Key Lessons
Algorithms validated internally fail when scaled across diverse environments
Site-specific calibration is mandatory at each deployment location
Alert fatigue from false positives can be as dangerous as missed cases