Study: ChatGPT Health fails critical emergency and suicide tests
Occurred: February 2026
Page published: February 2026
ChatGPT Health missed or under-triaged over half of serious medical emergencies and triggered suicide-crisis guidance unpredictably, raising safety concerns about its widespread use for urgent health advice.
Researchers at the Icahn School of Medicine at Mount Sinai evaluated ChatGPT Health, an AI-powered health guidance tool soft-launched in January 2026 and used by tens of millions daily for medical advice.
The study tested 60 clinician-authored clinical scenarios (spanning 21 specialties) across 16 contextual variations, generating 960 total interactions with the AI.
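The study design is a full cross of scenarios and contextual variations. A minimal sketch of that grid, assuming hypothetical names and structure (this is illustrative, not the study's actual code):

```python
# Illustrative sketch of the study's evaluation grid: 60 clinician-authored
# scenarios, each presented under 16 contextual variations.
# Variable names are assumptions for illustration only.
from itertools import product

NUM_SCENARIOS = 60    # clinician-authored vignettes spanning 21 specialties
NUM_VARIATIONS = 16   # contextual framings of each vignette

# Each (scenario, variation) pair corresponds to one model interaction.
interactions = list(product(range(NUM_SCENARIOS), range(NUM_VARIATIONS)))
print(len(interactions))  # 60 x 16 = 960 total interactions
```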
Results showed that ChatGPT Health under-triaged about 52% of cases that physicians deemed to require emergency care (e.g., diabetic ketoacidosis or impending respiratory failure), sometimes recommending less urgent follow-up rather than immediate emergency department evaluation.
The system performed better on textbook emergencies like stroke but struggled in nuanced or edge-case scenarios where clinical judgment matters most.
For suicide-related interactions, researchers found that ChatGPT Health triggered crisis resources inconsistently. In some lower-risk scenarios it issued alerts, while in some higher-risk situations (including explicit plans for self-harm) it failed to direct users toward crisis support like the 988 Suicide & Crisis Lifeline.
The core cause lies in limitations of large language model reasoning in high-stakes health contexts. ChatGPT Health may recognise severe symptoms in its own explanations but still default to reassuring language or ambiguous recommendations instead of clear emergency guidance, indicating a gap between pattern recognition and clinical judgment.
Structural issues include anchoring bias (where symptom minimisation by friends or family in prompts leads the model to recommend less urgent care) and inconsistent activation of safety protocols.
The inconsistency in suicide-alert triggers suggests that the model's safety layers are not reliably calibrated to severity, sometimes prioritising generic crisis advice over risk-proportionate responses.
This reflects broader transparency and accountability gaps in deploying LLMs for clinical decision-making without extensive external validation. There is no real-world feedback loop or mandatory external auditing before mass deployment.
For users: There is a severe risk of undertriage, where individuals may delay life-saving treatment for strokes, heart attacks, or mental health crises because an AI "validated" their decision to stay home.
For society: The "digital companion" nature of AI creates a false sense of security. As people become emotionally reliant on these tools, the impact of a single "hallucination" or failure becomes potentially fatal.
For policymakers: The study highlights the need for stricter regulation of AI as a medical device, with the ECRI and others calling for mandatory independent audits and "nutritional labels" for AI health advice.
ChatGPT Health
Developer: OpenAI
Country: USA
Sector: Health
Purpose: Provide acute medical guidance
Technology: Generative AI
Issue: Accountability; Accuracy/reliability; Safety; Transparency
November 2022: ChatGPT is released to the public, quickly becoming a source for health queries.
2024โ2025: Reports of "AI-mediated suicides" begin to emerge, involving chatbots providing methods or encouragement to vulnerable users.
October 2025: OpenAI estimates that over 1 million users per week express suicidal intent in chats. The company releases GPT-5 with "improved" safety benchmarks.
January 2026: OpenAI launches ChatGPT Health. The ECRI research non-profit names AI chatbot misuse the top health technology hazard of the year.
February 23, 2026: The Mount Sinai study is published.
AIAAIC Repository ID: AIAAIC2222