The Business & Technology Network
Helping Business Interpret and Use Technology

ChatGPT Health fails to spot 52% of medical emergencies in study

DATE POSTED: February 25, 2026
A study published in Nature Medicine on February 24 found that ChatGPT Health failed to direct users to emergency care in more than half of serious medical cases. Researchers at the Icahn School of Medicine at Mount Sinai conducted the evaluation, testing the consumer-facing tool across 960 interactions. The study highlights potential safety concerns regarding AI-powered triage as millions of users increasingly rely on chatbots for health guidance.

The research team designed 60 clinical scenarios spanning 21 medical specialties. These cases ranged from minor conditions suitable for home care to genuine emergencies. Three independent physicians established the correct level of urgency for each scenario, using guidelines from 56 medical societies; this consensus approach ensured a standardized benchmark for evaluating the AI's performance. Each scenario was then tested under 16 different contextual conditions, including variations in race, gender, social dynamics, and barriers to care such as lack of insurance. This methodology produced a total of 960 interactions with ChatGPT Health.

The results revealed what the researchers described as an “inverted U-shaped” pattern of performance. ChatGPT Health handled textbook emergencies like stroke and anaphylaxis correctly. However, the tool under-triaged 52 percent of cases that physicians deemed true emergencies. For conditions such as diabetic ketoacidosis and impending respiratory failure, the AI directed patients toward a 24-to-48-hour evaluation instead of recommending immediate emergency department care. Additionally, the system misclassified 35 percent of non-urgent cases.

A significant finding concerned the tool's susceptibility to anchoring bias. When family members or friends minimized symptoms within the prompts, triage recommendations shifted markedly toward less urgent care; the study quantified this influence with an odds ratio of 11.7. Dr. Ashwin Ramaswamy, one of the study's corresponding authors, commented on the specific limitations observed. "ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions," Ramaswamy said. "But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most."

The study also exposed inconsistencies in the tool’s crisis intervention system. ChatGPT Health is designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations. Researchers found that these alerts appeared more reliably when users described no specific method of self-harm than when they articulated a concrete plan. This observation effectively inverted the relationship between risk level and safeguard activation. Dr. Girish Nadkarni, Mount Sinai’s Chief AI Officer and the study’s other corresponding author, described the finding as going “beyond inconsistency.” Nadkarni noted that “the system’s alerts were inverted relative to clinical risk.”

The study’s publication coincides with rapid consumer adoption of AI health tools. OpenAI launched ChatGPT Health in January 2026. The company reported that roughly 40 million people were using ChatGPT daily for health-related questions. Earlier in 2026, the nonprofit patient safety organization ECRI ranked misuse of AI chatbots in healthcare as the top health technology hazard. ECRI warned that these tools “can provide false or misleading information that could result in significant patient harm.”

The Mount Sinai team analyzed the influence of demographic and socioeconomic factors on triage outcomes. The data showed no statistically detectable effects from patient race, gender, or barriers to care. However, the study’s confidence intervals did not rule out the possibility of clinically meaningful differences. The researchers indicated plans to continue evaluating updated versions of ChatGPT Health and other consumer AI tools. Future research will expand into pediatric care, medication safety, and non-English-language use.
