Many women are using AI for health information, but the answers aren’t always up to scratch Oscar Wong/Getty Images
Commonly used AI models fail to accurately diagnose or offer advice for many queries relating to women’s health that require urgent attention.
A group of 17 women’s health researchers, pharmacists and clinicians from the US and Europe drew up an initial list of 345 medical queries across five areas, including emergency medicine, gynaecology and neurology. These experts then reviewed the answers provided by a randomly chosen AI model for each question. The questions that drew inaccurate responses were collated into a 96-query benchmark of AI models’ medical expertise.
This test was then used to assess 13 large language models, produced by the likes of OpenAI, Google, Anthropic, Mistral AI and xAI. Across all the models, some 60 per cent of questions were answered in a way the human experts had previously judged insufficient for medical advice. GPT-5 performed best, failing on 47 per cent of queries, while Ministral 8B had the highest failure rate, at 73 per cent.
“I saw more and more women in my own circle turning to AI tools for health questions and decision support,” says Gruber, a team member at Lumos AI, a firm that helps companies evaluate and improve their own AI models. She and her colleagues recognised the risks of relying on a technology that inherits and amplifies existing gender gaps in medical knowledge. “That is what motivated us to build a first benchmark in this field,” she says.
The rate of failure surprised Gruber. “We expected some gaps, but what stood out was the degree of variation across models,” she says.
The findings are unsurprising because of the way AI models are trained, based on human-generated historical data that has built-in biases, says a researcher at the University of Montreal, Canada. She points to “a clear need for online health sources, as well as healthcare professional societies, to update their web content with more explicit sex and gender-related evidence-based information that AI can use to more accurately support women’s health”, she says.
Chen, at Stanford University in California, says the 60 per cent failure rate touted by the researchers behind the analysis is somewhat misleading. “I wouldn’t hang on the 60 per cent number, since it was a limited and expert-designed sample,” he says. “[It] wasn’t designed to be a broad sample or representative of what patients or doctors regularly would ask.”
Chen also points out that some of the scenarios the benchmark tests for are overly conservative, with high potential failure rates. For example, it counts an AI model as failing if the model doesn’t immediately suspect pre-eclampsia when a postpartum woman reports a headache.
Gruber acknowledges those criticisms. “Our goal was not to claim that models are broadly unsafe, but to define a clear, clinically grounded standard for evaluation,” she says. “The benchmark is intentionally conservative and on the stricter side in how it defines failures, because in healthcare, even seemingly minor omissions can matter depending on context.”
A spokesperson for OpenAI said: “ChatGPT is designed to support, not replace, medical care. We work closely with clinicians around the world to improve our models and run ongoing evaluations to reduce harmful or misleading responses. Our latest GPT-5.2 model is our strongest yet at considering important user context such as gender. We take the accuracy of model outputs seriously and while ChatGPT can provide helpful information, users should always rely on qualified clinicians for care and treatment decisions.” The other companies whose AIs were tested did not respond to New Scientist’s request for comment.
Reference: arXiv