Recent research has revealed some surprising truths about artificial intelligence in medicine. While many large language models (LLMs) have aced standardized medical exams, a new study published in JAMA Network Open raises doubts about their actual reasoning abilities.
The study, led by Suhana Bedi, a PhD student at Stanford University, found that these AI systems often do not truly reason through clinical questions. Instead, they rely on familiar patterns in the questions and answers, leading to inaccuracies when those patterns are slightly adjusted. For instance, when the correct response was changed to “None of the other answers,” some AI models showed accuracy drops of over 30%.
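The perturbation the researchers describe can be illustrated with a small sketch. This is not the study's actual code; the question format, the stub "model," and all function names here are hypothetical, chosen only to show how swapping the correct option for "None of the other answers" can expose pattern matching:

```python
# Hypothetical sketch of the benchmark perturbation described above:
# replace the text of the correct option with "None of the other answers"
# and re-score the model. Not the study's actual methodology or code.

NOTA = "None of the other answers"

def perturb(question):
    """Return a copy of the MCQ whose correct option's text is replaced
    by 'None of the other answers' (which remains the correct choice)."""
    options = dict(question["options"])
    options[question["answer"]] = NOTA
    return {"stem": question["stem"], "options": options,
            "answer": question["answer"]}

def accuracy(model, questions):
    """Fraction of questions the model answers correctly."""
    return sum(model(q) == q["answer"] for q in questions) / len(questions)

# Toy "pattern-matching" model: it picks the option whose text matches a
# memorized keyword instead of reasoning from the question stem.
def keyword_model(q):
    for letter, text in q["options"].items():
        if "aspirin" in text.lower():
            return letter
    return "C"  # no memorized pattern found: falls back to a guess

questions = [{
    "stem": "Best initial therapy for suspected acute MI?",
    "options": {"A": "Aspirin", "B": "Antibiotics", "C": "Insulin"},
    "answer": "A",
}]

base = accuracy(keyword_model, questions)
shifted = accuracy(keyword_model, [perturb(q) for q in questions])
print(f"accuracy drop: {base - shifted:.0%}")  # prints "accuracy drop: 100%"
```

The toy model scores perfectly on the original question but fails once the familiar answer text disappears, which is the signature of pattern recognition rather than clinical reasoning.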
Bedi points out, “High scores on benchmarks don’t reflect real clinical practice.” She notes that less than 5% of research has tested LLMs on real patient data, which is often messy and complex. To address this, the researchers created a new benchmark of 35 medical tasks verified by 30 clinicians. They wanted to see if LLMs could handle real-life scenarios that require deeper reasoning.
Among the tested models were some well-known names, like GPT-4o and Claude 3.5 Sonnet. The results were alarming. When familiar answer formats were altered, all models suffered significant declines in performance. Some dropped from 80% accuracy to just 42%. This suggests that while they may perform well on practice tests, these models struggle with unexpected variations.
Bedi emphasizes the importance of this research: “These AI systems aren’t as reliable as their test scores suggest.” If AI struggles with minor changes in questions, it may not handle the complexities of real patients who present with overlapping symptoms or unexpected complications.
In the world of medicine, we need AI that can truly assist healthcare providers, not just mimic answers. Bedi and her team advocate for evaluation tools that distinguish genuine reasoning from pattern recognition, and for models that prioritize true reasoning capabilities.
In summary, while AI in medicine shows promise, the findings urge caution. The road ahead requires building systems that can navigate the unpredictable nature of healthcare. For now, AI should be seen as a supplemental tool for doctors, rather than a replacement.
The ongoing research seeks innovative approaches to ensure AI can meet the real-world challenges of medicine and provide reliable support to healthcare providers in making critical decisions.

