The rapid integration of Large Language Models (LLMs) into the global tech ecosystem has sparked a persistent debate over when, not if, artificial intelligence will begin diagnosing patients independently. However, a comprehensive study involving 21 of the world’s leading models—including OpenAI’s ChatGPT, Google’s Gemini, and China’s DeepSeek—suggests that the age of the autonomous AI physician remains a distant prospect. Researchers have found that while AI is remarkably adept at identifying a condition when presented with a complete set of facts, it falters significantly in the nuanced process of clinical reasoning.
Published in the journal JAMA Network Open, the study, led by the MESH Incubator at Massachusetts General Hospital, simulated real-world medical consultations using 29 distinct clinical cases. By providing information in stages, starting with basic symptoms and gradually adding laboratory results and imaging data, researchers observed how models such as Claude and Grok updated their hypotheses. The results revealed a striking paradox: given a complete dossier, the models reached the correct final diagnosis in over 90% of cases, yet they failed to demonstrate the logical, step-by-step reasoning required to safely manage a patient from start to finish.
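The paper does not publish its prompting harness, but the staged design it describes can be illustrated with a short sketch. In the Python below, the stage names, the case dictionary, and the query_model callback are all assumptions introduced for illustration, not the study's actual code; the point is simply that each model is re-queried as new information arrives, so its intermediate hypotheses can be scored alongside its final answer.

```python
# A minimal sketch of a staged-disclosure evaluation loop,
# assuming a hypothetical case format and model interface.

STAGES = ["presenting_symptoms", "history_and_exam",
          "laboratory_results", "imaging_findings"]

def evaluate_case(case, query_model):
    """Reveal one clinical case to a model stage by stage,
    recording the leading diagnosis after each disclosure."""
    transcript = []   # information revealed so far
    hypotheses = []   # the model's evolving answers
    for stage in STAGES:
        transcript.append(case[stage])
        prompt = (
            "You are assisting with a clinical case. Based only on "
            "the information below, state your current leading "
            "diagnosis and briefly justify it.\n\n"
            + "\n".join(transcript)
        )
        hypotheses.append(query_model(prompt))
    # Keep the intermediate steps as well as the endpoint, so the
    # reasoning path itself can be scored, not just the final answer.
    return {
        "final_correct": case["diagnosis"].lower() in hypotheses[-1].lower(),
        "trajectory": hypotheses,
    }
```

Scoring the trajectory rather than only the final answer is what separates testing recall from testing reasoning, which is the distinction the study draws.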
This discrepancy highlights a fundamental weakness in the current generation of generative AI. Clinical medicine is rarely a straight line from symptom to cure; it is an iterative process of ruling out possibilities and reassessing risks as new information arrives. The study suggests that while LLMs possess vast medical knowledge, they lack the 'judgment' needed to handle the ambiguity of early-stage diagnostics, where information is often incomplete or misleading. This makes them powerful reference tools for human doctors but dangerous if left to operate without professional oversight.
For the tech industry, particularly in China, where companies like DeepSeek are racing to apply AI to an overburdened healthcare system, these findings serve as a sobering reality check. The ability to pass a medical licensing exam or solve a textbook case does not equate to clinical competence. As the sector moves toward 'AI+Healthcare' integration, the focus is likely to shift from broad diagnostic accuracy to the interpretability and reliability of the internal logic these models use to reach their conclusions.
