PV25 Schedule of Events

Beyond Black-Box AI: Assessment of Pathology-Specific Reasoning in Generative Models by Pathologists

   Mon, Oct 6
   01:35PM - 01:55PM ET

Background: Diagnostic pathology relies on structured, multifaceted reasoning to interpret diverse clinical, histologic, and molecular data, a cognitive process that remains challenging to replicate algorithmically. The potential integration of large language models (LLMs) into medical domains such as pathology raises critical questions about their ability to perform clinically meaningful reasoning.

Methods: In this study, we evaluated the performance of four advanced LLMs, ChatGPT-o1, ChatGPT-o3-mini, Gemini, and DeepSeek, on pathology board-style questions. Responses were independently assessed by expert pathologists on both reasoning quality (accuracy, clinical relevance, coherence, depth, and conciseness) and the diagnostic reasoning strategies commonly used in pathology practice. We conducted statistical analyses, including ANOVA and Tukey's HSD tests, to compare model performance across evaluation dimensions.

Results: Gemini and DeepSeek demonstrated significantly superior overall reasoning quality compared to ChatGPT-o1 and ChatGPT-o3-mini (p < 0.05), excelling particularly in analytical depth and coherence. Gemini achieved the highest reasoning-content scores for Relevance (571), Coherence (548), Depth (542), Accuracy (546), and Conciseness (448), each out of a possible 600. This dominance extended across all assessed reasoning types, including Algorithmic Reasoning (501), Deductive Reasoning (447), Inductive Hypothetico-Deductive Reasoning (431), Mechanistic Insights (418), Probabilistic/Bayesian Reasoning (409), Pattern Recognition (403), and Heuristic Reasoning (377). DeepSeek consistently ranked second, while the ChatGPT models exhibited the lowest performance across all criteria. Notably, while all models showed comparable accuracy, only Gemini and DeepSeek consistently employed expert-like diagnostic reasoning strategies, such as inductive, algorithmic, and Bayesian approaches. Performance varied by pathology subfield and reasoning type, with models generally performing better in algorithmic and deductive reasoning and struggling with heuristic reasoning and pattern recognition. Furthermore, the models providing more in-depth explanations (Gemini and DeepSeek) tended to be less concise. These limitations in model performance, coupled with inter-rater variability in expert scoring, underscore the need for further validation.

Conclusion: These findings demonstrate that newer LLMs can approximate certain aspects of expert-level pathology reasoning, but limitations remain, particularly in contextual reasoning and in the trade-off between analytical depth and conciseness. Addressing these challenges will be essential for the responsible development and clinical integration of AI tools in pathology.
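
The abstract does not include the analysis code; the sketch below illustrates how the reported comparison (a one-way ANOVA followed by Tukey's HSD across the four models) might be run in Python. The file name and column names (model, criterion, score) are illustrative assumptions, not the study's actual dataset.

# Minimal sketch: one-way ANOVA plus Tukey's HSD for one scoring dimension.
# Data layout is assumed: one row per (question, rater, model) with a rubric score.
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = pd.read_csv("reasoning_scores.csv")              # hypothetical file
dimension = scores[scores["criterion"] == "coherence"]    # e.g., one of the five quality criteria

# One-way ANOVA: do mean scores differ across the four models?
groups = [g["score"].to_numpy() for _, g in dimension.groupby("model")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# If the omnibus test is significant, Tukey's HSD identifies which model pairs differ.
if p_value < 0.05:
    tukey = pairwise_tukeyhsd(endog=dimension["score"], groups=dimension["model"], alpha=0.05)
    print(tukey.summary())

The same procedure would be repeated for each evaluation dimension and reasoning type reported in the abstract.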

 

Learning Objectives: 

  1. Recognise the potential of large language models (LLMs) to apply expert-level diagnostic reasoning in pathology, and appreciate how their performance varies across reasoning types, with strengths in algorithmic and deductive reasoning and notable challenges in heuristic reasoning and pattern recognition.
  2. Understand which LLMs apply expert-like diagnostic strategies more consistently, and recognise the trade-off that exists between analytical depth and conciseness, which may affect the practical utility of these models in clinical settings.
  3. Identify opportunities for further research on enhancing contextual reasoning and addressing current limitations to ensure the responsible and effective use of AI in diagnostic pathology.

 

