<?xml version="1.0" encoding="iso-8859-1" standalone="no"?>
<!DOCTYPE GmsArticle SYSTEM "http://www.egms.de/dtd/2.0.34/GmsArticle.dtd">
<GmsArticle xmlns:xlink="http://www.w3.org/1999/xlink">
  <MetaData>
    <Identifier>mibe000303</Identifier>
    <IdentifierDoi>10.3205/mibe000303</IdentifierDoi>
    <IdentifierUrn>urn:nbn:de:0183-mibe0003033</IdentifierUrn>
    <ArticleType>Research Article</ArticleType>
    <TitleGroup>
      <Title language="en">Comparative analysis between the performance of ChatGPT and medical students outside their examination phase</Title>
      <TitleTranslated language="de">Vergleichende Analyse der Leistungen von ChatGPT und Medizinstudenten au&#223;erhalb ihrer Pr&#252;fungsphase</TitleTranslated>
    </TitleGroup>
    <CreatorList>
      <Creator>
        <PersonNames>
          <Lastname>Leufkens</Lastname>
          <LastnameHeading>Leufkens</LastnameHeading>
          <Firstname>Daniel</Firstname>
          <Initials>D</Initials>
        </PersonNames>
        <Address>
          <Affiliation>Institute of Medical Informatics, Department of Medicine, Justus Liebig University of Giessen, Germany</Affiliation>
        </Address>
        <Creatorrole corresponding="no" presenting="no">author</Creatorrole>
      </Creator>
      <Creator>
        <PersonNames>
          <Lastname>Pons-K&#252;hnemann</Lastname>
          <LastnameHeading>Pons-K&#252;hnemann</LastnameHeading>
          <Firstname>J&#246;rn</Firstname>
          <Initials>J</Initials>
        </PersonNames>
        <Address>
          <Affiliation>Institute of Medical Informatics, Department of Medicine, Justus Liebig University of Giessen, Germany</Affiliation>
        </Address>
        <Creatorrole corresponding="no" presenting="no">author</Creatorrole>
      </Creator>
      <Creator>
        <PersonNames>
          <Lastname>Schneider</Lastname>
          <LastnameHeading>Schneider</LastnameHeading>
          <Firstname>Henning</Firstname>
          <Initials>H</Initials>
        </PersonNames>
        <Address>
          <Affiliation>Institute of Medical Informatics, Department of Medicine, Justus Liebig University of Giessen, Germany</Affiliation>
        </Address>
        <Creatorrole corresponding="no" presenting="no">author</Creatorrole>
      </Creator>
      <Creator>
        <PersonNames>
          <Lastname>Windhorst</Lastname>
          <LastnameHeading>Windhorst</LastnameHeading>
          <Firstname>Anita C.</Firstname>
          <Initials>AC</Initials>
          <AcademicTitle>Dr.</AcademicTitle>
        </PersonNames>
        <Address>Institut f&#252;r Medizinische Informatik, Abt. Medizinische Statistik, Rudolf-Buchheim-Str. 6, 35392 Gie&#223;en, Germany, Phone: &#43;49 641 99 41366<Affiliation>Institute of Medical Informatics, Department of Medicine, Justus Liebig University of Giessen, Germany</Affiliation></Address>
        <Email>anita.c.windhorst&#64;informatik.med.uni-giessen.de</Email>
        <Creatorrole corresponding="yes" presenting="no">author</Creatorrole>
      </Creator>
    </CreatorList>
    <PublisherList>
      <Publisher>
        <Corporation>
          <Corporatename>German Medical Science GMS Publishing House</Corporatename>
        </Corporation>
        <Address>D&#252;sseldorf</Address>
      </Publisher>
    </PublisherList>
    <SubjectGroup>
      <SubjectheadingDDB>610</SubjectheadingDDB>
      <Keyword language="en">artificial intelligence</Keyword>
      <Keyword language="en">AI</Keyword>
      <Keyword language="en">large language models</Keyword>
      <Keyword language="en">LLMs</Keyword>
      <Keyword language="en">ChatGPT</Keyword>
      <Keyword language="en">performance on medical examination</Keyword>
      <Keyword language="en">comparative analysis</Keyword>
      <Keyword language="en">experiment</Keyword>
      <Keyword language="de">k&#252;nstliche Intelligenz</Keyword>
      <Keyword language="de">KI</Keyword>
      <Keyword language="de">Large Language Models</Keyword>
      <Keyword language="de">LLMs</Keyword>
      <Keyword language="de">ChatGPT</Keyword>
      <Keyword language="de">Leistung in medizinischen Pr&#252;fungen</Keyword>
      <Keyword language="de">vergleichende Analyse</Keyword>
      <Keyword language="de">Experiment</Keyword>
    </SubjectGroup>
    <DatePublishedList>
      <DatePublished>20260325</DatePublished>
    </DatePublishedList>
    <Language>engl</Language>
    <License license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
      <AltText language="en">This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.</AltText>
      <AltText language="de">Dieser Artikel ist ein Open-Access-Artikel und steht unter den Lizenzbedingungen der Creative Commons Attribution 4.0 License (Namensnennung).</AltText>
    </License>
    <SourceGroup>
      <Journal>
        <ISSN>1860-9171</ISSN>
        <Volume>22</Volume>
        <JournalTitle>GMS Medizinische Informatik, Biometrie und Epidemiologie</JournalTitle>
        <JournalTitleAbbr>GMS Med Inform Biom Epidemiol</JournalTitleAbbr>
      </Journal>
    </SourceGroup>
    <ArticleNo>05</ArticleNo>
  </MetaData>
  <OrigData>
    <Abstract language="de" linked="yes"><Pgraph><Mark1>Einleitung:</Mark1> Die staatlichen &#228;rztlichen Zulassungspr&#252;fungen in Deutschland sind sehr anspruchsvoll und erfordern eine umfangreiche Vorbereitung. W&#228;hrend medizinisches Fachwissen nach wie vor eine zentrale Rolle bei der Approbation spielt, suchen Patienten zunehmend Rat bei Modellen der k&#252;nstlichen Intelligenz (KI) wie ChatGPT. Diese Entwicklung wirft die Frage auf, ob KI in der Lage ist, grundlegendes medizinisches Wissen korrekt zu vermitteln. Fr&#252;here internationale Studien (z.B. in China, den USA, Polen und dem Vereinigten K&#246;nigreich) haben die Leistung von KI mit der von zertifizierten Fachleuten verglichen, h&#228;ufig durch indirekte Vergleiche anhand historischer Pr&#252;fungsdurchschnitte. Es gibt jedoch keine Studien, in denen die Leistungen von KI direkt mit denen von Medizinstudenten verglichen wurden, die au&#223;erhalb ihrer Pr&#252;fungsvorbereitungsphase beurteilt wurden, was einen Einblick in das Abrufen von Wissen geben k&#246;nnte. Ziel dieser Studie ist es, die Leistung von gro&#223;en Sprachmodellen (LLMs) direkt mit der von Medizinstudenten au&#223;erhalb ihrer Pr&#252;fungsphase zu vergleichen.</Pgraph><Pgraph><Mark1>Methoden:</Mark1> An einer deutschen medizinischen Hochschule wurde eine anonymisierte Umfrage unter Studierenden in der klinischen Phase ihres Studiums durchgef&#252;hrt (in der Regel 170 bis 180 Studierende). Die Teilnehmer beantworteten 10 Single-Choice-Fragen, die nach dem Zufallsprinzip aus einem vorgefilterten Pool von Aufgaben aus dem <TextGroup><PlainText>1. A</PlainText></TextGroup>bschnitt der in Deutschland zentral durchgef&#252;hrten &#196;rztlichen Pr&#252;fung (M1) ausgew&#228;hlt wurden. Die Fragen wurden nach klinischer Relevanz, mittlerem Schwierigkeitsgrad und Ausschluss von chemisch-mathematischen Inhalten ausgew&#228;hlt. Die gleichen Fragen wurden von ChatGPT-3.5, ChatGPT-4 und ChatGPT-4 mini beantwortet. 
Die Leistung wurde anhand der Anzahl der richtigen Antworten verglichen. Zus&#228;tzlich wurde der korrigierte Trennsch&#228;rfe-Koeffizient berechnet. </Pgraph><Pgraph><Mark1>Ergebnisse:</Mark1> Von den 143 Teilnehmern (Durchschnittsalter 22) befanden sich 129 im 5. Semester, der Rest in sp&#228;teren Semestern. Ca. 40&#37; der Studierenden identifizierten sich als m&#228;nnlich, 55&#37; als weiblich. Die Studierenden beantworteten im Median 7 von 10 Fragen richtig (Spanne: 1&#8211;10). Alle KI-Modelle beantworteten 9 von 10 Fragen richtig. Die einzige Frage, die von der KI falsch beantwortet wurde, wurde von 35&#37; der Studierenden (50&#47;143) richtig beantwortet und wies den zweith&#246;chsten Trennsch&#228;rfe-Koeffizienten (0,28) auf, was darauf hindeutet, dass sie die Leistungen der Studierenden effektiv differenziert.</Pgraph><Pgraph><Mark1>Diskussion:</Mark1> Die LLMs schnitten besser ab als Medizinstudenten. Zu den Einschr&#228;nkungen geh&#246;ren jedoch die eher geringe Stichprobengr&#246;&#223;e und die Vorauswahl der Fragen nach spezifischen Kriterien (klinische Relevanz, mittlerer Schwierigkeitsgrad, Ausschluss chemisch-mathematischer Inhalte), was einen Selektionsbias einf&#252;hrt und die Repr&#228;sentativit&#228;t unserer Befunde f&#252;r die gesamte Pr&#252;fung einschr&#228;nkt. Wichtig ist, dass die durch die LLMs falsch beantwortete Frage eine, verglichen mit den anderen Fragen, hohe Trennsch&#228;rfe aufwies, was auf eine m&#246;gliche L&#252;cke im KI-Verst&#228;ndnis f&#252;r nuancierte oder komplexe Inhalte hinweist. Diese Ergebnisse deuten darauf hin, dass KI-Modelle nicht uneingeschr&#228;nkt einsetzbar sind und durch menschliche Aufsicht erg&#228;nzt werden sollten, insbesondere in anspruchsvollen oder mehrdeu<TextGroup><PlainText>t</PlainText></TextGroup>igen klinischen Kontexten.</Pgraph><Pgraph><Mark1>Schlussfolgerung:</Mark1> KI-Modelle &#252;bertreffen Studierende au&#223;erhalb der aktiven Pr&#252;fungsvorbereitung. 
Ihre gelegentlichen Fehler, insbesondere bei Fragen mit hoher Trennsch&#228;rfe, machen jedoch deutlich, dass Vorsicht geboten ist. Weitere Forschung ist notwendig, um den Nutzen von KI in der realen medizinischen Ausbildung und im Studium zu bewerten.</Pgraph></Abstract>
    <Abstract language="en" linked="yes"><Pgraph><Mark1>Introduction:</Mark1> State-run medical licensing examinations in Germany are highly demanding, requiring extensive preparation. While medical expertise remains central to licensure, patients increasingly seek advice from artificial intelligence (AI) models like ChatGPT. This shift raises the question of whether AI can accurately convey essential medical knowledge. Previous international studies (e.g., China, US, Poland, UK) have compared AI performance with that of certified professionals, often through indirect comparisons using historical exam averages. However, no studies have directly compared AI to medical students assessed outside their exam preparation phase, which could provide insight into knowledge retention. This study aims to directly compare the performance of large language models (LLMs) with that of medical students beyond their examination phase.</Pgraph><Pgraph><Mark1>Methods:</Mark1> An anonymized survey was conducted at a German medical school among students in the clinical stage of their studies (typically 170 to 180 students). Participants answered 10 single-choice questions randomly selected from a pre-filtered pool derived from past German preclinical medical exam (M1) items. Questions were selected based on clinical relevance, moderate difficulty, and exclusion of chemical&#47;mathematical content. The same questions were answered by ChatGPT-3.5, ChatGPT-4, and ChatGPT-4 mini. Performance was compared in terms of correct responses. Additionally, the corrected discrimination coefficient was calculated for each item, measuring how well each question differentiated between higher and lower performers.</Pgraph><Pgraph><Mark1>Results:</Mark1> Of the 143 participants (median age 22), 129 were in the <TextGroup><PlainText>5</PlainText><Superscript>th</Superscript><PlainText> s</PlainText></TextGroup>emester, and the rest were in later semesters. 
About 40&#37; identified as male and 55&#37; as female. Students answered a median of 7 out of 10 questions correctly (range: 1&#8211;10). All AI models answered 9 out of 10 questions correctly. The only question missed by AI was answered correctly by 35&#37; of students (50&#47;143) and had the second-highest discrimination coefficient (0.28), indicating it effectively differentiated student performance.</Pgraph><Pgraph><Mark1>Discussion:</Mark1> LLMs outperformed medical students who were beyond their exam preparation phase. However, limitations include the modest sample size and preselection of questions based on specific criteria (clinical relevance, moderate difficulty, exclusion of chemical&#47;mathematical content), which introduces selection bias and limits the representativeness of our findings for the full examination. Importantly, the AI&#8217;s one incorrect answer had the second highest, although still marginal discrimination coefficient, highlighting a possible gap in AI understanding for nuanced or complex content. These findings suggest that AI models are not without limitations and should be supplemented by human oversight, particularly in high-stakes or ambiguous clinical contexts.</Pgraph><Pgraph><Mark1>Conclusion:</Mark1> AI models demonstrate strong performance in answering single choice questions, surpassing students outside active exam preparation. However, their occasional errors, especially on discriminative questions, underline the need for caution. Further research is necessary to evaluate AI utility in real-world medical education and clinical decision-making, ensuring ethical and responsible integration.</Pgraph></Abstract>
    <TextBlock name="Introduction" linked="yes">
      <MainHeadline>Introduction</MainHeadline><Pgraph>State licensing examinations in medicine are highly demanding both in terms of subject matter and content, requiring months of preparation from candidates. Despite the rigorous medical expertise required for medical licensure maintaining its excellent reputation, patients are increasingly seeking medical advice from artificial intelligence (AI) with large language models such as ChatGPT. This paradigm shift raises fundamental questions about whether AI systems are capable of conveying necessary medical knowledge appropriately, both in terms of subject matter and content.</Pgraph><Pgraph>Consequently, several studies have compared results achieved by AI systems with those of certified medical examiners across various countries. Research has been conducted in China <TextLink reference="1"></TextLink>, the United States <TextLink reference="2"></TextLink>, <TextLink reference="3"></TextLink>, Poland <TextLink reference="4"></TextLink>, and the United Kingdom <TextLink reference="5"></TextLink>. Multiple systematic reviews have synthesized these findings, with Brin et al. <TextLink reference="3"></TextLink> reporting 80-90&#37; accuracy for large language models on medical examinations, Liu et al. <TextLink reference="6"></TextLink> analyzing ChatGPT performance across different versions worldwide, and Jin et al. <TextLink reference="7"></TextLink> demonstrating an overall effect size of 70.1&#37;, with 69.1&#37; only in the field of medicine, in their meta-analysis. These studies typically conduct indirect comparisons with examination averages from corresponding examination years. However, direct comparison between AI performance and medical students assessed with temporal distance from their examination phase has not yet been investigated.</Pgraph><Pgraph>Recent studies have extended beyond basic licensing examinations to examine AI performance in specialized medical domains. Longwell et al. 
<TextLink reference="8"></TextLink> evaluated large language models on medical oncology examination questions, finding 85&#37; accuracy, though they noted that 81.8&#37; of incorrect answers could lead to patient harm. Similarly, Tarabanis et al. <TextLink reference="9"></TextLink> tested publicly available large language models on internal medicine board-style questions. These specialized assessments highlight the complexity of medical knowledge evaluation and the potential risks associated with AI-generated medical advice. While AI systems demonstrate consistent performance patterns without temporal degradation, human learners experience natural forgetting curves that affect knowledge retention over time <TextLink reference="10"></TextLink>. E.g., in medical students, knowledge retention rates drop to around 53&#37; to 70&#37; in the area of physiology in the span of an average of 16 weeks <TextLink reference="11"></TextLink>. Knowledge retention significantly depends on the learning method (e.g., active retrieval practice vs. passive review) <TextLink reference="12"></TextLink>, <TextLink reference="13"></TextLink>, the practical application of learned material in clinical or simulation contexts <TextLink reference="14"></TextLink>, the number of <TextGroup><PlainText>repetitions</PlainText></TextGroup> and testing frequency including overlearning <TextLink reference="15"></TextLink>, <TextLink reference="16"></TextLink>, and the spacing or timing between learning episodes and examinations (spacing effect) <TextLink reference="13"></TextLink>, <TextLink reference="17"></TextLink>. Medical students who have completed their state examinations may demonstrate different performance patterns when assessed outside their active preparation phase, potentially providing insights into the practical implications of knowledge retention in medical practice. 
This distinction is particularly relevant when considering the real-world application of medical knowledge, where practitioners must recall information learned during their training years later in clinical practice.</Pgraph><Pgraph>Current research approaches have focused primarily on testing the maximum learning level of medical students during their examination preparation phase, rather than examining their later recall of previously learned material in comparison with AI systems. This methodological limitation restricts our understanding of how AI performance compares to the practical reality of medical knowledge application in clinical settings. Importantly, according to Miller&#8217;s framework <TextLink reference="18"></TextLink> of clinical competence, factual knowledge (&#8220;knows&#47;knows how&#8221;) represents only the foundational levels of competence, whereas clinical decision-making requires higher levels of performance in simulated or real contexts (&#8220;shows how&#47;does&#8221;) <TextLink reference="18"></TextLink>, <TextLink reference="19"></TextLink>. Thus, the present study specifically targets the comparison of pre-clinical factual knowledge rather than clinical reasoning or decision-making skills.</Pgraph><Pgraph>Therefore, the aim of this study is to establish a direct comparison of pre-clinical knowledge levels between AI systems and medical students assessed outside their examination phase, using questions familiar to students from their previous state examination experience. This approach should provide a more nuanced understanding of AI capabilities in pre-clinical knowledge compared to traditional indirect comparisons with historical examination averages.</Pgraph></TextBlock>
    <TextBlock name="Methods" linked="yes">
      <MainHeadline>Methods</MainHeadline><Pgraph>An anonymized survey was conducted among medical students at the medical faculty of the Justus-Liebig-Uni<TextGroup><PlainText>v</PlainText></TextGroup>ersity Gie&#223;en who were already in the clinical stage of their studies. In a typical semester around 170&#8211;18<TextGroup><PlainText>0 s</PlainText></TextGroup>tudents are enrolled. The survey required medical students to answer a random selection of single-choice examination questions familiar to them from their own state examination from the previous year in 2024 (first section of the German medical examination &#8220;M1&#8221;). Questions were selected by the authors from a pre-filtered pool based on clinical relevance (reference to diagnostics: Q01, 10, case studies: Q03, 07, 09 or diseases: Q02, 04, 05, 06, 08), moderate difficulty (according to Institut f&#252;r medizinische und pharmazeutische Pr&#252;fungsfragen (IMPP) <TextLink reference="20"></TextLink>, the average correct answer rate for the set of questions Q01&#8211;10 was 66.8&#37; &#91;Min. correct answer rate 27&#37; for Q03 and max. 90&#37; for Q06&#93;), and exclusion of chemical and mathematical content to focus on clinically applicable medical knowledge, which has been taught at least up to the M1 level (see Table 1 <ImgLink imgNo="1" imgType="table" /> for complete question set with translations). Students were surveyed during their clinical phase, creating temporal distance from their active examination preparation period.</Pgraph><Pgraph>The same questions were answered by ChatGPT-3.5, ChatGPT-4, and ChatGPT-4 mini to enable direct performance comparison.</Pgraph><SubHeadline>Statistical analysis</SubHeadline><Pgraph>Performance was evaluated by comparing correct response rates between AI models and medical students. A binomial test was used to determine if students answered randomly, success rate is then at 20&#37; (one in five answers). 
Additionally, the corrected discrimination coefficient <TextLink reference="21"></TextLink> was calculated for each question to measure how effectively each item differentiated between higher and lower-performing students, providing insight into question quality and the nature of knowledge being assessed. The corrected discrimination coefficient is a point-biserial correlation that quantifies the relationship between correctness on a single item and the overall test score. Based on established psychometric standards, we interpret the discrimination coefficients as follows: values r&#8805;0.4 indicate very good discrimination, 0.3&#8804;r&#60;0.4 reasonably good discrimination, 0.2&#8804; r&#60;0.3 marginal discrimination, and r&#60;0.2 poor discrimination. Question 9 achieved a coefficient of r&#61;0.28, placing it in the marginal discrimination category <TextLink reference="22"></TextLink>.</Pgraph><Pgraph>Statistical analysis was performed using R version 4.5.1 <TextLink reference="23"></TextLink>.</Pgraph></TextBlock>
    <TextBlock name="Results" linked="yes">
      <MainHeadline>Results</MainHeadline><SubHeadline>Demographics of student cohort</SubHeadline><Pgraph>A total of 143 students participated in the study with a median age of 22 years (mean &#61; 23.04, SD&#61;5.16, range: 19&#8211;69). The majority of participants were in the <TextGroup><PlainText>5</PlainText><Superscript>th</Superscript><PlainText> s</PlainText></TextGroup>emester (n&#61;129, 90.2&#37;) of the German medical studies system, having completed their first state-run examination at the end of the previous semester. The remaining participants were in later semesters (6<Superscript>th</Superscript>&#8211;23<Superscript>rd</Superscript> semester, mean &#61; 5.30, SD&#61;1.65).</Pgraph><Pgraph>Gender distribution was as follows: 57 participants (39.9&#37;) identified as male, 79 participants (55.2&#37;) as female, 2 participants as diverse, and 5 participants did not specify their gender.</Pgraph><SubHeadline>Overall performance results</SubHeadline><Pgraph>Students achieved a median score of 7 out of 10 questions correctly (mean &#61; 6.77, SD&#61;1.78, range: 1&#8211;10, interquartile range: 6&#8211;8), which is statistically different from randomly selecting one of five possible answers (probability for success: 0.7, 95&#37; CI: 0.35&#8211;0.93, p&#60;0.001). Students in the 5<Superscript>th</Superscript> semester showed a median of 7 of 10 correct answers (mean &#61; 6.83, SD&#61;1.74, range: 1&#8211;10, interquartile interval: 6&#8211;8), and students in higher semesters (n&#61;13, 7 in the 6<Superscript>th</Superscript>, 2 in the 8<Superscript>th</Superscript>, and one in the 7<Superscript>th</Superscript>, 9<Superscript>th</Superscript>, 10<Superscript>th</Superscript>, and 23<Superscript>rd</Superscript> semester) answered a median of 6 out of 10 questions correctly (mean &#61; 6.15, SD&#61;2.15, range: 2&#8211;9, interquartile range: 5&#8211;7). 
Differences in total points were not statistically significant (Wilcoxon rank sum test, p&#61;0.243).</Pgraph><Pgraph>The distribution showed a slight negative skew (&#8211;0.67), indicating that most students performed above the mean, with relatively few scoring very low (Figure 1 <ImgLink imgNo="1" imgType="figure" />). In contrast, all three LLM models (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4 mini) achieved identical performance, answering 9 out of 10 questions correctly (90&#37; accuracy).</Pgraph><SubHeadline>Question difficulty and discrimination analysis</SubHeadline><Pgraph>The analysis of individual questions revealed considerable variation in difficulty and discriminative power (Table 2 <ImgLink imgNo="2" imgType="table" />). The questions ranged from very easy (Q01: 91.6&#37; correct) to very difficult (Q03: 25.9&#37; correct). Notably, the question that all LLMs answered incorrectly (Q09) had moderate difficulty (35&#37; student success rate) and showed the second highest, though still marginal discrimination (0.28), indicating that it was one of the more effective questions at differentiating between higher and lower-performing students.</Pgraph><SubHeadline>Analysis of AI failure case</SubHeadline><Pgraph>The question that all LLMs failed to answer correctly comes from the case-related questions (Q03, 07, 09) and, according to IMPP <TextLink reference="20"></TextLink>, was answered correctly by 36&#37; of M1 participants in 2024. It is important to note that item difficulty and discriminatory power represent distinct psychometric properties. The low proportion of correct responses (36&#37;) reflects difficulty, while the discrimination coefficient of 0.28 reflects adequate discriminatory power, indicating that students with stronger overall performance were more likely to answer this question correctly. Thus, despite being challenging, this item effectively differentiated between performance levels, which is a desirable property for assessment items. 
This question required understanding of complex renal physiology and the relationship between systemic hypertension and renal hemodynamics, particularly the differential effects on cortical versus medullary blood flow:</Pgraph><Pgraph><Mark1>Question Q09 (German):</Mark1> &#8220;Bei einer 53-j&#228;hrigen Patientin wird ein systemarterieller Blutdruck von 150&#47;95 mmHg gemessen. Dieser erh&#246;hte Blutdruck kann zu einer deutlichen Steigerung des Harnminutenvolumens im Vergleich zum Harnminutenvolumen bei normwertigem Blutdruck f&#252;hren. Wodurch wird das Harnminutenvolumen in dieser Situation am ehesten gesteigert&#63;&#8221;</Pgraph><Pgraph>A &#8211; Abfall des kolloidosmotischen Drucks im Vas afferens der Niere<LineBreak></LineBreak>B &#8211; Anstieg der GFR in den kortikalen Glomeruli<LineBreak></LineBreak>C &#8211; gesteigerte Durchblutung des Nierenmarks <TextGroup><PlainText>&#91;correct&#93;</PlainText></TextGroup><LineBreak></LineBreak>D &#8211; verst&#228;rkte Sekretion von Aldosteron ins Blutplasma<LineBreak></LineBreak>E &#8211; verst&#228;rkte Sekretion von Harnstoff im kortikalen Sammelrohr</Pgraph><Pgraph><Mark1>Translation:</Mark1> &#8220;A 53-year-old patient has a systemic arterial blood pressure of 150&#47;95 mmHg measured. This elevated blood pressure can lead to a significant increase in urine minute volume compared to urine minute volume at normal blood pressure. 
How is the urine minute volume most likely increased in this situation&#63;&#8221;</Pgraph><Pgraph>A &#8211; Decrease in colloid osmotic pressure in the afferent arteriole of the kidney<LineBreak></LineBreak>B &#8211; Increase in GFR in the cortical glomeruli<LineBreak></LineBreak>C &#8211; Increased blood flow to the renal medulla &#91;correct&#93;<LineBreak></LineBreak>D &#8211; Increased secretion of aldosterone into the blood plasma<LineBreak></LineBreak>E &#8211; Increased secretion of urea in the cortical collecting duct</Pgraph><Pgraph><UnorderedList><ListItem level="1">AI answer: B &#8211; Anstieg der GFR in den kortikalen Glomeruli (Increase in GFR in cortical glomeruli), 64&#47;143 of students also gave this answer, which is the majority of students for this question </ListItem><ListItem level="1">Correct answer: C &#8211; Gesteigerte Durchblutung des Nierenmarks (Increased blood flow to the renal medulla), student success rate: 50&#47;143 (35.0&#37;)</ListItem></UnorderedList></Pgraph></TextBlock>
    <TextBlock name="Discussion" linked="yes">
      <MainHeadline>Discussion</MainHeadline><SubHeadline>Comparison with existing literature</SubHeadline><Pgraph>The present study contributes to the growing body of literature examining AI performance in medical examinations by providing a direct comparison between AI systems and medical students assessed outside their examination phase. Previous research has established that AI language models can achieve substantial accuracy on medical licensing examinations. Our findings align with recent systematic reviews, where Brin et al. <TextLink reference="3"></TextLink> reported 80-90&#37; accuracy for large language models on medical examinations, and Jin et al. <TextLink reference="7"></TextLink> demonstrated an overall effect size of 70.1&#37; across multiple studies. The 90&#37; median accuracy observed for LLMs in our study falls within the upper range of these reported values.</Pgraph><Pgraph>Country-specific studies have similarly demonstrated strong AI performance across diverse medical examination systems. The research conducted has shown that AI systems can achieve results comparable to or even superior to examination averages <TextLink reference="1"></TextLink>, <TextLink reference="2"></TextLink>, <TextLink reference="4"></TextLink>. However, these studies predominantly rely on indirect comparisons with historical examination data rather than direct contemporaneous assessment. Our study addresses this methodological limitation by providing a direct comparison between AI and medical students, offering a more nuanced understanding of AI&#8217;s capabilities in answering single-choice questions.</Pgraph><SubHeadline>Performance analysis</SubHeadline><Pgraph>The results indicate that LLMs achieved a median accuracy of 90&#37;, outperforming medical students who were assessed after their exam preparation period and scored a mean of 67.7&#37;. Several factors must be considered when interpreting this difference. 
Students assessed outside of their active study period may exhibit reduced performance due to established forgetting curves associated with long-term memory decay <TextLink reference="10"></TextLink>. In contrast, LLMs demonstrate consistent performance patterns without temporal degradation. The time interval of approximately 26 weeks since active preparation likely influenced student performance, as the retention of medical knowledge is known to decline in the absence of repeated reinforcement. The study we conducted took place approximately 26 weeks after the medical examination (October 2024 to April 2025). This interpretation aligns with previous findings; for example, a study reported a decrease in student accuracy from 70.4&#37; to 53.5&#37; in the topic of physiology over a similar duration <TextLink reference="11"></TextLink>.</Pgraph><Pgraph>The single question that challenged all AI models (Q09) had moderate difficulty for students (35&#37; success rate) and although marginal, still the second highest discrimination (0.28), suggesting it tested nuanced understanding rather than factual recall. This finding indicates that the question differentiated relatively well between higher and lower-performing students while simultaneously exposing limitations in AI comprehension of complex physiological concepts.</Pgraph><Pgraph>While students showed variable performance across questions, AI models made a consistent error on the same complex physiology question, indicating potential systematic gaps in understanding rather than random errors typical of human performance.</Pgraph><SubHeadline>Implications for medical education</SubHeadline><Pgraph>The findings have several implications for medical education and clinical practice. AI models could serve as valuable resources for knowledge reinforcement and self-assessment, providing students with immediate feedback and comprehensive coverage of medical topics. 
The discriminative question that AI failed may represent areas where human clinical reasoning surpasses current AI capabilities, highlighting the complementary nature of human and artificial intelligence in medical contexts.</Pgraph><Pgraph>Furthermore, AI performance could help identify questions that effectively differentiate student competency levels, potentially informing assessment design and curriculum development. The consistent AI performance across different model versions suggests a reliable and consistent underlying knowledge base, which could be valuable for standardized educational applications. However, this must be balanced against the risks identified in specialized medical domains. Longwell et al. <TextLink reference="8"></TextLink> found that while AI achieved 85&#37; accuracy on medical oncology examination questions, 81.8&#37; of incorrect answers could lead to patient harm, indicating a significant risk profile for AI-generated medical advice.</Pgraph><Pgraph>The systematic nature of AI errors, as demonstrated by the consistent failure on complex physiology questions, suggests that AI limitations may be predictable and identifiable. This predictability could inform the development of hybrid educational approaches that leverage AI strengths while addressing its systematic weaknesses through human expertise.</Pgraph><SubHeadline>Limitations and future directions</SubHeadline><Pgraph>Several limitations should be acknowledged when interpreting these results. The study included 143 students from a single institution, which may limit the generalizability of findings to other medical schools or educational systems. The restricted set of 10 preselected questions constitutes the main limitation of this study. While items were chosen to represent clinically relevant, moderately difficult pre-clinical content, this purposive sampling strategy introduces selection bias. 
The deliberate exclusion of chemical&#47;mathematical content and the focus on moderate difficulty levels mean that our findings cannot be generalized to the full spectrum of medical knowledge or difficulty ranges assessed in licensing examinations. Consequently, the small number of questions prevents a systematic error analysis of where and why AI models fail. Our findings should therefore be interpreted as an exploratory pilot comparison rather than a definitive benchmark. Future studies should include larger and more diverse item pools covering multiple content domains and cognitive levels, which would enable more granular error analysis and increase the generalizability of results <TextLink reference="19"></TextLink>.</Pgraph><Pgraph>Moreover, the temporal distance between examination preparation and assessment may have differentially affected individual students, and the specific question selection may have influenced the observed performance gap. Additionally, the study focused on multiple-choice questions from a single examination system, which may limit generalizability to other assessment formats or medical education systems. While our study focused on factual knowledge, the evaluation of clinical decision-making requires a wide range of assessment approaches that extend beyond single-best-answer formats. Daniel et al. <TextLink reference="19"></TextLink> have shown that multiple-choice questions can, under certain conditions, be applied to assess aspects of clinical reasoning such as leading diagnosis and treatment decisions; however, such question types were not included in our item set. Therefore, no inference regarding clinical competence or decision-making skills should be drawn from our results.</Pgraph><Pgraph>Future research should examine the stability of AI performance across different medical specialties and question formats, as well as investigate the factors that contribute to knowledge retrieval in both AI systems and human learners. 
Particular attention should be paid to identifying systematic patterns in AI errors and developing methods to address these limitations. The development of more sophisticated evaluation frameworks that account for the temporal dimension of medical knowledge retention would provide valuable insights for medical education and AI development.</Pgraph></TextBlock>
    <TextBlock name="Conclusion" linked="yes">
      <MainHeadline>Conclusion</MainHeadline><Pgraph>This study provides the first direct comparison between AI performance and medical students assessed outside their examination phase, revealing a significant performance advantage for AI systems. While these findings suggest potential applications for AI in medical education, they must be interpreted within the context of the broader literature highlighting both the capabilities and limitations of AI in medical contexts. The superiority demonstrated by AI systems, combined with the identified risks in specialized medical domains, emphasizes the need for careful consideration of AI implementation in medical education and practice. </Pgraph></TextBlock>
    <TextBlock name="Notes" linked="yes">
      <MainHeadline>Notes</MainHeadline><SubHeadline>Author contributions</SubHeadline><Pgraph><UnorderedList><ListItem level="1">Study conception: all authors</ListItem><ListItem level="1">Study realization: DL, ACW, JPK</ListItem><ListItem level="1">Drafting the manuscript: ACW, DL</ListItem><ListItem level="1">Revising the manuscript: all authors</ListItem><ListItem level="1">Data analysis: ACW, DL</ListItem></UnorderedList></Pgraph><SubHeadline>Authors&#8217; ORCIDs</SubHeadline><Pgraph><UnorderedList><ListItem level="1">Daniel Leufkens: <Hyperlink href="https:&#47;&#47;orcid.org&#47;0000-0002-9729-2905">0000-0002-9729-2905</Hyperlink></ListItem><ListItem level="1">J&#246;rn Pons-K&#252;hnemann: <Hyperlink href="https:&#47;&#47;orcid.org&#47;0000-0002-8211-4399">0000-0002-8211-4399</Hyperlink></ListItem><ListItem level="1">Henning Schneider: <Hyperlink href="https:&#47;&#47;orcid.org&#47;0000-0002-9958-4434">0000-0002-9958-4434</Hyperlink></ListItem><ListItem level="1">Anita C. Windhorst: <Hyperlink href="https:&#47;&#47;orcid.org&#47;0000-0002-7357-2080">0000-0002-7357-2080</Hyperlink></ListItem></UnorderedList></Pgraph><SubHeadline>Competing interests</SubHeadline><Pgraph>The authors declare that they have no competing interests.</Pgraph></TextBlock>
    <References linked="yes">
      <Reference refNo="1">
        <RefAuthor>Zong H</RefAuthor>
        <RefAuthor>Li J</RefAuthor>
        <RefAuthor>Wu E</RefAuthor>
        <RefAuthor>Wu R</RefAuthor>
        <RefAuthor>Lu J</RefAuthor>
        <RefAuthor>Shen B</RefAuthor>
        <RefTitle>Performance of ChatGPT on Chinese national medical licensing examinations: a five-year examination evaluation study for physicians, pharmacists and nurses</RefTitle>
        <RefYear>2024</RefYear>
        <RefJournal>BMC Med Educ</RefJournal>
        <RefPage>143</RefPage>
        <RefTotal>Zong H, Li J, Wu E, Wu R, Lu J, Shen B. Performance of ChatGPT on Chinese national medical licensing examinations: a five-year examination evaluation study for physicians, pharmacists and nurses. BMC Med Educ. 2024 Feb 14;24(1):143. 
DOI: 10.1186&#47;s12909-024-05125-7</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1186&#47;s12909-024-05125-7</RefLink>
      </Reference>
      <Reference refNo="2">
        <RefAuthor>Gilson A</RefAuthor>
        <RefAuthor>Safranek CW</RefAuthor>
        <RefAuthor>Huang T</RefAuthor>
        <RefAuthor>Socrates V</RefAuthor>
        <RefAuthor>Chi L</RefAuthor>
        <RefAuthor>Taylor RA</RefAuthor>
        <RefAuthor>Chartash D</RefAuthor>
        <RefTitle>How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)&#63; The Implications of Large Language Models for Medical Education and Knowledge Assessment</RefTitle>
        <RefYear>2023</RefYear>
        <RefJournal>JMIR Med Educ</RefJournal>
        <RefPage>e45312</RefPage>
        <RefTotal>Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, Chartash D. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)&#63; The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023 Feb 8;9:e45312. 
DOI: 10.2196&#47;45312</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.2196&#47;45312</RefLink>
      </Reference>
      <Reference refNo="3">
        <RefAuthor>Brin D</RefAuthor>
        <RefAuthor>Sorin V</RefAuthor>
        <RefAuthor>Konen E</RefAuthor>
        <RefAuthor>Nadkarni G</RefAuthor>
        <RefAuthor>Glicksberg BS</RefAuthor>
        <RefAuthor>Klang E</RefAuthor>
        <RefTitle>How GPT models perform on the United States medical licensing examination: a systematic review</RefTitle>
        <RefYear>2024</RefYear>
        <RefJournal>Discov Appl Sci</RefJournal>
        <RefPage>500</RefPage>
        <RefTotal>Brin D, Sorin V, Konen E, Nadkarni G, Glicksberg BS, Klang E. How GPT models perform on the United States medical licensing examination: a systematic review. Discov Appl Sci. 2024;6(10):500. DOI: 10.1007&#47;s42452-024-06194-5</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1007&#47;s42452-024-06194-5</RefLink>
      </Reference>
      <Reference refNo="4">
        <RefAuthor>Suwa&#322;a S</RefAuthor>
        <RefAuthor>Szulc P</RefAuthor>
        <RefAuthor>Guzowski C</RefAuthor>
        <RefAuthor>Kami&#324;ska B</RefAuthor>
        <RefAuthor>Dorobia&#322;a J</RefAuthor>
        <RefAuthor>Wojciechowska K</RefAuthor>
        <RefAuthor>Berska M</RefAuthor>
        <RefAuthor>Kubicka O</RefAuthor>
        <RefAuthor>Kosturkiewicz O</RefAuthor>
        <RefAuthor>Kosztulska B</RefAuthor>
        <RefAuthor>Rajewska A</RefAuthor>
        <RefAuthor>Junik R</RefAuthor>
        <RefTitle>ChatGPT-3.5 passes Poland&#8217;s medical final examination-Is it possible for ChatGPT to become a doctor in Poland&#63;</RefTitle>
        <RefYear>2024</RefYear>
        <RefJournal>SAGE Open Med</RefJournal>
        <RefPage>20503121241257777</RefPage>
        <RefTotal>Suwa&#322;a S, Szulc P, Guzowski C, Kami&#324;ska B, Dorobia&#322;a J, Wojciechowska K, Berska M, Kubicka O, Kosturkiewicz O, Kosztulska B, Rajewska A, Junik R. ChatGPT-3.5 passes Poland&#8217;s medical final examination-Is it possible for ChatGPT to become a doctor in Poland&#63; SAGE Open Med. 2024 Jun 17;12:20503121241257777. 
DOI: 10.1177&#47;20503121241257777</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1177&#47;20503121241257777</RefLink>
      </Reference>
      <Reference refNo="5">
        <RefAuthor>Vij O</RefAuthor>
        <RefAuthor>Calver H</RefAuthor>
        <RefAuthor>Myall N</RefAuthor>
        <RefAuthor>Dey M</RefAuthor>
        <RefAuthor>Kouranloo K</RefAuthor>
        <RefTitle>Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments</RefTitle>
        <RefYear>2024</RefYear>
        <RefJournal>PLoS One</RefJournal>
        <RefPage>e0307372</RefPage>
        <RefTotal>Vij O, Calver H, Myall N, Dey M, Kouranloo K. Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments. PLoS One. 2024 Jul 31;19(7):e0307372. 
DOI: 10.1371&#47;journal.pone.0307372</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1371&#47;journal.pone.0307372</RefLink>
      </Reference>
      <Reference refNo="6">
        <RefAuthor>Liu M</RefAuthor>
        <RefAuthor>Okuhara T</RefAuthor>
        <RefAuthor>Chang X</RefAuthor>
        <RefAuthor>Shirabe R</RefAuthor>
        <RefAuthor>Nishiie Y</RefAuthor>
        <RefAuthor>Okada H</RefAuthor>
        <RefAuthor>Kiuchi T</RefAuthor>
        <RefTitle>Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis</RefTitle>
        <RefYear>2024</RefYear>
        <RefJournal>J Med Internet Res</RefJournal>
        <RefPage>e60807</RefPage>
        <RefTotal>Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res. 2024 Jul 25;26:e60807. 
DOI: 10.2196&#47;60807</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.2196&#47;60807</RefLink>
      </Reference>
      <Reference refNo="7">
        <RefAuthor>Jin HK</RefAuthor>
        <RefAuthor>Lee HE</RefAuthor>
        <RefAuthor>Kim E</RefAuthor>
        <RefTitle>Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis</RefTitle>
        <RefYear>2024</RefYear>
        <RefJournal>BMC Med Educ</RefJournal>
        <RefPage>1013</RefPage>
        <RefTotal>Jin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Med Educ. 2024 Sep 16;24(1):1013. 
DOI: 10.1186&#47;s12909-024-05944-8</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1186&#47;s12909-024-05944-8</RefLink>
      </Reference>
      <Reference refNo="8">
        <RefAuthor>Longwell JB</RefAuthor>
        <RefAuthor>Hirsch I</RefAuthor>
        <RefAuthor>Binder F</RefAuthor>
        <RefAuthor>Gonzalez Conchas GA</RefAuthor>
        <RefAuthor>Mau D</RefAuthor>
        <RefAuthor>Jang R</RefAuthor>
        <RefAuthor>Krishnan RG</RefAuthor>
        <RefAuthor>Grant RC</RefAuthor>
        <RefTitle>Performance of Large Language Models on Medical Oncology Examination Questions</RefTitle>
        <RefYear>2024</RefYear>
        <RefJournal>JAMA Netw Open</RefJournal>
        <RefPage>e2417641</RefPage>
        <RefTotal>Longwell JB, Hirsch I, Binder F, Gonzalez Conchas GA, Mau D, Jang R, Krishnan RG, Grant RC. Performance of Large Language Models on Medical Oncology Examination Questions. JAMA Netw Open. 2024 Jun 3;7(6):e2417641. 
DOI: 10.1001&#47;jamanetworkopen.2024.17641</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1001&#47;jamanetworkopen.2024.17641</RefLink>
      </Reference>
      <Reference refNo="9">
        <RefAuthor>Tarabanis C</RefAuthor>
        <RefAuthor>Zahid S</RefAuthor>
        <RefAuthor>Mamalis M</RefAuthor>
        <RefAuthor>Zhang K</RefAuthor>
        <RefAuthor>Kalampokis E</RefAuthor>
        <RefAuthor>Jankelson L</RefAuthor>
        <RefTitle>Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions</RefTitle>
        <RefYear>2024</RefYear>
        <RefJournal>PLOS Digit Health</RefJournal>
        <RefPage>e0000604</RefPage>
        <RefTotal>Tarabanis C, Zahid S, Mamalis M, Zhang K, Kalampokis E, Jankelson L. Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions. PLOS Digit Health. 2024 Sep 17;3(9):e0000604. 
DOI: 10.1371&#47;journal.pdig.0000604</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1371&#47;journal.pdig.0000604</RefLink>
      </Reference>
      <Reference refNo="10">
        <RefAuthor>Ebbinghaus H</RefAuthor>
        <RefTitle>Memory: a contribution to experimental psychology</RefTitle>
        <RefYear>2013</RefYear>
        <RefJournal>Ann Neurosci</RefJournal>
        <RefPage>155-6</RefPage>
        <RefTotal>Ebbinghaus H. Memory: a contribution to experimental psychology. Ann Neurosci. 2013 Oct;20(4):155-6. 
DOI: 10.5214&#47;ans.0972.7531.200408</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.5214&#47;ans.0972.7531.200408</RefLink>
      </Reference>
      <Reference refNo="11">
        <RefAuthor>Csaba G</RefAuthor>
        <RefAuthor>Szab&#243; I</RefAuthor>
        <RefAuthor>K&#246;rnyei JL</RefAuthor>
        <RefAuthor>Ker&#233;nyi M</RefAuthor>
        <RefAuthor>F&#252;zesi Z</RefAuthor>
        <RefAuthor>Csath&#243; &#193;</RefAuthor>
        <RefTitle>Variability in knowledge retention of medical students: repeated and recently learned basic science topics</RefTitle>
        <RefYear>2025</RefYear>
        <RefJournal>BMC Med Educ</RefJournal>
        <RefPage>523</RefPage>
        <RefTotal>Csaba G, Szab&#243; I, K&#246;rnyei JL, Ker&#233;nyi M, F&#252;zesi Z, Csath&#243; &#193;. Variability in knowledge retention of medical students: repeated and recently learned basic science topics. BMC Med Educ. 2025 Apr 11;25(1):523. DOI: 10.1186&#47;s12909-025-07096-9</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1186&#47;s12909-025-07096-9</RefLink>
      </Reference>
      <Reference refNo="12">
        <RefAuthor>Anders ME</RefAuthor>
        <RefAuthor>Vuk J</RefAuthor>
        <RefAuthor>Rhee SW</RefAuthor>
        <RefTitle>Interactive retrieval practice in renal physiology improves performance on customized National Board of Medical Examiners examination of medical students</RefTitle>
        <RefYear>2022</RefYear>
        <RefJournal>Adv Physiol Educ</RefJournal>
        <RefPage>35-40</RefPage>
        <RefTotal>Anders ME, Vuk J, Rhee SW. Interactive retrieval practice in renal physiology improves performance on customized National Board of Medical Examiners examination of medical students. Adv Physiol Educ. 2022 Mar 1;46(1):35-40. 
DOI: 10.1152&#47;advan.00118.2021</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1152&#47;advan.00118.2021</RefLink>
      </Reference>
      <Reference refNo="13">
        <RefAuthor>Deng F</RefAuthor>
        <RefAuthor>Gluckstein JA</RefAuthor>
        <RefAuthor>Larsen DP</RefAuthor>
        <RefTitle>Student-directed retrieval practice is a predictor of medical licensing examination performance</RefTitle>
        <RefYear>2015</RefYear>
        <RefJournal>Perspect Med Educ</RefJournal>
        <RefPage>308-13</RefPage>
        <RefTotal>Deng F, Gluckstein JA, Larsen DP. Student-directed retrieval practice is a predictor of medical licensing examination performance. Perspect Med Educ. 2015 Dec;4(6):308-13. 
DOI: 10.1007&#47;s40037-015-0220-x</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1007&#47;s40037-015-0220-x</RefLink>
      </Reference>
      <Reference refNo="14">
        <RefAuthor>Larsen DP</RefAuthor>
        <RefAuthor>Dornan T</RefAuthor>
        <RefTitle>Quizzes and conversations: exploring the role of retrieval in medical education</RefTitle>
        <RefYear>2013</RefYear>
        <RefJournal>Med Educ</RefJournal>
        <RefPage>1236-41</RefPage>
        <RefTotal>Larsen DP, Dornan T. Quizzes and conversations: exploring the role of retrieval in medical education. Med Educ. 2013 Dec;47(12):1236-41. DOI: 10.1111&#47;medu.12274</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1111&#47;medu.12274</RefLink>
      </Reference>
      <Reference refNo="15">
        <RefAuthor>Fraundorf SH</RefAuthor>
        <RefAuthor>Caddick ZA</RefAuthor>
        <RefAuthor>Nokes-Malach TJ</RefAuthor>
        <RefAuthor>Rottman BM</RefAuthor>
        <RefTitle>Cognitive perspectives on maintaining physicians&#39; medical expertise: III. Strengths and weaknesses of self-assessment</RefTitle>
        <RefYear>2023</RefYear>
        <RefJournal>Cogn Res Princ Implic</RefJournal>
        <RefPage>58</RefPage>
        <RefTotal>Fraundorf SH, Caddick ZA, Nokes-Malach TJ, Rottman BM. Cognitive perspectives on maintaining physicians&#39; medical expertise: III. Strengths and weaknesses of self-assessment. Cogn Res Princ Implic. 2023 Aug 30;8(1):58. DOI: 10.1186&#47;s41235-023-00511-z</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1186&#47;s41235-023-00511-z</RefLink>
      </Reference>
      <Reference refNo="16">
        <RefAuthor>Kornell N</RefAuthor>
        <RefAuthor>Hays MJ</RefAuthor>
        <RefAuthor>Bjork RA</RefAuthor>
        <RefTitle>Unsuccessful retrieval attempts enhance subsequent learning</RefTitle>
        <RefYear>2009</RefYear>
        <RefJournal>J Exp Psychol Learn Mem Cogn</RefJournal>
        <RefPage>989-98</RefPage>
        <RefTotal>Kornell N, Hays MJ, Bjork RA. Unsuccessful retrieval attempts enhance subsequent learning. J Exp Psychol Learn Mem Cogn. 2009;35(4):989-98. DOI: 10.1037&#47;a0015729</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1037&#47;a0015729</RefLink>
      </Reference>
      <Reference refNo="17">
        <RefAuthor>Cepeda NJ</RefAuthor>
        <RefAuthor>Pashler H</RefAuthor>
        <RefAuthor>Vul E</RefAuthor>
        <RefAuthor>Wixted JT</RefAuthor>
        <RefAuthor>Rohrer D</RefAuthor>
        <RefTitle>Distributed practice in verbal recall tasks: A review and quantitative synthesis</RefTitle>
        <RefYear>2006</RefYear>
        <RefJournal>Psychol Bull</RefJournal>
        <RefPage>354-80</RefPage>
        <RefTotal>Cepeda NJ, Pashler H, Vul E, Wixted JT, Rohrer D. Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychol Bull. 2006 May;132(3):354-80. DOI: 10.1037&#47;0033-2909.132.3.354</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1037&#47;0033-2909.132.3.354</RefLink>
      </Reference>
      <Reference refNo="18">
        <RefAuthor>Miller GE</RefAuthor>
        <RefTitle>The assessment of clinical skills&#47;competence&#47;performance</RefTitle>
        <RefYear>1990</RefYear>
        <RefJournal>Acad Med</RefJournal>
        <RefPage>S63-7</RefPage>
        <RefTotal>Miller GE. The assessment of clinical skills&#47;competence&#47;performance. Acad Med. 1990 Sep;65(9 Suppl):S63-7. DOI: 10.1097&#47;00001888-199009000-00045</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1097&#47;00001888-199009000-00045</RefLink>
      </Reference>
      <Reference refNo="19">
        <RefAuthor>Daniel M</RefAuthor>
        <RefAuthor>Rencic J</RefAuthor>
        <RefAuthor>Durning SJ</RefAuthor>
        <RefAuthor>Holmboe E</RefAuthor>
        <RefAuthor>Santen SA</RefAuthor>
        <RefAuthor>Lang V</RefAuthor>
        <RefAuthor>Ratcliffe T</RefAuthor>
        <RefAuthor>Gordon D</RefAuthor>
        <RefAuthor>Heist B</RefAuthor>
        <RefAuthor>Lubarsky S</RefAuthor>
        <RefAuthor>Estrada CA</RefAuthor>
        <RefAuthor>Ballard T</RefAuthor>
        <RefAuthor>Artino AR Jr</RefAuthor>
        <RefAuthor>Sergio Da Silva A</RefAuthor>
        <RefAuthor>Cleary T</RefAuthor>
        <RefAuthor>Stojan J</RefAuthor>
        <RefAuthor>Gruppen LD</RefAuthor>
        <RefTitle>Clinical Reasoning Assessment Methods: A Scoping Review and Practical Guidance</RefTitle>
        <RefYear>2019</RefYear>
        <RefJournal>Acad Med</RefJournal>
        <RefPage>902-12</RefPage>
        <RefTotal>Daniel M, Rencic J, Durning SJ, Holmboe E, Santen SA, Lang V, Ratcliffe T, Gordon D, Heist B, Lubarsky S, Estrada CA, Ballard T, Artino AR Jr, Sergio Da Silva A, Cleary T, Stojan J, Gruppen LD. Clinical Reasoning Assessment Methods: A Scoping Review and Practical Guidance. Acad Med. 2019 Jun;94(6):902-12. 
DOI: 10.1097&#47;ACM.0000000000002618</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1097&#47;ACM.0000000000002618</RefLink>
      </Reference>
      <Reference refNo="21">
        <RefAuthor>Bortz J</RefAuthor>
        <RefAuthor>D&#246;ring N</RefAuthor>
        <RefTitle>Quantitative Methoden der Datenerhebung</RefTitle>
        <RefYear>2006</RefYear>
        <RefBookTitle>Forschungsmethoden und Evaluation</RefBookTitle>
        <RefPage>137-293</RefPage>
        <RefTotal>Bortz J, D&#246;ring N. Forschungsmethoden und Evaluation. 4. ed. Berlin, Heidelberg: Springer; 2006. Quantitative Methoden der Datenerhebung; p. 137-293. DOI: 10.1007&#47;978-3-540-33306-7&#95;4</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1007&#47;978-3-540-33306-7&#95;4</RefLink>
      </Reference>
      <Reference refNo="22">
        <RefAuthor>Zubairi NA</RefAuthor>
        <RefAuthor>AlAhmadi TS</RefAuthor>
        <RefAuthor>Ibrahim MH</RefAuthor>
        <RefAuthor>Hegazi MA</RefAuthor>
        <RefAuthor>Gadi FU</RefAuthor>
        <RefTitle>Effective use of Item Analysis to improve the Reliability and Validity of Undergraduate Medical Examinations: Evaluating the same exam over many years: a different approach</RefTitle>
        <RefYear>2025</RefYear>
        <RefJournal>Pak J Med Sci</RefJournal>
        <RefPage>810-5</RefPage>
        <RefTotal>Zubairi NA, AlAhmadi TS, Ibrahim MH, Hegazi MA, Gadi FU. Effective use of Item Analysis to improve the Reliability and Validity of Undergraduate Medical Examinations: Evaluating the same exam over many years: a different approach. Pak J Med Sci. 2025 Mar;41(3):810-5. DOI: 10.12669&#47;pjms.41.3.10693</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.12669&#47;pjms.41.3.10693</RefLink>
      </Reference>
      <Reference refNo="23">
        <RefAuthor>R Core Team</RefAuthor>
        <RefTitle></RefTitle>
        <RefYear>2025</RefYear>
        <RefBookTitle>R: A Language and Environment for Statistical Computing</RefBookTitle>
        <RefPage></RefPage>
        <RefTotal>R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2025.</RefTotal>
      </Reference>
      <Reference refNo="20">
        <RefAuthor>AMBOSS SE</RefAuthor>
        <RefTitle></RefTitle>
        <RefYear></RefYear>
        <RefBookTitle>Physikum 2024</RefBookTitle>
        <RefPage></RefPage>
        <RefTotal>AMBOSS SE. Physikum 2024. &#91;cited 2025 Jul 14&#93;. Available from: https:&#47;&#47;next.amboss.com&#47;de&#47;questions</RefTotal>
        <RefLink>https:&#47;&#47;next.amboss.com&#47;de&#47;questions</RefLink>
      </Reference>
    </References>
    <Media>
      <Tables>
        <Table format="png">
          <MediaNo>1</MediaNo>
          <MediaID>1</MediaID>
          <Caption><Pgraph><Mark1>Table 1: Questions used and their translation</Mark1></Pgraph></Caption>
        </Table>
        <Table format="png">
          <MediaNo>2</MediaNo>
          <MediaID>2</MediaID>
          <Caption><Pgraph><Mark1>Table 2: Results of individual questions and AI performance</Mark1></Pgraph></Caption>
        </Table>
        <NoOfTables>2</NoOfTables>
      </Tables>
      <Figures>
        <Figure width="436" height="280" format="png">
          <MediaNo>1</MediaNo>
          <MediaID>1</MediaID>
          <Caption><Pgraph><Mark1>Figure 1: Distribution of total student scores</Mark1></Pgraph></Caption>
        </Figure>
        <NoOfPictures>1</NoOfPictures>
      </Figures>
      <InlineFigures>
        <NoOfPictures>0</NoOfPictures>
      </InlineFigures>
      <Attachments>
        <NoOfAttachments>0</NoOfAttachments>
      </Attachments>
    </Media>
  </OrigData>
</GmsArticle>