Large language models for simplified interventional radiology reports: a comparative analysis
Abstract: Purpose
To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5-Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7B and Mistral-8×7B), in simplifying 109 interventional radiology (IR) reports.
Methods
Qualitative performance was assessed using a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, naturalness, and error rates, including trust-breaking and post-therapy misconduct errors. Quantitative readability was assessed using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests and Bonferroni-corrected p-values were used for statistical analysis.
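The Flesch metrics named above are fixed linear functions of average sentence length and average syllables per word. A minimal Python sketch of the two formulas, using a naive vowel-group syllable heuristic (an assumption for illustration; published scores rely on dictionary-based syllabification, e.g. as implemented in readability libraries):

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count runs of vowels; real tools use
    # dictionary-based syllabification, so counts may differ slightly.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syll = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syll / n_words)

def flesch_kincaid_grade(text: str) -> float:
    # FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syll = sum(count_syllables(w) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (n_syll / n_words) - 15.59
```

Higher FRE indicates easier text (patient-facing material typically targets 60 or above), while FKGL approximates a U.S. school grade level; SMOG and Dale-Chall follow analogous word- and sentence-level formulas.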
Results
Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus on any metric evaluated (all Bonferroni-corrected p = 1), while both outperformed the other assessed models across five qualitative metrics (p < 0.001). GPT-4 had the fewest content and trust-breaking errors, with Claude-3-Opus second. However, all models exhibited some trust-breaking and post-therapy misconduct errors: GPT-4-Turbo and GPT-3.5-Turbo with few-shot prompting showed the lowest error rates, and Mistral-7B and Mistral-8×7B the highest. Quantitatively, GPT-4 surpassed Claude-3-Opus on all readability metrics (all p < 0.001), with a median FRE score of 69.01 (IQR: 64.88–73.14) versus 59.74 (IQR: 55.47–64.01) for Claude-3-Opus. GPT-4 also outperformed GPT-3.5-Turbo and Gemini Ultra (both p < 0.001). Inter-rater reliability was strong (κ = 0.77–0.84).
Conclusions
GPT-4 and Claude-3-Opus demonstrated superior performance in generating simplified IR reports, but the presence of errors across all models, including trust-breaking errors, highlights the need for further refinement and validation before clinical implementation.
Clinical relevance/applications
With the increasing complexity of interventional radiology (IR) procedures and the growing availability of electronic health records, simplifying IR reports is critical to improving patient understanding and clinical decision-making. This study provides insights into the performance of various LLMs in rewriting IR reports, which can help in selecting the most suitable model for clinical patient-centered applications.
- Location: Deutsche Nationalbibliothek Frankfurt am Main
- Extent: Online resource
- Language: English
- Notes: Academic Radiology 32(2) (2025), 888–898, ISSN 1878-4046
- Event: Publication (where: Freiburg; who: Universität; when: 2024)
- Creators: Can, Elif; Uller, Wibke; Vogt, Katharina; Doppler, Michael Christian; Busch, Felix; Bayerl, Nadine Christina; Ellmann, Stephan; Kader, Avan; Elkilany, Aboelyazid; Makowski, Marcus Richard; Bressem, Keno K.; Adams, Lisa C.
- DOI: 10.1016/j.acra.2024.09.041
- URN: urn:nbn:de:bsz:25-freidok-2573407
- Rights information: Open access; access to the object is unrestricted.
- Last updated: 15.08.2025, 07:21 CEST
- Data partner: Deutsche Nationalbibliothek