Performance of large language models in oral and maxillofacial surgery examinations.
Int J Oral Maxillofac Surg; 53(10): 881-886, 2024 Oct.
Article in English | MEDLINE | ID: mdl-38926015
ABSTRACT
This study aimed to determine the accuracy of large language models (LLMs) in answering oral and maxillofacial surgery (OMS) multiple choice questions. A total of 259 questions from the university's question bank were answered by the LLMs (GPT-3.5, GPT-4, Llama 2, Gemini, and Copilot). The scores per category as well as the total score out of 259 were recorded and evaluated, with the passing score set at 50%. The mean overall score amongst all LLMs was 62.5%. GPT-4 performed the best (76.8%, 95% confidence interval (CI) 71.4-82.2%), followed by Copilot (72.6%, 95% CI 67.2-78.0%), GPT-3.5 (62.2%, 95% CI 56.4-68.0%), Gemini (58.7%, 95% CI 52.9-64.5%), and Llama 2 (42.5%, 95% CI 37.1-48.6%). There was a statistically significant difference between the scores of the five LLMs overall (χ2 = 79.9, df = 4, P < 0.001) and within all categories except 'basic sciences' (P = 0.129), 'dentoalveolar and implant surgery' (P = 0.052), and 'oral medicine/pathology/radiology' (P = 0.801). The LLMs performed best in 'basic sciences' (68.9%) and poorest in 'pharmacology' (45.9%). The LLMs can be used as adjuncts in teaching, but should not be used for clinical decision-making until the models are further developed and validated.
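To illustrate where confidence intervals like those above come from, here is a minimal sketch of a normal-approximation (Wald) 95% CI for a proportion, applied to GPT-4's 76.8% score on 259 questions. The abstract does not state which CI method the authors used, so this is an assumption for illustration; the resulting interval is close to, but need not exactly match, the published 71.4-82.2%.

```python
import math

def wald_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) 95% CI for a proportion.

    p_hat: observed proportion correct; n: number of questions;
    z: critical value (1.96 for a 95% interval).
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    return p_hat - z * se, p_hat + z * se

# GPT-4: 76.8% correct out of 259 questions (values from the abstract)
lo, hi = wald_ci(0.768, 259)
print(f"{lo:.1%}-{hi:.1%}")  # → 71.7%-81.9%, close to the published 71.4-82.2%
```

The same function reproduces intervals of similar width for the other models, since all were scored on the same 259 questions.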
Main subject: Oral Surgery / Educational Assessment
Limits: Humans
Language: English
Journal subject: Dentistry
Country of affiliation: Singapore
Country of publication: Denmark