Performance of large language models in oral and maxillofacial surgery examinations.

Quah, B; Yong, C W; Lai, C W M; Islam, I

Affiliation

Quah B; Faculty of Dentistry, National University of Singapore, Singapore; Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, Singapore.
Yong CW; Faculty of Dentistry, National University of Singapore, Singapore; Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, Singapore.
Lai CWM; Faculty of Dentistry, National University of Singapore, Singapore.
Islam I; Faculty of Dentistry, National University of Singapore, Singapore; Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, Singapore. Electronic address: denii@nus.edu.sg.

Int J Oral Maxillofac Surg; 53(10): 881-886, 2024 Oct.

Article in English | MEDLINE | ID: mdl-38926015

ABSTRACT

This study aimed to determine the accuracy of large language models (LLMs) in answering oral and maxillofacial surgery (OMS) multiple choice questions. A total of 259 questions from the university's question bank were answered by the LLMs (GPT-3.5, GPT-4, Llama 2, Gemini, and Copilot). The scores per category as well as the total score out of 259 were recorded and evaluated, with the passing score set at 50%. The mean overall score amongst all LLMs was 62.5%. GPT-4 performed the best (76.8%, 95% confidence interval (CI) 71.4-82.2%), followed by Copilot (72.6%, 95% CI 67.2-78.0%), GPT-3.5 (62.2%, 95% CI 56.4-68.0%), Gemini (58.7%, 95% CI 52.9-64.5%), and Llama 2 (42.5%, 95% CI 37.1-48.6%). There was a statistically significant difference between the scores of the five LLMs overall (χ² = 79.9, df = 4, P < 0.001) and within all categories except 'basic sciences' (P = 0.129), 'dentoalveolar and implant surgery' (P = 0.052), and 'oral medicine/pathology/radiology' (P = 0.801). The LLMs performed best in 'basic sciences' (68.9%) and poorest in 'pharmacology' (45.9%). The LLMs can be used as adjuncts in teaching, but should not be used for clinical decision-making until the models are further developed and validated.
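
The overall comparison reported above can be approximately retraced from the abstract alone. The sketch below is not the authors' analysis: it assumes the number of correct answers per model can be recovered by rounding each reported percentage of 259 questions, and it assumes the overall test was a Pearson chi-square test on the resulting 5 × 2 table (model by correct/incorrect). All variable names are illustrative.

```python
import math

TOTAL_QUESTIONS = 259

# Overall percentages as reported in the abstract.
reported_pct = {
    "GPT-4": 76.8,
    "Copilot": 72.6,
    "GPT-3.5": 62.2,
    "Gemini": 58.7,
    "Llama 2": 42.5,
}

# ASSUMPTION: raw counts are reconstructed by rounding; the abstract
# does not list them.
correct = {m: round(p / 100 * TOTAL_QUESTIONS) for m, p in reported_pct.items()}
incorrect = {m: TOTAL_QUESTIONS - c for m, c in correct.items()}

n_models = len(correct)

# Every model answered the same number of questions, so the expected
# count in each column is simply the column mean.
expected_correct = sum(correct.values()) / n_models
expected_incorrect = TOTAL_QUESTIONS - expected_correct

# Pearson chi-square statistic over the 5 x 2 contingency table.
chi2 = sum(
    (correct[m] - expected_correct) ** 2 / expected_correct
    + (incorrect[m] - expected_incorrect) ** 2 / expected_incorrect
    for m in correct
)
dof = (n_models - 1) * (2 - 1)

# For 4 degrees of freedom the chi-square survival function has the
# closed form exp(-x/2) * (1 + x/2).
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

pooled_pct = 100 * sum(correct.values()) / (n_models * TOTAL_QUESTIONS)

print(correct)
print(f"pooled accuracy = {pooled_pct:.2f}%")
print(f"chi2 = {chi2:.1f}, df = {dof}, P < 0.001: {p_value < 0.001}")
```

Under these assumptions the statistic comes out at about 79.9 with 4 degrees of freedom, in line with the reported values, and the pooled accuracy is close to the reported mean of 62.5%. The per-category tests cannot be retraced this way because the abstract does not give per-category counts.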

Subject(s)

Educational Measurement; Surgery, Oral; Humans; Educational Measurement/methods; Language; Surveys and Questionnaires

Keywords

Academic performance; Artificial intelligence; Dental education; Dentistry; Oral surgery

Full text: 1
Collection: 01-international
Database: MEDLINE
Main subject: Surgery, Oral / Educational Measurement
Limit: Humans
Language: En
Journal: Int J Oral Maxillofac Surg
Journal subject: DENTISTRY
Year: 2024
Document type: Article
Affiliation country: Singapore
Publication country: Denmark