Use of a large language model with instruction-tuning for reliable clinical frailty scoring.
Kee, Xiang Lee Jamie; Sng, Gerald Gui Ren; Lim, Daniel Yan Zheng; Tung, Joshua Yi Min; Abdullah, Hairil Rizal; Chowdury, Anupama Roy.
Affiliations
  • Kee XLJ; Department of Geriatric Medicine, Singapore General Hospital, Singapore, Singapore.
  • Sng GGR; Department of Endocrinology, Singapore General Hospital, Singapore, Singapore.
  • Lim DYZ; Data Science and Artificial Intelligence Laboratory, Singapore General Hospital, Singapore, Singapore.
  • Tung JYM; Data Science and Artificial Intelligence Laboratory, Singapore General Hospital, Singapore, Singapore.
  • Abdullah HR; Department of Gastroenterology, Singapore General Hospital, Singapore, Singapore.
  • Chowdury AR; Data Science and Artificial Intelligence Laboratory, Singapore General Hospital, Singapore, Singapore.
J Am Geriatr Soc ; 2024 Aug 06.
Article in English | MEDLINE | ID: mdl-39105505
ABSTRACT

BACKGROUND:

Frailty is an important predictor of health outcomes, characterized by increased vulnerability due to physiological decline. The Clinical Frailty Scale (CFS) is commonly used for frailty assessment but may be influenced by rater bias. The use of artificial intelligence (AI), particularly large language models (LLMs), offers a promising method for efficient and reliable frailty scoring.

METHODS:

The study used seven standardized patient scenarios to evaluate the consistency and reliability of CFS scoring by OpenAI's GPT-3.5-turbo model. Two methods were tested: a basic prompt and an instruction-tuned prompt incorporating the CFS definition, a directive for accurate responses, and temperature control. The outputs of the two prompts were compared using the Mann-Whitney U test, inter-rater reliability was assessed with Fleiss' Kappa, and the results were compared with historical human scores of the same scenarios.
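For illustration only (the abstract does not give the authors' exact prompts or settings), the following is a minimal sketch of how an instruction-tuned prompt with temperature control might be submitted to GPT-3.5-turbo via the OpenAI Python SDK. The prompt wording, the temperature value, and the score_frailty helper are assumptions, not details from the paper.

```python
# Illustrative sketch only -- not the authors' code. Assumes the OpenAI Python
# SDK (v1.x) and an OPENAI_API_KEY in the environment; the prompt text and
# temperature value are placeholders.
from openai import OpenAI

client = OpenAI()

# Hypothetical instruction-tuned system prompt: embeds the CFS definition and
# a directive to answer accurately with a single score.
SYSTEM_PROMPT = (
    "You are a geriatrician. Score the patient on the Clinical Frailty Scale "
    "(CFS, 1 = very fit ... 9 = terminally ill) using the definitions below. "
    "Answer with a single integer from 1 to 9 and nothing else.\n\n"
    "<CFS level definitions would be inserted here>"
)

def score_frailty(scenario: str) -> str:
    """Return the model's CFS score for one standardized patient scenario."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,  # low temperature to reduce run-to-run variability
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": scenario},
        ],
    )
    return response.choices[0].message.content.strip()
```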

RESULTS:

The LLM's median scores were similar to those of human raters, with differences of no more than one point. Significant differences in score distributions were observed between the basic and instruction-tuned prompts in five of seven scenarios. The instruction-tuned prompt showed high inter-rater reliability (Fleiss' Kappa of 0.887) and produced consistent responses in all scenarios. Scoring difficulty was noted in scenarios with less explicit information on activities of daily living (ADLs).
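For context, below is a hedged sketch of how agreement and distribution statistics of the kind reported here can be computed in Python. The score matrices are synthetic placeholders and the variable names are illustrative; none of the values are drawn from the study.

```python
# Illustrative sketch only: Fleiss' Kappa across repeated LLM runs and a
# Mann-Whitney U comparison of two prompts' score distributions.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = standardized scenarios, columns = repeated runs ("raters") of the
# instruction-tuned prompt; entries are CFS scores (1-9). Placeholder data.
tuned_scores = np.array([
    [3, 3, 3, 3, 3],
    [6, 6, 6, 6, 6],
    [4, 4, 5, 4, 4],
])

# Fleiss' Kappa expects per-subject category counts, so aggregate first.
counts, _ = aggregate_raters(tuned_scores)
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' Kappa: {kappa:.3f}")

# Mann-Whitney U: compare basic-prompt vs. instruction-tuned scores for one
# scenario (placeholder values).
basic_scores = np.array([4, 5, 5, 6, 4])
stat, p = mannwhitneyu(basic_scores, tuned_scores[0], alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4f}")
```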

CONCLUSIONS:

This study demonstrates the potential of LLMs to score clinical frailty consistently and with high reliability. It shows that prompt engineering via instruction-tuning can be a simple but effective approach to optimizing LLMs for healthcare applications. The LLM may overestimate frailty scores when less information about ADLs is provided, possibly because it is less prone than human raters to implicit assumptions and extrapolation. Future research could explore the integration of LLMs into clinical research and frailty-related outcome prediction.

Full text: 1 Collection: 01-international Database: MEDLINE Language: English Journal: J Am Geriatr Soc Year: 2024 Document type: Article Country of affiliation: Singapore Country of publication: United States