Búsqueda | Portal Regional de la BVS

1.

Prompt Engineering Paradigms for Medical Applications: Scoping Review.

Zaghir, Jamil; Naguib, Marco; Bjelogrlic, Mina; Névéol, Aurélie; Tannier, Xavier; Lovis, Christian.

J Med Internet Res ; 26: e60501, 2024 Sep 10.

Artículo en Inglés | MEDLINE | ID: mdl-39255030

RESUMEN

BACKGROUND: Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capabilities at harnessing the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and language technicity. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in exploiting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored. OBJECTIVE: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice. METHODS: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering-based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD). RESULTS: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each of the key prompt engineering-specific information reported across papers and find that many studies neglect to explicitly mention them, posing a challenge for advancing prompt engineering research. CONCLUSIONS: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing medical prompt engineering papers available and hope that future contributions will leverage these existing works to better advance the field.

Asunto(s)

Procesamiento de Lenguaje Natural , Humanos , Informática Médica/métodos

2.

Analysis of Responses of GPT-4 V to the Japanese National Clinical Engineer Licensing Examination.

Ishida, Kai; Arisaka, Naoya; Fujii, Kiyotaka.

J Med Syst ; 48(1): 83, 2024 Sep 11.

Artículo en Inglés | MEDLINE | ID: mdl-39259341

RESUMEN

Chat Generative Pretrained Transformer (ChatGPT; OpenAI) is a state-of-the-art large language model that can simulate human-like conversations based on user input. We evaluated the performance of GPT-4 V in the Japanese National Clinical Engineer Licensing Examination using 2,155 questions from 2012 to 2023. The average correct answer rate for all questions was 86.0%. In particular, clinical medicine, basic medicine, medical materials, biological properties, and mechanical engineering achieved a correct response rate of ≥ 90%. Conversely, medical device safety management, electrical and electronic engineering, and extracorporeal circulation obtained low correct answer rates ranging from 64.8% to 76.5%. The correct answer rates for questions that included figures/tables, required numerical calculation, figure/table â© calculation, and knowledge of Japanese Industrial Standards were 55.2%, 85.8%, 64.2% and 31.0%, respectively. The reason for the low correct answer rates is that ChatGPT lacked recognition of the images and knowledge of standards and laws. This study concludes that careful attention is required when using ChatGPT because several of its explanations lack the correct description.

Asunto(s)

Ingeniería Biomédica , Japón , Humanos , Ingeniería Biomédica/organización & administración , Concesión de Licencias/normas , Evaluación Educacional/métodos , Pueblos del Este de Asia

3.

Assessing the role of evolutionary information for enhancing protein language model embeddings.

Erckert, Kyra; Rost, Burkhard.

Sci Rep ; 14(1): 20692, 2024 09 05.

Artículo en Inglés | MEDLINE | ID: mdl-39237735

RESUMEN

Embeddings from protein Language Models (pLMs) are replacing evolutionary information from multiple sequence alignments (MSAs) as the most successful input for protein prediction. Is this because embeddings capture evolutionary information? We tested various approaches to explicitly incorporate evolutionary information into embeddings on various protein prediction tasks. While older pLMs (SeqVec, ProtBert) significantly improved through MSAs, the more recent pLM ProtT5 did not benefit. For most tasks, pLM-based outperformed MSA-based methods, and the combination of both even decreased performance for some (intrinsic disorder). We highlight the effectiveness of pLM-based methods and find limited benefits from integrating MSAs.

Asunto(s)

Evolución Molecular , Proteínas , Alineación de Secuencia , Proteínas/metabolismo , Proteínas/genética , Proteínas/química , Alineación de Secuencia/métodos , Biología Computacional/métodos , Algoritmos , Programas Informáticos , Análisis de Secuencia de Proteína/métodos

4.

The Potential of Gemini and GPTs for Structured Report Generation based on Free-Text ¹⁸F-FDG PET/CT Breast Cancer Reports.

Chen, Kun; Xu, Wengui; Li, Xiaofeng.

Acad Radiol ; 2024 Sep 07.

Artículo en Inglés | MEDLINE | ID: mdl-39245597

RESUMEN

RATIONALE AND OBJECTIVE: To compare the performance of large language model (LLM) based Gemini and Generative Pre-trained Transformers (GPTs) in data mining and generating structured reports based on free-text PET/CT reports for breast cancer after user-defined tasks. MATERIALS AND METHODS: Breast cancer patients (mean age, 50 years ± 11 [SD]; all female) who underwent consecutive 18F-FDG PET/CT for follow-up between July 2005 and October 2023 were retrospectively included in the study. A total of twenty reports from 10 patients were used to train user-defined text prompts for Gemini and GPTs, by which structured PET/CT reports were generated. The natural language processing (NLP) generated structured reports and the structured reports annotated by nuclear medicine physicians were compared in terms of data extraction accuracy and capacity of progress decision-making. Statistical methods, including chi-square test, McNemar test and paired samples t-test, were employed in the study. RESULTS: The structured PET/CT reports for 131 patients were generated by using the two NLP techniques, including Gemini and GPTs. In general, GPTs exhibited superiority over Gemini in data mining in terms of primary lesion size (89.6% vs. 53.8%, p < 0.001) and metastatic lesions (96.3% vs 89.6%, p < 0.001). Moreover, GPTs outperformed Gemini in making decision for progress (p < 0.001) and semantic similarity (F1 score 0.930 vs 0.907, p < 0.001) for reports. CONCLUSION: GPTs outperformed Gemini in generating structured reports based on free-text PET/CT reports, which is potentially applied in clinical practice. DATA AVAILABILITY: The data used and/or analyzed during the current study are available from the corresponding author on reasonable request.

5.

Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5^th edition.

Günes, Yasin Celal; Cesur, Turay; Çamur, Eren; Günbey Karabekmez, Leman.

Diagn Interv Radiol ; 2024 Sep 09.

Artículo en Inglés | MEDLINE | ID: mdl-39248152

RESUMEN

PURPOSE: This study aimed to evaluate the performance of large language models (LLMs) and multimodal LLMs in interpreting the Breast Imaging Reporting and Data System (BI-RADS) categories and providing clinical management recommendations for breast radiology in text-based and visual questions. METHODS: This cross-sectional observational study involved two steps. In the first step, we compared ten LLMs (namely ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K), general radiologists, and a breast radiologist using 100 text-based multiple-choice questions (MCQs) related to the BI-RADS Atlas 5th edition. In the second step, we assessed the performance of five multimodal LLMs (ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro) in assigning BI-RADS categories and providing clinical management recommendations on 100 breast ultrasound images. The comparison of correct answers and accuracy by question types was analyzed using McNemar's and chi-squared tests. Management scores were analyzed using the Kruskal- Wallis and Wilcoxon tests. RESULTS: Claude 3.5 Sonnet achieved the highest accuracy in text-based MCQs (90%), followed by ChatGPT 4o (89%), outperforming all other LLMs and general radiologists (78% and 76%) (P < 0.05), except for the Claude 3 Opus models and the breast radiologist (82%) (P > 0.05). Lower-performing LLMs included Google Gemini 1.0 (61%) and ChatGPT 3.5 (60%). Performance across different categories of showed no significant variation among LLMs or radiologists (P > 0.05). For breast ultrasound images, Claude 3.5 Sonnet achieved 59% accuracy, significantly higher than other multimodal LLMs (P < 0.05). Management recommendations were evaluated using a 3-point Likert scale, with Claude 3.5 Sonnet scoring the highest (mean: 2.12 ± 0.97) (P < 0.05). Accuracy varied significantly across BI-RADS categories, except Claude 3 Opus (P < 0.05). Gemini 1.5 Pro failed to answer any BI-RADS 5 questions correctly. Similarly, ChatGPT 4V failed to answer any BI-RADS 1 questions correctly, making them the least accurate in these categories (P < 0.05). CONCLUSION: Although LLMs such as Claude 3.5 Sonnet and ChatGPT 4o show promise in text-based BI-RADS assessments, their limitations in visual diagnostics suggest they should be used cautiously and under radiologists' supervision to avoid misdiagnoses. CLINICAL SIGNIFICANCE: This study demonstrates that while LLMs exhibit strong capabilities in text-based BI-RADS assessments, their visual diagnostic abilities are currently limited, necessitating further development and cautious application in clinical practice.

6.

Large language model produces high accurate diagnosis of cancer from end-motif profiles of cell-free DNA.

Liu, Jilei; Shen, Hongru; Chen, Kexin; Li, Xiangchun.

Brief Bioinform ; 25(5)2024 Jul 25.

Artículo en Inglés | MEDLINE | ID: mdl-39222060

RESUMEN

Instruction-tuned large language models (LLMs) demonstrate exceptional ability to align with human intentions. We present an LLM-based model-instruction-tuned LLM for assessment of cancer (iLLMAC)-that can detect cancer using cell-free deoxyribonucleic acid (cfDNA) end-motif profiles. Developed on plasma cfDNA sequencing data from 1135 cancer patients and 1106 controls across three datasets, iLLMAC achieved area under the receiver operating curve (AUROC) of 0.866 [95% confidence interval (CI), 0.773-0.959] for cancer diagnosis and 0.924 (95% CI, 0.841-1.0) for hepatocellular carcinoma (HCC) detection using 16 end-motifs. Performance increased with more motifs, reaching 0.886 (95% CI, 0.794-0.977) and 0.956 (95% CI, 0.89-1.0) for cancer diagnosis and HCC detection, respectively, with 64 end-motifs. On an external-testing set, iLLMAC achieved AUROC of 0.912 (95% CI, 0.849-0.976) for cancer diagnosis and 0.938 (95% CI, 0.885-0.992) for HCC detection with 64 end-motifs, significantly outperforming benchmarked methods. Furthermore, iLLMAC achieved high classification performance on datasets with bisulfite and 5-hydroxymethylcytosine sequencing. Our study highlights the effectiveness of LLM-based instruction-tuning for cfDNA-based cancer detection.

Asunto(s)

Carcinoma Hepatocelular , Ácidos Nucleicos Libres de Células , Humanos , Ácidos Nucleicos Libres de Células/sangre , Carcinoma Hepatocelular/diagnóstico , Carcinoma Hepatocelular/genética , Carcinoma Hepatocelular/sangre , Neoplasias Hepáticas/diagnóstico , Neoplasias Hepáticas/genética , Neoplasias Hepáticas/sangre , Neoplasias/diagnóstico , Neoplasias/genética , Neoplasias/sangre , Curva ROC , Biomarcadores de Tumor/genética , Biomarcadores de Tumor/sangre , Motivos de Nucleótidos , Metilación de ADN

7.

Exploring the role of Large Language Models in haematology: A focused review of applications, benefits and limitations.

Mudrik, Aya; Nadkarni, Girish N; Efros, Orly; Glicksberg, Benjamin S; Klang, Eyal; Soffer, Shelly.

Br J Haematol ; 2024 Sep 03.

Artículo en Inglés | MEDLINE | ID: mdl-39226157

RESUMEN

Large language models (LLMs) have significantly impacted various fields with their ability to understand and generate human-like text. This study explores the potential benefits and limitations of integrating LLMs, such as ChatGPT, into haematology practices. Utilizing systematic review methodologies, we analysed studies published after 1 December 2022, from databases like PubMed, Web of Science and Scopus, and assessing each for bias with the QUADAS-2 tool. We reviewed 10 studies that applied LLMs in various haematology contexts. These models demonstrated proficiency in specific tasks, such as achieving 76% diagnostic accuracy for haemoglobinopathies. However, the research highlighted inconsistencies in performance and reference accuracy, indicating variability in reliability across different uses. Additionally, the limited scope of these studies and constraints on datasets could potentially limit the generalizability of our findings. The findings suggest that, while LLMs provide notable advantages in enhancing diagnostic processes and educational resources within haematology, their integration into clinical practice requires careful consideration. Before implementing them in haematology, rigorous testing and specific adaptation are essential. This involves validating their accuracy and reliability across different scenarios. Given the field's complexity, it is also critical to continuously monitor these models and adapt them responsively.

8.

A scarce dataset for ancient Arabic handwritten text recognition.

Najam, Rayyan; Faizullah, Safiullah.

Data Brief ; 56: 110813, 2024 Oct.

Artículo en Inglés | MEDLINE | ID: mdl-39252777

RESUMEN

Developing Deep Learning Optical Character Recognition is an active area of research, where models based on deep neural networks are trained on data to eventually extract text within an image. Even though many advances are currently being made in this area in general, the Arabic OCR domain notably lacks a dataset for ancient manuscripts. Here, we fill this gap by providing both the image and textual ground truth for a collection of ancient Arabic manuscripts. This scarce dataset is collected from the central library of the Islamic University of Madinah, and it encompasses rich text spanning different geographies across centuries. Specifically, eight ancient books with a total of forty pages, both images and text, transcribed by the experts, are present in this dataset. Particularly, this dataset holds a significant value due to the unavailability of such data publicly, which conspicuously contributes to the deep learning models development/augmenting, validation, testing, and generalization by researchers and practitioners, both for the tasks of Arabic OCR and Arabic text correction.

9.

Prediction of tumor board procedural recommendations using large language models.

Aubreville, Marc; Ganz, Jonathan; Ammeling, Jonas; Rosbach, Emely; Gehrke, Thomas; Scherzad, Agmal; Hackenberg, Stephan; Goncalves, Miguel.

Eur Arch Otorhinolaryngol ; 2024 Sep 13.

Artículo en Inglés | MEDLINE | ID: mdl-39266750

RESUMEN

INTRODUCTION: Multidisciplinary tumor boards are meetings where a team of medical specialists, including medical oncologists, radiation oncologists, radiologists, surgeons, and pathologists, collaborate to determine the best treatment plan for cancer patients. While decision-making in this context is logistically and cost-intensive, it has a significant positive effect on overall cancer survival. METHODS : We evaluated the quality and accuracy of predictions by several large language models for recommending procedures by a Head and Neck Oncology tumor board, which we adapted for the task using parameter-efficient fine-tuning or in-context learning. Records were divided into two sets: n=229 used for training and n=100 records for validation of our approaches. Randomized, blinded, manual human expert classification was used to evaluate the different models. RESULTS : Treatment line congruence varied depending on the model, reaching up to 86%, with medically justifiable recommendations up to 98%. Parameter-efficient fine-tuning yielded better outcomes than in-context learning, and larger/commercial models tend to perform better. CONCLUSION: Providing precise, medically justifiable procedural recommendations for complex oncology patients is feasible. Extending the data corpus to a larger patient cohort and incorporating the latest guidelines, assuming the model can handle sufficient context length, could result in more factual and guideline-aligned responses and is anticipated to enhance model performance. We, therefore, encourage further research in this direction to improve the efficacy and reliability of large language models as support in medical decision-making processes.

10.

ChatGPT efficacy for answering musculoskeletal anatomy questions: a study evaluating quality and consistency between raters and timepoints.

Mantzou, Nikolaos; Ediaroglou, Vasileios; Drakonaki, Elena; Syggelos, Spyros A; Karageorgos, Filippos F; Totlis, Trifon.

Surg Radiol Anat ; 2024 Sep 12.

Artículo en Inglés | MEDLINE | ID: mdl-39264461

RESUMEN

PURPOSE: There is increasing interest in the use of digital platforms such as ChatGPT for anatomy education. This study aims to evaluate the efficacy of ChatGPT in providing accurate and consistent responses to questions focusing on musculoskeletal anatomy across various time points (hours and days). METHODS: A selection of 6 Anatomy-related questions were asked to ChatGPT 3.5 in 4 different timepoints. All answers were rated blindly by 3 expert raters for quality according to a 5 -point Likert Scale. Difference of 0 or 1 points in Likert scale scores between raters was considered as agreement and between different timepoints was considered as consistent indicating good reproducibility. RESULTS: There was significant variation in the quality of the answers ranging from extremely good to very poor quality. There was also variation of consistency levels between different timepoints. Answers were rated as good quality (≥ 3 in Likert scale) in 50% of cases (3/6) and as consistent in 66.6% (4/6) of cases. In the low-quality answers, significant mistakes, conflicting data or lack of information were encountered. CONCLUSION: As of the time of this article, the quality and consistency of the ChatGPT v3.5 answers is variable, thus limiting its utility as independent and reliable resource of learning musculoskeletal anatomy. Validating information by reviewing the anatomical literature is highly recommended.

11.

Potential of ChatGPT to Pass the Japanese Medical and Healthcare Professional National Licenses: A Literature Review.

Ishida, Kai; Hanada, Eisuke.

Cureus ; 16(8): e66324, 2024 Aug.

Artículo en Inglés | MEDLINE | ID: mdl-39247019

RESUMEN

This systematic review aimed to assess the academic potential of ChatGPT (GPT-3.5, 4, and 4V) for Japanese national medical and healthcare licensing examinations, taking into account its strengths and limitations. Electronic databases such as PubMed/Medline, Google Scholar, and ICHUSHI (a Japanese medical article database) were systematically searched for relevant articles, particularly those published between January 1, 2022, and April 30, 2024. A formal narrative analysis was conducted by systematically arranging similarities and differences between individual research findings together. After rigorous screening, we reviewed 22 articles. Except for one article, all articles that evaluated GPT-4 showed that this tool could pass each exam containing text only. However, some studies also reported that, despite the possibility to pass, the results of GPT-4 were worse than those of the actual examinee. Moreover, the newest model GPT-4V insufficiently recognized images, thereby providing insufficient answers to questions that involved images and figures/tables. Therefore, their precision needs to be improved to obtain better results.

12.

Distinguishing word identity and sequence context in DNA language models.

Sanabria, Melissa; Hirsch, Jonas; Poetsch, Anna R.

BMC Bioinformatics ; 25(1): 301, 2024 Sep 13.

Artículo en Inglés | MEDLINE | ID: mdl-39272021

RESUMEN

Transformer-based large language models (LLMs) are very suited for biological sequence data, because of analogies to natural language. Complex relationships can be learned, because a concept of "words" can be generated through tokenization. Training the models with masked token prediction, they learn both token sequence identity and larger sequence context. We developed methodology to interrogate model learning, which is both relevant for the interpretability of the model and to evaluate its potential for specific tasks. We used DNABERT, a DNA language model trained on the human genome with overlapping k-mers as tokens. To gain insight into the model's learning, we interrogated how the model performs predictions, extracted token embeddings, and defined a fine-tuning benchmarking task to predict the next tokens of different sizes without overlaps. This task evaluates foundation models without interrogating specific genome biology, it does not depend on tokenization strategies, vocabulary size, the dictionary, or the number of training parameters. Lastly, there is no leakage of information from token identity into the prediction task, which makes it particularly useful to evaluate the learning of sequence context. We discovered that the model with overlapping k-mers struggles to learn larger sequence context. Instead, the learned embeddings largely represent token sequence. Still, good performance is achieved for genome-biology-inspired fine-tuning tasks. Models with overlapping tokens may be used for tasks where a larger sequence context is of less relevance, but the token sequence directly represents the desired learning features. This emphasizes the need to interrogate knowledge representation in biological LLMs.

Asunto(s)

ADN , Humanos , ADN/química , Genoma Humano , Análisis de Secuencia de ADN/métodos , Procesamiento de Lenguaje Natural , Biología Computacional/métodos

13.

Large Language Model Prompting Techniques for Advancement in Clinical Medicine.

Shah, Krish; Xu, Andrew Y; Sharma, Yatharth; Daher, Mohammed; McDonald, Christopher; Diebo, Bassel G; Daniels, Alan H.

J Clin Med ; 13(17)2024 Aug 28.

Artículo en Inglés | MEDLINE | ID: mdl-39274316

RESUMEN

Large Language Models (LLMs have the potential to revolutionize clinical medicine by enhancing healthcare access, diagnosis, surgical planning, and education. However, their utilization requires careful, prompt engineering to mitigate challenges like hallucinations and biases. Proper utilization of LLMs involves understanding foundational concepts such as tokenization, embeddings, and attention mechanisms, alongside strategic prompting techniques to ensure accurate outputs. For innovative healthcare solutions, it is essential to maintain ongoing collaboration between AI technology and medical professionals. Ethical considerations, including data security and bias mitigation, are critical to their application. By leveraging LLMs as supplementary resources in research and education, we can enhance learning and support knowledge-based inquiries, ultimately advancing the quality and accessibility of medical care. Continued research and development are necessary to fully realize the potential of LLMs in transforming healthcare.

14.

SpecRep: Adversary Emulation Based on Attack Objective Specification in Heterogeneous Infrastructures.

Portase, Radu Marian; Colesa, Adrian; Sebestyen, Gheorghe.

Sensors (Basel) ; 24(17)2024 Aug 29.

Artículo en Inglés | MEDLINE | ID: mdl-39275512

RESUMEN

Cybercriminals have become an imperative threat because they target the most valuable resource on earth, data. Organizations prepare against cyber attacks by creating Cyber Security Incident Response Teams (CSIRTs) that use various technologies to monitor and detect threats and to help perform forensics on machines and networks. Testing the limits of defense technologies and the skill of a CSIRT can be performed through adversary emulation performed by so-called "red teams". The red team's work is primarily manual and requires high skill. We propose SpecRep, a system to ease the testing of the detection capabilities of defenses in complex, heterogeneous infrastructures. SpecRep uses previously known attack specifications to construct attack scenarios based on attacker objectives instead of the traditional attack graphs or a list of actions. We create a metalanguage to describe objectives to be achieved in an attack together with a compiler that can build multiple attack scenarios that achieve the objectives. We use text processing tools aided by large language models to extract information from freely available white papers and convert them to plausible attack specifications that can then be emulated by SpecRep. We show how our system can emulate attacks against a smart home, a large enterprise, and an industrial control system.

15.

MedConceptsQA: Open source medical concepts QA benchmark.

Shoham, Ofir Ben; Rappoport, Nadav.

Comput Biol Med ; 182: 109089, 2024 Sep 13.

Artículo en Inglés | MEDLINE | ID: mdl-39276611

RESUMEN

BACKGROUND: Clinical data often includes both standardized medical codes and natural language texts. This highlights the need for Clinical Large Language Models to understand these codes and their differences. We introduce a benchmark for evaluating the understanding of medical codes by various Large Language Models. METHODS: We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises of questions of various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We conduct evaluations of the benchmark using various Large Language Models. RESULTS: Our findings show that most of the pre-trained clinical Large Language Models achieved accuracy levels close to random guessing on this benchmark, despite being pre-trained on medical data. However, GPT-4 achieves an absolute average improvement of 9-11% (9% for few-shot learning and 11% for zero-shot learning) compared to Llama3-OpenBioLLM-70B, the clinical Large Language Model that achieved the best results. CONCLUSION: Our benchmark serves as a valuable resource for evaluating the abilities of Large Language Models to interpret medical codes and distinguish between medical concepts. We demonstrate that most of the current state-of-the-art clinical Large Language Models achieve random guess performance, whereas GPT-3.5, GPT-4, and Llama3-70B outperform these clinical models, despite their primary focus during pre-training not being on the medical domain. Our benchmark is available at https://huggingface.co/datasets/ofir408/MedConceptsQA.

16.

Benchmarking Human-AI Collaboration for Common Evidence Appraisal Tools.

Woelfle, Tim; Hirt, Julian; Janiaud, Perrine; Kappos, Ludwig; Ioannidis, John P A; Hemkens, Lars G.

J Clin Epidemiol ; : 111533, 2024 Sep 12.

Artículo en Inglés | MEDLINE | ID: mdl-39277058

RESUMEN

BACKGROUND: It is unknown whether large language models (LLMs) may facilitate time- and resource-intensive text-related processes in evidence appraisal. OBJECTIVES: To quantify the agreement of LLMs with human consensus in appraisal of scientific reporting (PRISMA) and methodological rigor (AMSTAR) of systematic reviews and design of clinical trials (PRECIS-2). To identify areas, where human-AI collaboration would outperform the traditional consensus process of human raters in efficiency. DESIGN: Five LLMs (Claude-3-Opus, Claude-2, GPT-4, GPT-3.5, Mixtral-8x22B) assessed 112 systematic reviews applying the PRISMA and AMSTAR criteria, and 56 randomized controlled trials applying PRECIS-2. We quantified agreement between human consensus and (1) individual human raters; (2) individual LLMs; (3) combined LLMs approach; (4) human-AI collaboration. Ratings were marked as deferred (undecided) in case of inconsistency between combined LLMs or between the human rater and the LLM. RESULTS: Individual human rater accuracy was 89% for PRISMA and AMSTAR, and 75% for PRECIS-2. Individual LLM accuracy was ranging from 63% (GPT-3.5) to 70% (Claude-3-Opus) for PRISMA, 53% (GPT-3.5) to 74% (Claude-3-Opus) for AMSTAR, and 38% (GPT-4) to 55% (GPT-3.5) for PRECIS-2. Combined LLM ratings led to accuracies of 75-88% for PRISMA (4-74% deferred), 74-89% for AMSTAR (6-84% deferred), and 64-79% for PRECIS-2 (29-88% deferred). Human-AI collaboration resulted in the best accuracies from 89-96% for PRISMA (25/35% deferred), 91-95% for AMSTAR (27/30% deferred), and 80-86% for PRECIS-2 (76/71% deferred). CONCLUSIONS: Current LLMs alone appraised evidence worse than humans. Human-AI collaboration may reduce workload for the second human rater for the assessment of reporting (PRISMA) and methodological rigor (AMSTAR) but not for complex tasks such as PRECIS-2.

17.

Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation.

Wysocka, Magdalena; Wysocki, Oskar; Delmas, Maxime; Mutel, Vincent; Freitas, André.

J Biomed Inform ; : 104724, 2024 Sep 12.

Artículo en Inglés | MEDLINE | ID: mdl-39277154

RESUMEN

OBJECTIVE: The paper introduces a framework for the evaluation of the encoding of factual scientific knowledge, designed to streamline the manual evaluation process typically conducted by domain experts. Inferring over and extracting information from Large Language Models (LLMs) trained on a large corpus of scientific literature can potentially define a step change in biomedical discovery, reducing the barriers for accessing and integrating existing medical evidence. This work explores the potential of LLMs for dialoguing with biomedical background knowledge, using the context of antibiotic discovery. METHODS: The framework involves of three evaluation steps, each assessing different aspects sequentially: fluency, prompt alignment, semantic coherence, factual knowledge, and specificity of the generated responses. By splitting these tasks between non-experts and experts, the framework reduces the effort required from the latter. The work provides a systematic assessment on the ability of eleven state-of-the-art models LLMs, including ChatGPT, GPT-4 and Llama 2, in two prompting-based tasks: chemical compound definition generation and chemical compound-fungus relation determination. RESULTS: Although recent models have improved in fluency, factual accuracy is still low and models are biased towards over-represented entities. The ability of LLMs to serve as biomedical knowledge bases is questioned, and the need for additional systematic evaluation frameworks is highlighted. CONCLUSION: While LLMs are currently not fit for purpose to be used as biomedical factual knowledge bases in a zero-shot setting, there is a promising emerging property in the direction of factuality as the models become domain specialised, scale-up in size and level of human feedback.

18.

ChatGPT is no nutrition encyclopedia nor knowledge base.

Chatelan, Angeline; Fonta, Pierre-Alexandre.

Clin Nutr ESPEN ; 64: 26-27, 2024 Sep 11.

Artículo en Inglés | MEDLINE | ID: mdl-39270932

19.

Colour/shape-taste correspondences across three languages in ChatGPT.

Motoki, Kosuke; Spence, Charles; Velasco, Carlos.

Cognition ; 253: 105936, 2024 Dec.

Artículo en Inglés | MEDLINE | ID: mdl-39217782

RESUMEN

Crossmodal correspondences, the tendency for a sensory feature / attribute in one sensory modality (either physically present or merely imagined), to be associated with a sensory feature in another sensory modality, have been studied extensively, revealing consistent patterns, such as sweet tastes being associated with pink colours and round shapes across languages. The present research explores whether such correspondences are captured by ChatGPT, a large language model developed by OpenAI. Across twelve studies, this research investigates colour/shapes-taste crossmodal correspondences in ChatGPT-3.5 and -4o, focusing on associations between shapes/colours and the five basic tastes across three languages (English, Japanese, and Spanish). Studies 1A-F examined taste-shape associations, using prompts in three languages to assess ChatGPT's association of round and angular shapes with the five basic tastes. The results indicated significant, consistent, associations between shape and taste, with, for example, round shapes strongly associated with sweet/umami tastes and angular shapes with bitter/salty/sour tastes. The magnitude of shape-taste matching appears to be greater in ChatGPT-4o than in ChatGPT-3.5, and ChatGPT prompted in English and Spanish than ChatGPT prompted in Japanese. Studies 2A-F focused on colour-taste correspondences, using ChatGPT to assess associations between eleven colours and the five basic tastes. The results indicated that ChatGPT-4o, but not ChatGPT-3.5, generally replicates the patterns of colour-taste correspondences that have previously been observed in human participants. Specifically, ChatGPT-4o associates sweet tastes with pink, sour with yellow, salty with white/blue, bitter with black, and umami with red across languages. However, the magnitude/similarity of shape/colour-taste matching observed in ChatGPT-4o appears to be more pronounced (i.e., having little variance, large mean difference), which does not adequately reflect the subtle nuances typically seen in human shape/colour-taste correspondences. These findings suggest that ChatGPT captures colour/shapes-taste correspondences, with language- and GPT version-specific variations, albeit with some differences when compared to previous studies involving human participants. These findings contribute valuable knowledge to the field of crossmodal correspondences, explore the possibility of generative AI that resembles human perceptual systems and cognition across languages, and provide insight into the development and evolution of generative AI systems that capture human crossmodal correspondences.

Asunto(s)

Percepción de Color , Percepción del Gusto , Humanos , Percepción de Color/fisiología , Percepción del Gusto/fisiología , Adulto , Femenino , Masculino , Adulto Joven , Percepción de Forma/fisiología , Lenguaje , Gusto/fisiología

20.

ReconGPT: A novel artificial intelligence tool and its potential use in post-Mohs reconstructive decision-making.

Jairath, Neil; Manduca, Sophia; Que, Syril Keena T.

J Am Acad Dermatol ; 2024 Aug 31.

Artículo en Inglés | MEDLINE | ID: mdl-39222879

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

Asunto(s)

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA