Search | BVS Regional Portal

Representation of EHR data for predictive modeling: a comparison between UMLS and other terminologies.

Rasmy, Laila; Tiryaki, Firat; Zhou, Yujia; Xiang, Yang; Tao, Cui; Xu, Hua; Zhi, Degui.

J Am Med Inform Assoc ; 27(10): 1593-1599, 2020 10 01.

Article in English | MEDLINE | ID: mdl-32930711

ABSTRACT

OBJECTIVE: Predictive disease modeling using electronic health record data is a growing field. Although clinical data in their raw form can be used directly for predictive modeling, it is a common practice to map data to standard terminologies to facilitate data aggregation and reuse. There is, however, a lack of systematic investigation of how different representations could affect the performance of predictive models, especially in the context of machine learning and deep learning. MATERIALS AND METHODS: We projected the input diagnoses data in the Cerner HealthFacts database to Unified Medical Language System (UMLS) and 5 other terminologies, including CCS, CCSR, ICD-9, ICD-10, and PheWAS, and evaluated the prediction performances of these terminologies on 2 different tasks: the risk prediction of heart failure in diabetes patients and the risk prediction of pancreatic cancer. Two popular models were evaluated: logistic regression and a recurrent neural network. RESULTS: For logistic regression, using UMLS delivered the optimal area under the receiver operating characteristic curve (AUROC) results in both the heart failure in diabetes patients (81.15%) and pancreatic cancer (80.53%) tasks. For recurrent neural network, UMLS worked best for pancreatic cancer prediction (AUROC 82.24%), second only (AUROC 85.55%) to PheWAS (AUROC 85.87%) for prediction of heart failure in diabetes patients. DISCUSSION/CONCLUSION: In our experiments, terminologies with larger vocabularies and finer-grained representations were associated with better prediction performances. In particular, UMLS is consistently one of the best-performing ones. We believe that our work may help to inform better designs of predictive models, although further investigation is warranted.

Subject(s)

Electronic Health Records; Unified Medical Language System; Vocabulary, Controlled; Aged; Databases, Factual; Female; Humans; Male; Middle Aged; ROC Curve
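
As a reading aid for the AUROC figures reported in the abstract above, the following is a minimal sketch of how an AUROC can be computed from predicted risks and observed outcomes, using the rank-sum (Mann-Whitney U) identity. It is not the authors' evaluation code; the function name `auroc` and the toy inputs are assumptions made for illustration.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1 outcomes; scores: predicted risks in the same order.
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied scores.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative case")

    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u_statistic = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


# Hypothetical toy data: two cases, two controls.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The value can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, which is why 0.5 corresponds to chance-level ranking.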

Efficient and Accurate Extracting of Unstructured EHRs on Cancer Therapy Responses for the Development of RECIST Natural Language Processing Tools: Part I, the Corpus.

Li, Yalun; Luo, Yung-Hung; Wampfler, Jason A; Rubinstein, Samuel M; Tiryaki, Firat; Ashok, Kumar; Warner, Jeremy L; Xu, Hua; Yang, Ping.

JCO Clin Cancer Inform ; 4: 383-391, 2020 05.

Article in English | MEDLINE | ID: mdl-32364754

ABSTRACT

PURPOSE: Electronic health records (EHRs) are created primarily for nonresearch purposes; thus, the amounts of data are enormous, and the data are crude, heterogeneous, incomplete, and largely unstructured, presenting challenges to effective analyses for timely, reliable results. Particularly, research dealing with clinical notes relevant to patient care and outcome is seldom conducted, due to the complexity of data extraction and accurate annotation in the past. RECIST is a set of widely accepted research criteria to evaluate tumor response in patients undergoing antineoplastic therapy. The aim for this study was to identify textual sources for RECIST information in EHRs and to develop a corpus of pharmacotherapy and response entities for development of natural language processing tools. METHODS: We focused on pharmacotherapies and patient responses, using 55,120 medical notes (n = 72 types) in Mayo Clinic's EHRs from 622 randomly selected patients who signed authorization for research. Using the Multidocument Annotation Environment tool, we applied and evaluated predefined keywords, and time interval and note-type filters for identifying RECIST information and established a gold standard data set for patient outcome research. RESULTS: Key words reduced clinical notes to 37,406, and using four note types within 12 months postdiagnosis further reduced the number of notes to 5,005 that were manually annotated, which covered 97.9% of all cases (n = 609 of 622). The resulting data set of 609 cases (n = 503 for training and n = 106 for validation purpose), contains 736 fully annotated, deidentified clinical notes, with pharmacotherapies and four response end points: complete response, partial response, stable disease, and progressive disease. This resource is readily expandable to specific drugs, regimens, and most solid tumors. 
CONCLUSION: We have established a gold standard data set to accommodate development of biomedical informatics tools in accelerating research into antineoplastic therapeutic response.

Subject(s)

Natural Language Processing; Neoplasms; Electronic Health Records; Humans; Neoplasms/therapy; Response Evaluation Criteria in Solid Tumors

A study of deep learning approaches for medication and adverse drug event extraction from clinical text.

Wei, Qiang; Ji, Zongcheng; Li, Zhiheng; Du, Jingcheng; Wang, Jingqi; Xu, Jun; Xiang, Yang; Tiryaki, Firat; Wu, Stephen; Zhang, Yaoyun; Tao, Cui; Xu, Hua.

J Am Med Inform Assoc ; 27(1): 13-21, 2020 01 01.

Article in English | MEDLINE | ID: mdl-31135882

ABSTRACT

OBJECTIVE: This article presents our approaches to extraction of medications and associated adverse drug events (ADEs) from clinical documents, which is the second track of the 2018 National NLP Clinical Challenges (n2c2) shared task. MATERIALS AND METHODS: The clinical corpus used in this study was from the MIMIC-III database and the organizers annotated 303 documents for training and 202 for testing. Our system consists of 2 components: a named entity recognition (NER) and a relation classification (RC) component. For each component, we implemented deep learning-based approaches (eg, BI-LSTM-CRF) and compared them with traditional machine learning approaches, namely, conditional random fields for NER and support vector machines for RC, respectively. In addition, we developed a deep learning-based joint model that recognizes ADEs and their relations to medications in 1 step using a sequence labeling approach. To further improve the performance, we also investigated different ensemble approaches to generating optimal performance by combining outputs from multiple approaches. RESULTS: Our best-performing systems achieved F1 scores of 93.45% for NER, 96.30% for RC, and 89.05% for end-to-end evaluation, which ranked #2, #1, and #1 among all participants, respectively. Additional evaluations show that the deep learning-based approaches did outperform traditional machine learning algorithms in both NER and RC. The joint model that simultaneously recognizes ADEs and their relations to medications also achieved the best performance on RC, indicating its promise for relation extraction. CONCLUSION: In this study, we developed deep learning approaches for extracting medications and their attributes such as ADEs, and demonstrated its superior performance compared with traditional machine learning algorithms, indicating its uses in broader NER and RC tasks in the medical domain.

Subject(s)

Deep Learning; Drug-Related Side Effects and Adverse Reactions; Electronic Health Records; Information Storage and Retrieval/methods; Natural Language Processing; Algorithms; Humans; Machine Learning; Narration; Pharmaceutical Preparations
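
The F1 scores quoted in the abstract above combine precision and recall over extracted items. A minimal sketch of that calculation follows, assuming exact-match scoring over (document, start, end, type) tuples. The shared task's official scorer may apply different matching rules; the function name and the toy annotations are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall, and F1 between two collections of annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (document id, start offset, end offset, entity type).
gold = {("doc1", 0, 7, "Drug"), ("doc1", 20, 24, "ADE"), ("doc1", 30, 38, "Drug")}
predicted = {("doc1", 0, 7, "Drug"), ("doc1", 20, 24, "Drug"), ("doc1", 30, 38, "Drug")}

# The second prediction has the right span but the wrong type, so it does not count.
print(precision_recall_f1(gold, predicted))
```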

Parsing clinical text using the state-of-the-art deep learning based parsers: a systematic comparison.

Zhang, Yaoyun; Tiryaki, Firat; Jiang, Min; Xu, Hua.

BMC Med Inform Decis Mak ; 19(Suppl 3): 77, 2019 04 04.

Article in English | MEDLINE | ID: mdl-30943955

ABSTRACT

BACKGROUND: A shareable repository of clinical notes is critical for advancing natural language processing (NLP) research, and therefore a goal of many NLP researchers is to create a shareable repository of clinical notes, that has breadth (from multiple institutions) as well as depth (as much individual data as possible). METHODS: We aimed to assess the degree to which individuals would be willing to contribute their health data to such a repository. A compact e-survey probed willingness to share demographic and clinical data categories. Participants were faculty, staff, and students in two geographically diverse major medical centers (Utah and New York). Such a sample could be expected to respond like a typical potential participant from the general public who is given complete and fully informed consent about the pros and cons of participating in a research study. RESULTS: 2140 respondents completed the surveys. 56% of respondents were "somewhat/definitely willing" to share clinical data with identifiers, while 89% of respondents were "somewhat (17%) /definitely willing (72%)" to share without identifiers. Results were consistent across gender, age, and education, but there were some differences by geographical region. Individuals were most reluctant (50-74%) sharing mental health, substance abuse, and domestic violence data. CONCLUSIONS: We conclude that a substantial fraction of potential patient participants, once educated about risks and benefits, would be willing to donate de-identified clinical data to a shared research repository. A slight majority even would be willing to share absent de-identification, suggesting that perceptions about data misuse are not a major concern. Such a repository of clinical notes should be invaluable for clinical NLP research and advancement.

Subject(s)

Deep Learning; Information Dissemination; Natural Language Processing; Adult; Biomedical Research; Confidentiality; Data Anonymization; Databases as Topic; Female; Humans; Male; New York; Patient Participation; Surveys and Questionnaires

Time-sensitive clinical concept embeddings learned from large electronic health records.

Xiang, Yang; Xu, Jun; Si, Yuqi; Li, Zhiheng; Rasmy, Laila; Zhou, Yujia; Tiryaki, Firat; Li, Fang; Zhang, Yaoyun; Wu, Yonghui; Jiang, Xiaoqian; Zheng, Wenjin Jim; Zhi, Degui; Tao, Cui; Xu, Hua.

BMC Med Inform Decis Mak ; 19(Suppl 2): 58, 2019 04 09.

Article in English | MEDLINE | ID: mdl-30961579

ABSTRACT

BACKGROUND: Learning distributional representation of clinical concepts (e.g., diseases, drugs, and labs) is an important research area of deep learning in the medical domain. However, many existing relevant methods do not consider temporal dependencies along the longitudinal sequence of a patient's records, which may lead to incorrect selection of contexts. METHODS: To address this issue, we extended three popular concept embedding learning methods: word2vec, positive pointwise mutual information (PPMI) and FastText, to consider time-sensitive information. We then trained them on a large electronic health records (EHR) database containing about 50 million patients to generate concept embeddings and evaluated them for both intrinsic evaluations focusing on concept similarity measure and an extrinsic evaluation to assess the use of generated concept embeddings in the task of predicting disease onset. RESULTS: Our experiments show that embeddings learned from information within one visit (time window zero) improve performance on the concept similarity measure and the FastText algorithm usually had better performance than the other two algorithms. For the predictive modeling task, the optimal result was achieved by word2vec embeddings with a 30-day sliding window. CONCLUSIONS: Considering time constraints is important in training clinical concept embeddings. We expect they can benefit a series of downstream applications.

Subject(s)

Deep Learning; Electronic Health Records; Algorithms; Databases, Factual; Humans; Information Storage and Retrieval; Time Factors
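
To make the idea of a time-sensitive context concrete, here is a minimal sketch of PPMI computed over co-occurrences restricted to a time window. The abstract above does not spell out the authors' exact formulation, so everything here is an assumption for illustration: the function name `windowed_ppmi`, the (day, code) event layout, the rule that two codes co-occur when their days differ by at most `window_days`, and the toy codes.

```python
import math
from collections import Counter
from itertools import combinations


def windowed_ppmi(patients, window_days=0):
    """Positive PMI between codes that co-occur within a time window.

    patients: list of patient records, each a list of (day, code) events.
    Two distinct codes co-occur when they belong to the same patient and
    their days differ by at most window_days (0 means the same day).
    """
    pair_counts = Counter()
    for events in patients:
        for (day_a, code_a), (day_b, code_b) in combinations(events, 2):
            if code_a != code_b and abs(day_a - day_b) <= window_days:
                # Count both directions so the matrix is symmetric.
                pair_counts[(code_a, code_b)] += 1
                pair_counts[(code_b, code_a)] += 1

    total = sum(pair_counts.values())
    marginals = Counter()
    for (code_a, _), count in pair_counts.items():
        marginals[code_a] += count

    ppmi = {}
    for (code_a, code_b), count in pair_counts.items():
        # PMI = log2( p(a, b) / (p(a) * p(b)) ), keeping only positive values.
        pmi = math.log2(count * total / (marginals[code_a] * marginals[code_b]))
        if pmi > 0:
            ppmi[(code_a, code_b)] = pmi
    return ppmi


# Hypothetical toy records; the codes are illustrative only.
patients = [
    [(0, "E11"), (0, "I50"), (400, "C25")],
    [(0, "E11"), (0, "I50")],
    [(0, "C25"), (0, "K86")],
]
same_day = windowed_ppmi(patients, window_days=0)
print(("E11", "C25") in same_day)  # False: the two events are 400 days apart
```

Widening `window_days` admits more distant events as context, which is the trade-off the abstract reports on when comparing window sizes.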

Relation Extraction from Clinical Narratives Using Pre-trained Language Models.

Wei, Qiang; Ji, Zongcheng; Si, Yuqi; Du, Jingcheng; Wang, Jingqi; Tiryaki, Firat; Wu, Stephen; Tao, Cui; Roberts, Kirk; Xu, Hua.

AMIA Annu Symp Proc ; 2019: 1236-1245, 2019.

Article in English | MEDLINE | ID: mdl-32308921

ABSTRACT

Natural language processing (NLP) is useful for extracting information from clinical narratives, and both traditional machine learning methods and more-recent deep learning methods have been successful in various clinical NLP tasks. These methods often depend on traditional word embeddings that are outputs of language models (LMs). Recently, methods that are directly based on pre-trained language models themselves, followed by fine-tuning on the LMs (e.g. the Bidirectional Encoder Representations from Transformers (BERT)), have achieved state-of-the-art performance on many NLP tasks. Despite their success in the open domain and biomedical literature, these pre-trained LMs have not yet been applied to the clinical relation extraction (RE) task. In this study, we developed two different implementations of the BERT model for clinical RE tasks. Our results show that our tuned LMs outperformed previous state-of-the-art RE systems in two shared tasks, which demonstrates the potential of LM-based methods on the RE task.

Subject(s)

Information Storage and Retrieval/methods; Machine Learning; Natural Language Processing; Datasets as Topic; Humans; Narration; Semantics

DataMed - an open source discovery index for finding biomedical datasets.

Chen, Xiaoling; Gururaj, Anupama E; Ozyurt, Burak; Liu, Ruiling; Soysal, Ergin; Cohen, Trevor; Tiryaki, Firat; Li, Yueling; Zong, Nansu; Jiang, Min; Rogith, Deevakar; Salimi, Mandana; Kim, Hyeon-Eui; Rocca-Serra, Philippe; Gonzalez-Beltran, Alejandra; Farcas, Claudiu; Johnson, Todd; Margolis, Ron; Alter, George; Sansone, Susanna-Assunta; Fore, Ian M; Ohno-Machado, Lucila; Grethe, Jeffrey S; Xu, Hua.

J Am Med Inform Assoc ; 25(3): 300-308, 2018 Mar 01.

Article in English | MEDLINE | ID: mdl-29346583

ABSTRACT

OBJECTIVE: Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain. MATERIALS AND METHODS: DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, is developed through the National Institutes of Health-funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium. It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries. In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine. RESULTS AND CONCLUSION: Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and core elements of DATS had varied frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the fraction of relevant results in the top 10 search results) of 0.6022, by implementing advanced natural language processing and terminology services. Currently, we have made the DataMed system publicly available as an open source package for the biomedical community.
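
For readers unfamiliar with the P@10 metric cited in the abstract above, a minimal sketch of precision at k for a single query follows. It is not DataMed's evaluation code; the function name and the dataset identifiers are hypothetical. A benchmark figure such as the one reported would typically be an average of this value over many queries.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    top_k = ranked_ids[:k]
    # Dividing by k (not by len(top_k)) penalizes result lists shorter than k.
    return sum(1 for item in top_k if item in relevant) / k


# Hypothetical ranking for one query: 6 of the top 10 datasets are relevant.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d4", "d7", "d9", "d10"}
print(precision_at_k(ranking, relevant))
```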
