SparkText: Biomedical Text Mining on Big Data Framework.

Ye, Zhan; Tafti, Ahmad P; He, Karen Y; Wang, Kai; He, Max M

Affiliation

Ye Z; Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, WI, 54449, United States of America.
Tafti AP; Center for Human Genetics, Marshfield Clinic Research Foundation, Marshfield, WI, 54449, United States of America.
He KY; Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, 53211, United States of America.
Wang K; Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, 44106, United States of America.
He MM; Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, CA, 90089, United States of America.

PLoS One; 11(9): e0162721, 2016.

Article in English | MEDLINE | ID: mdl-27685652

ABSTRACT

BACKGROUND: Many new biomedical research articles are published every day, accumulating rich information on genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining of large-scale scientific literature can uncover novel knowledge that improves the understanding of human diseases and the quality of disease diagnosis, prevention, and treatment.

RESULTS: In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes.

CONCLUSIONS: This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time updates from new articles published daily. SparkText can be extended to other areas of biomedical research.
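The abstract names Naïve Bayes as one of the classifiers used to predict a cancer type from article text. As an illustration of that general technique only, the following is a minimal sketch of a multinomial Naïve Bayes classifier over bag-of-words counts with add-one (Laplace) smoothing. It is not the SparkText implementation, which runs on Apache Spark; the function names (`tokenize`, `train_naive_bayes`, `predict`) and the six toy snippets in `TOY_DOCS` are invented for this example and are not taken from the study's data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic runs (a simple bag of words)."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior for the text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    # Words never seen in training carry no information, so skip them.
    tokens = [t for t in tokenize(text) if t in vocab]
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        # Log prior: fraction of training documents with this label.
        score = math.log(class_docs[label] / total_docs)
        # Add-one smoothing: each vocabulary word gets one pseudo-count.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy snippets, invented for illustration only.
TOY_DOCS = [
    ("BRCA1 mutation carriers and mammography screening", "breast"),
    ("tamoxifen therapy after mastectomy", "breast"),
    ("PSA testing and androgen deprivation", "prostate"),
    ("radical prostatectomy outcomes with PSA follow-up", "prostate"),
    ("EGFR inhibitors in smokers with pulmonary nodules", "lung"),
    ("bronchoscopy findings in pulmonary adenocarcinoma", "lung"),
]

model = train_naive_bayes(TOY_DOCS)
print(predict(model, "elevated PSA after prostatectomy"))
```

A real pipeline at the scale described in the abstract would replace the in-memory counters with distributed feature extraction and model training; this sketch only shows the scoring rule.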

Full text: 1 | Collection: 01-international | Database: MEDLINE | Study type: Prognostic studies | Language: English | Journal: PLoS One | Journal subject: SCIENCE / MEDICINE | Year: 2016 | Document type: Article | Affiliation country: United States | Country of publication: United States