Active learning for clinical text classification: is it better than random sampling?

Figueroa, Rosa L; Zeng-Treitler, Qing; Ngo, Long H; Goryachev, Sergey; Wiechmann, Eduardo P

Affiliation

Figueroa RL; Departamento de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de Concepción, Concepción, Chile.

J Am Med Inform Assoc ; 19(5): 809-16, 2012.

Article in English | MEDLINE | ID: mdl-22707743

ABSTRACT

OBJECTIVE: This study explores active learning algorithms as a way to reduce the need for large training sets in medical text classification tasks.

DESIGN: Three existing active learning algorithms (distance-based (DIST), diversity-based (DIV), and a combination of both (CMB)) were used to classify text from five datasets. The performance of these algorithms was compared with that of passive learning on the same five datasets. We then conducted a novel investigation of the interaction between dataset characteristics and the performance results.

MEASUREMENTS: Classification accuracy and area under the receiver operating characteristic (ROC) curve were computed for each algorithm at different sample sizes. The performance of the active learning algorithms was compared with that of passive learning using a weighted mean of paired differences. To determine why performance varies across datasets, we measured the diversity and uncertainty of each dataset using relative entropy and correlated the results with the performance differences.

RESULTS: The DIST and CMB algorithms performed better than passive learning. With the statistical significance level set at 0.05, DIST outperformed passive learning on all five datasets, while CMB was better than passive learning on four datasets. We found strong correlations between dataset diversity and the performance of DIV, as well as between dataset uncertainty and the performance of DIST.

CONCLUSION: For medical text classification, appropriate active learning algorithms can yield performance comparable to that of passive learning with considerably smaller training sets. In particular, our results suggest that DIV performs better on data with higher diversity and DIST on data with lower uncertainty.
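The abstract does not spell out how DIST chooses which samples to label. As a rough illustration only, the sketch below assumes a common form of distance-based selection: given a linear classifier, query the unlabeled samples whose scores lie closest to the decision boundary, and contrast that with the random selection used in passive learning. The function names, the toy pool, and the weights `w` and `b` are all hypothetical and are not taken from the paper.

```python
import random


def margin(x, w, b):
    """Signed score w.x + b; its absolute value is proportional
    to the distance of x from the separating hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def select_by_distance(pool, w, b, k):
    """Active selection (assumed form of a distance-based strategy):
    indices of the k pool items closest to the decision boundary."""
    ranked = sorted(range(len(pool)), key=lambda i: abs(margin(pool[i], w, b)))
    return ranked[:k]


def select_at_random(pool, k, seed=0):
    """Passive selection: k distinct pool indices drawn uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(len(pool)), k)


# Toy unlabeled pool of 2-D feature vectors (hypothetical data).
pool = [(2.0, 1.0), (0.1, 0.0), (-1.5, 0.5), (-0.2, 0.1), (3.0, -1.0)]
w, b = (1.0, 0.0), 0.0  # hypothetical current classifier

active_batch = select_by_distance(pool, w, b, 2)
passive_batch = select_at_random(pool, 2)
```

In a full loop, the selected samples would be labeled, added to the training set, and the classifier retrained before the next round of selection; comparing accuracy at each training-set size for the two selectors mirrors the active-versus-passive comparison described in the abstract.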

Subjects

Data Mining/methods; Natural Language Processing; Algorithms; Artificial Intelligence; Humans; ROC Curve

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Natural Language Processing / Data Mining | Type of study: Clinical_trials / Prognostic_studies | Limits: Humans | Language: En | Journal: J Am Med Inform Assoc | Journal subject: MEDICAL INFORMATICS | Year of publication: 2012 | Document type: Article | Country of affiliation: Chile | Country of publication: United Kingdom
