Active learning for clinical text classification: is it better than random sampling?
J Am Med Inform Assoc
; 19(5): 809-16, 2012.
Article
em En
| MEDLINE
| ID: mdl-22707743
OBJECTIVE: This study explores active learning algorithms as a way to reduce the requirements for large training sets in medical text classification tasks. DESIGN: Three existing active learning algorithms (distance-based (DIST), diversity-based (DIV), and a combination of both (CMB)) were used to classify text from five datasets. The performance of these algorithms was compared to that of passive learning on the five datasets. We then conducted a novel investigation of the interaction between dataset characteristics and the performance results. MEASUREMENTS: Classification accuracy and area under receiver operating characteristics (ROC) curves for each algorithm at different sample sizes were generated. The performance of active learning algorithms was compared with that of passive learning using a weighted mean of paired differences. To determine why the performance varies on different datasets, we measured the diversity and uncertainty of each dataset using relative entropy and correlated the results with the performance differences. RESULTS: The DIST and CMB algorithms performed better than passive learning. With a statistical significance level set at 0.05, DIST outperformed passive learning in all five datasets, while CMB was found to be better than passive learning in four datasets. We found strong correlations between the dataset diversity and the DIV performance, as well as the dataset uncertainty and the performance of the DIST algorithm. CONCLUSION: For medical text classification, appropriate active learning algorithms can yield performance comparable to that of passive learning with considerably smaller training sets. In particular, our results suggest that DIV performs better on data with higher diversity and DIST on data with lower uncertainty.
Texto completo:
1
Coleções:
01-internacional
Base de dados:
MEDLINE
Assunto principal:
Processamento de Linguagem Natural
/
Mineração de Dados
Tipo de estudo:
Clinical_trials
/
Prognostic_studies
Limite:
Humans
Idioma:
En
Revista:
J Am Med Inform Assoc
Assunto da revista:
INFORMATICA MEDICA
Ano de publicação:
2012
Tipo de documento:
Article
País de afiliação:
Chile
País de publicação:
Reino Unido