Text classification performance: is the sample size the only factor to be considered?

Figueroa, Rosa L; Zeng-Treitler, Qing

Affiliation

Figueroa RL; Departamento de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de Concepción, Chile.

Stud Health Technol Inform ; 192: 1193, 2013.

Article in English | MEDLINE | ID: mdl-23920967

ABSTRACT

The use of text mining and supervised machine learning algorithms on biomedical databases has become increasingly common. However, a question remains: how much data must be annotated to create a suitable training set for a machine learning classifier? In prior research on active learning in medical text classification, we found evidence that not only sample size but also intrinsic characteristics of the texts being analyzed, such as vocabulary size and document length, may influence the resulting classifier's performance. This study is an attempt to create a regression model that predicts performance from sample size and other text features. While the model must be trained on existing datasets, we believe that, once it is built, it is feasible to predict performance on new datasets without obtaining annotations for them.
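As an illustration only, the idea of regressing classifier performance on sample size and text features could be sketched as follows. The abstract does not specify the regression form, the feature definitions, or the performance metric, so everything here is an assumption: the function names are hypothetical, tokenization is plain whitespace splitting, the features are limited to the three the abstract names (sample size, vocabulary size, mean document length), and the model is ordinary least squares solved through the normal equations.

```python
def text_features(docs):
    """Return [sample size, vocabulary size, mean document length in tokens].

    Assumption: lowercase whitespace tokenization stands in for whatever
    preprocessing the study actually used.
    """
    tokens = [d.lower().split() for d in docs]
    vocab = {t for doc in tokens for t in doc}
    mean_len = sum(len(doc) for doc in tokens) / len(tokens)
    return [float(len(docs)), float(len(vocab)), mean_len]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: collinear features or too few datasets")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_performance_model(feature_rows, scores):
    """Ordinary least squares: one feature row and one observed score per existing dataset.

    Returns [intercept, coefficient per feature].
    """
    X = [[1.0] + list(row) for row in feature_rows]  # prepend intercept column
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, scores)) for i in range(k)]
    return solve(xtx, xty)


def predict_performance(coef, features):
    """Predict a score for a dataset from its features alone (no annotations needed)."""
    return coef[0] + sum(c * f for c, f in zip(coef[1:], features))
```

In this sketch, `fit_performance_model` would be trained on feature rows and measured scores from already annotated datasets, and `predict_performance` would then be applied to `text_features(new_docs)` for an unannotated collection. A linear form is the simplest choice; learning curves are often nonlinear in sample size, so a transformed predictor may fit better in practice.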

Subjects

Artificial Intelligence; Documentation/classification; Documentation/statistics & numerical data; Meaningful Use/statistics & numerical data; Natural Language Processing; Terminology as Topic; Vocabulary, Controlled; Data Curation/methods; Data Mining/statistics & numerical data; Pattern Recognition, Automated/methods; Pattern Recognition, Automated/statistics & numerical data; Sample Size

Collections: 01-international; Database: MEDLINE; Main subject: Natural Language Processing / Artificial Intelligence / Vocabulary, Controlled / Documentation / Meaningful Use / Terminology as Topic; Type of study: Prognostic_studies; Language: En; Journal: Stud Health Technol Inform; Journal subject: MEDICAL INFORMATICS / HEALTH SERVICES RESEARCH; Year of publication: 2013; Document type: Article; Affiliation country: Chile; Publication country: Netherlands