Improved high-dimensional prediction with Random Forests by the use of co-data.

Te Beest, Dennis E; Mes, Steven W; Wilting, Saskia M; Brakenhoff, Ruud H; van de Wiel, Mark A

Affiliation

Te Beest DE; Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, 1007 MB, The Netherlands.
Mes SW; Department of Otolaryngology-Head and Neck Surgery, VU University Medical Center, Amsterdam, 1007 MB, The Netherlands.
Wilting SM; Department of Medical Oncology, Erasmus MC Cancer Institute, Erasmus University Medical Center, Rotterdam, 3015 CE, The Netherlands.
Brakenhoff RH; Department of Otolaryngology-Head and Neck Surgery, VU University Medical Center, Amsterdam, 1007 MB, The Netherlands.
van de Wiel MA; Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, 1007 MB, The Netherlands. mark.vdwiel@vumc.nl.

BMC Bioinformatics; 18(1): 584, 2017-12-28.

Article in English | MEDLINE | ID: mdl-29281963

ABSTRACT

BACKGROUND: Prediction in high-dimensional settings is difficult because the number of variables is large relative to the sample size. We demonstrate how auxiliary 'co-data' can be used to improve the performance of a Random Forest in such a setting.

RESULTS: Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities that are used to draw candidate variables with co-data moderated sampling probabilities. Co-data are defined here as any type of information that is available on the variables of the primary data but does not use its response labels. Inspired by empirical Bayes, these moderated sampling probabilities are learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis from gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer from methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study.

CONCLUSION: The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest.
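
The central idea of the abstract, drawing candidate variables with co-data moderated rather than uniform probabilities, can be illustrated with a minimal Python sketch. This is not the CoRF implementation: the abstract states that the probabilities are learned from the data in an empirical-Bayes spirit, whereas here they are simply assumed to be proportional to a score derived from external p-values. The function names `codata_sampling_probs` and `draw_candidates`, the `floor` parameter, and the example p-values are all hypothetical.

```python
import numpy as np


def codata_sampling_probs(codata_scores, floor=1e-3):
    """Map non-negative co-data scores to sampling probabilities.

    Assumption for illustration only: the probability of a variable is
    proportional to its score. The floor keeps every variable drawable.
    """
    scores = np.clip(np.asarray(codata_scores, dtype=float), floor, None)
    return scores / scores.sum()


def draw_candidates(probs, mtry, rng):
    """Draw mtry distinct candidate variables for one tree split."""
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


rng = np.random.default_rng(0)

# Hypothetical external p-values, one per variable (e.g. per gene).
external_pvalues = np.array([0.001, 0.5, 0.9, 0.02, 0.7, 0.3])

# Smaller p-value -> larger score -> higher chance of being a candidate.
scores = -np.log10(external_pvalues)
probs = codata_sampling_probs(scores)

# A standard Random Forest would use uniform probabilities here instead.
candidates = draw_candidates(probs, mtry=3, rng=rng)
```

With uniform probabilities every variable is equally likely to be offered at a split; with the moderated probabilities, variables supported by the co-data are offered more often, which is the mechanism the abstract credits for the improved predictive performance.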

Subject(s)

Algorithms; Databases as Topic; Bayes Theorem; Humans; Neoplasms/genetics; ROC Curve; Time Factors

Keywords

Classification; DNA copy number; Gene expression; Methylation; Prior information; Random forest

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Algorithms / Databases as Topic | Type of study: Clinical_trials / Prognostic_studies / Risk_factors_studies | Limit: Humans | Language: En | Journal: BMC Bioinformatics | Journal subject: MEDICAL INFORMATICS | Year: 2017 | Document type: Article | Country of affiliation: Netherlands | Country of publication: United Kingdom