Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics.

Ren, Jie; Song, Kai; Deng, Minghua; Reinert, Gesine; Cannon, Charles H; Sun, Fengzhu

Affiliations

Ren J; Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, USA.
Song K; School of Mathematical Sciences, Peking University, Beijing, China.
Deng M; School of Mathematical Sciences, Peking University, Beijing, China.
Reinert G; Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK.
Cannon CH; Department of Biological Sciences, Texas Tech University, TX 79409-3131, USA; Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Yunnan, China.
Sun F; Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, USA; Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China.

Bioinformatics; 32(7): 993-1000, 2016 04 01.

Article in English | MEDLINE | ID: mdl-26130573

ABSTRACT

MOTIVATION: Next-generation sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given through modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data.

RESULTS: Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate them using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species.

AVAILABILITY AND IMPLEMENTATION: Our implementation of the statistics developed here is available as R package 'NGS.MC' at http://www-rcf.usc.edu/~fsun/Programs/NGS-MC/NGS-MC.html

CONTACT: fsun@usc.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
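
The role of expected word counts under a Markov chain model, as described in the abstract, can be illustrated with a minimal Python sketch. This is not the authors' 'NGS.MC' implementation and does not reproduce their statistics. It assumes the standard plug-in estimator for the expected count of a word under an MC of order r, applied naively to counts pooled over reads, and it ignores read-boundary effects. The function names `count_words` and `expected_count` are hypothetical.

```python
from collections import Counter


def count_words(reads, k):
    """Count overlapping words of length k, pooled over all short reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def expected_count(word, reads, r):
    """Plug-in estimate of the expected count of `word` under an order-r MC.

    Assumed estimator (for 1 <= r < len(word)):
        product of counts of the (r+1)-words inside `word`
        divided by the product of counts of the interior r-words.
    Read-boundary effects are ignored, so this is only an approximation.
    """
    k = len(word)
    if not 1 <= r < k:
        raise ValueError("order r must satisfy 1 <= r < len(word)")
    longer = count_words(reads, r + 1)   # counts of (r+1)-words
    shorter = count_words(reads, r)      # counts of r-words
    estimate = 1.0
    for j in range(k - r):
        estimate *= longer[word[j:j + r + 1]]
    for j in range(1, k - r):
        denom = shorter[word[j:j + r]]
        if denom == 0:
            return 0.0                   # interior r-word never observed
        estimate /= denom
    return estimate


if __name__ == "__main__":
    toy_reads = ["ACGT", "ACGA"]         # toy data, not from the paper
    print(count_words(toy_reads, 3)["ACG"])      # observed count: 2
    print(expected_count("ACG", toy_reads, 1))   # N(AC)*N(CG)/N(C) = 2.0
```

Comparing observed and expected counts across candidate orders r is the kind of quantity a Chi-square-type statistic is built from; the distributional results for NGS reads are in the paper itself.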

Subject(s)

Genomics/methods; High-Throughput Nucleotide Sequencing; Markov Chains; Algorithms; Animals; Cluster Analysis; Computational Biology/methods; Genome; Models, Statistical; Vertebrates

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Markov Chains / Genomics / High-Throughput Nucleotide Sequencing | Type of study: Health_economic_evaluation / Risk_factors_studies | Limits: Animals | Language: English | Journal: Bioinformatics | Journal subject: MEDICAL INFORMATICS | Year: 2016 | Document type: Article | Country of affiliation: United States | Country of publication: United Kingdom