ReRep: computational detection of repetitive sequences in genome survey sequences (GSS).

Otto, Thomas D; Gomes, Leonardo H F; Alves-Ferreira, Marcelo; de Miranda, Antonio B; Degrave, Wim M

Affiliation

Otto TD; Laboratory for Functional Genomics and Bioinformatics, IOC, Fiocruz, Rio de Janeiro, Brazil. otto@fiocruz.br

BMC Bioinformatics; 9: 366, 2008 Sep 09.

Article in En | MEDLINE | ID: mdl-18782453

ABSTRACT

BACKGROUND: Genome survey sequences (GSS) offer a preliminary global view of a genome since, unlike ESTs, they cover coding as well as non-coding DNA and include repetitive regions of the genome. A more precise estimate of the nature, quantity and variability of repetitive sequences very early in a genome sequencing project is of considerable importance, as such data strongly influence the estimation of genome coverage, library quality and progress in scaffold construction. Eliminating repetitive sequences from the initial assembly process is also important to avoid errors and unnecessary complexity. In addition, repetitive sequences are of interest in a variety of other studies, for instance as molecular markers.

RESULTS: We designed and implemented a straightforward pipeline called ReRep, which combines bioinformatics tools for identifying repetitive structures in a GSS dataset. In a case study, we first applied the pipeline to a set of 970 GSSs, sequenced in our laboratory from the human pathogen Leishmania braziliensis, a causative agent of leishmaniasis, an important public health problem in Brazil. We also verified the applicability of ReRep to new sequencing technologies using a set of 454 reads from Escherichia coli. The behaviour of several parameters of the algorithm was evaluated, and suggestions are made for tuning the analysis.

CONCLUSION: The ReRep approach for identification of repetitive elements in GSS datasets proved to be straightforward and efficient. Several potential repetitive sequences were found in an L. braziliensis GSS dataset generated in our laboratory, and were further validated by analysis of a more complete genomic dataset from the EMBL and Sanger Centre databases. ReRep also identified most of the E. coli K12 repeats prior to assembly in an example dataset obtained by automated sequencing using 454 technology. The parameters controlling the algorithm behaved consistently and may be tuned to the properties of the dataset, in particular to the length of the sequencing reads and the genome coverage. ReRep is freely available for academic use at http://bioinfo.pdtis.fiocruz.br/ReRep/.
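
The dependence on read length and genome coverage mentioned in the abstract can be illustrated with a rough calculation. The Python sketch below is not part of ReRep and does not reproduce its algorithm; it only applies a standard Poisson approximation to ask how often a read from a single-copy region would overlap other reads by chance. The function names, the read length of 500 bp and the genome size of 32 Mb are hypothetical values chosen for illustration; only the number of reads (970) comes from the abstract.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def prob_at_least_k_overlaps(num_reads, read_length, genome_size, k, min_overlap=0):
    """Chance that a single-copy read overlaps at least k other reads.

    Assumes read start positions are uniform over the genome and ignores
    strand and chromosome ends. Another read overlaps by at least
    min_overlap bases when its start lies within (read_length - min_overlap)
    bases on either side of this read's start.
    """
    window = 2 * (read_length - min_overlap)
    lam = (num_reads - 1) * window / genome_size  # expected number of overlapping reads
    # Poisson probability of seeing fewer than k overlapping reads
    below_k = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below_k


if __name__ == "__main__":
    # 970 reads is taken from the abstract; the other two values are assumptions.
    n_reads, read_len, genome = 970, 500, 32_000_000
    print("coverage:", expected_coverage(n_reads, read_len, genome))
    for k in (1, 2, 3):
        print(k, prob_at_least_k_overlaps(n_reads, read_len, genome, k, min_overlap=50))
```

Under these assumptions the chance overlap probability falls steeply as k grows, so at low coverage a read with several similar partners in the dataset is unlikely to come from single-copy DNA. At higher coverage or with longer reads the same threshold would be reached by chance more often, which is one reason a threshold of this kind would need to be adjusted to the dataset.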

Subjects

Algorithms; Chromosome Mapping/methods; Genome/genetics; Repetitive Sequences, Nucleic Acid/genetics; Sequence Analysis, DNA/methods; Software; Base Sequence; Molecular Sequence Data

Full text: 1
Collections: 01-international
Database: MEDLINE
Main subject: Algorithms / Software / Repetitive Sequences, Nucleic Acid / Chromosome Mapping / Genome / Sequence Analysis, DNA
Type of study: Diagnostic_studies
Language: En
Journal: BMC Bioinformatics
Journal subject: MEDICAL INFORMATICS
Year of publication: 2008
Document type: Article
Affiliation country: Brazil
Country of publication: United Kingdom