1.
Bioinformatics ; 40(3), 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-37963064

ABSTRACT

MOTIVATION: Single-nucleotide variants (SNVs) are the most common type of genetic variation in the human genome. Accurate and efficient detection of SNVs from next-generation sequencing (NGS) data is essential for various applications in genomics and personalized medicine. However, SNV calling methods usually suffer from high computational complexity and limited accuracy. There is therefore a need for new methods that overcome these limitations and provide fast, reliable results. RESULTS: We present EMVC-2, a novel method for SNV calling from NGS data. EMVC-2 uses a multi-class ensemble classification approach based on the expectation-maximization algorithm that infers, at each locus, the most likely genotype from multiple labels provided by different learners. The inferred variants are then validated by a decision tree that filters out unlikely ones. We evaluate EMVC-2 on several publicly available real human NGS datasets for which the set of SNVs is available, and demonstrate that, on average, it outperforms state-of-the-art variant callers in terms of accuracy and speed. AVAILABILITY AND IMPLEMENTATION: EMVC-2 is coded in C and Python, and is freely available for download at: https://github.com/guilledufort/EMVC-2. EMVC-2 is also available in Bioconda.
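The idea of inferring one genotype per locus from the labels of several learners can be illustrated with a small expectation-maximization sketch. This is not the EMVC-2 implementation: it is a generic Dawid-Skene-style estimator, and the function names, the smoothing constant, the three genotype classes (0, 1, 2) and the toy votes are all assumptions made for illustration.

```python
from collections import Counter

def em_consensus(labels, n_classes, n_iter=20, smoothing=1e-2):
    """labels[i][j] is the class voted by learner j at locus i (hypothetical input).
    Returns one posterior distribution over classes per locus."""
    n_items, n_learners = len(labels), len(labels[0])

    # Initialise the posteriors with vote fractions (a soft majority vote).
    post = []
    for row in labels:
        counts = Counter(row)
        post.append([counts.get(k, 0) / n_learners for k in range(n_classes)])

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per learner,
        # conf[j][k][l] = P(learner j says l | true class is k).
        prior = [sum(p[k] for p in post) / n_items for k in range(n_classes)]
        conf = [[[smoothing] * n_classes for _ in range(n_classes)]
                for _ in range(n_learners)]
        for p, row in zip(post, labels):
            for j, lab in enumerate(row):
                for k in range(n_classes):
                    conf[j][k][lab] += p[k]
        for j in range(n_learners):
            for k in range(n_classes):
                total = sum(conf[j][k])
                conf[j][k] = [c / total for c in conf[j][k]]

        # E-step: posterior of the true class at each locus given all votes.
        new_post = []
        for row in labels:
            weights = []
            for k in range(n_classes):
                lik = prior[k]
                for j, lab in enumerate(row):
                    lik *= conf[j][k][lab]
                weights.append(lik)
            z = sum(weights)
            new_post.append([w / z for w in weights])
        post = new_post
    return post

def most_likely(post):
    """Pick the class with the highest posterior at each locus."""
    return [max(range(len(p)), key=p.__getitem__) for p in post]

# Toy votes from three hypothetical learners; the third one is the noisiest.
votes = [[0, 0, 0], [1, 1, 1], [2, 2, 2], [0, 0, 1],
         [1, 1, 0], [2, 2, 2], [0, 0, 0], [1, 1, 2]]
posteriors = em_consensus(votes, n_classes=3)
calls = most_likely(posteriors)
```

Unlike a plain majority vote, the per-learner confusion matrices let the estimator down-weight a learner that disagrees often. The decision-tree validation step described in the abstract is not modeled here.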


Subjects
Motivation; Polymorphism, Single Nucleotide; Humans; Genomics/methods; Algorithms; High-Throughput Nucleotide Sequencing/methods; Nucleotides
2.
Bioinform Adv ; 2(1): vbac054, 2022.
Article in English | MEDLINE | ID: mdl-36699360

ABSTRACT

Motivation: The use of high precision for representing quality scores in nanopore sequencing data makes these scores hard to compress and, thus, responsible for most of the information stored in losslessly compressed FASTQ files. This motivates the investigation of the effect of quality score information loss on downstream analysis from nanopore sequencing FASTQ files. Results: We polished de novo assemblies for a mock microbial community and a human genome, and we called variants on a human genome. We repeated these experiments using various pipelines, under various coverage level scenarios and various quality score quantizers. In all cases, we found that the quantization of quality scores makes little difference to (and sometimes even improves) the results obtained with the original (non-quantized) data. This suggests that the precision that is currently used for nanopore quality scores may be unnecessarily high, and motivates the use of lossy compression algorithms for this kind of data. Moreover, we show that even a non-specialized compressor, such as gzip, yields large storage space savings after the quantization of quality scores. Availability and supplementary information: Quantizers are freely available for download at: https://github.com/mrivarauy/QS-Quantizer.
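The effect of quantization on a general-purpose compressor can be illustrated with a minimal sketch. The uniform binning below is an assumption chosen for simplicity and is not one of the quantizers distributed in QS-Quantizer; the function names, the bin width and the synthetic quality string (uniform random Phred+33 scores, not real nanopore data) are likewise hypothetical.

```python
import gzip
import random

def quantize_quality(qual, bin_width=8, offset=33, max_q=93):
    """Replace each Phred+33 character with the midpoint of its bin (lossy)."""
    out = []
    for ch in qual:
        q = ord(ch) - offset
        rep = min((q // bin_width) * bin_width + bin_width // 2, max_q)
        out.append(chr(rep + offset))
    return "".join(out)

def gzip_size(text):
    """Size in bytes of the gzip-compressed ASCII text."""
    return len(gzip.compress(text.encode("ascii")))

# Synthetic quality string with scores drawn uniformly from 1..40.
random.seed(0)
original = "".join(chr(33 + random.randint(1, 40)) for _ in range(50_000))
quantized = quantize_quality(original)

print("gzip bytes, original :", gzip_size(original))
print("gzip bytes, quantized:", gzip_size(quantized))
```

Fewer distinct symbols mean lower entropy per character, which is why gzip alone already benefits. Whether downstream results are preserved depends on the pipeline and is what the study evaluates; this sketch only shows the storage side.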

3.
Bioinformatics ; 37(24): 4862-4864, 2021 Dec 11.
Article in English | MEDLINE | ID: mdl-34128963

ABSTRACT

MOTIVATION: Nanopore sequencing technologies are rapidly gaining popularity, in part due to the massive amounts of genomic data they produce in short periods of time (up to 8.5 TB of data in <72 h). To reduce the costs of transmission and storage, efficient compression methods for this type of data are needed. RESULTS: We introduce RENANO, a reference-based lossless data compressor specifically tailored to FASTQ files generated with nanopore sequencing technologies. RENANO improves on its predecessor ENANO, currently the state of the art, by providing a more efficient base call sequence compression component. Two compression algorithms are introduced, corresponding to the following scenarios: (1) a reference genome is available without cost to both the compressor and the decompressor and (2) the reference genome is available only on the compressor side, and a compacted version of the reference is included in the compressed file. We compare the compression performance of RENANO against ENANO on several publicly available nanopore datasets. On average over all the datasets, RENANO improves on ENANO's base call sequence compression by 39.8% in scenario (1) and by 33.5% in scenario (2). As for total file compression, the average improvements are 12.7% and 10.6%, respectively. We also show that RENANO consistently outperforms the recent general-purpose genomic compressor Genozip. AVAILABILITY AND IMPLEMENTATION: RENANO is freely available for download at: https://github.com/guilledufort/RENANO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
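The general principle behind reference-based lossless coding of base calls can be sketched as follows. This is a toy illustration, not RENANO's algorithm: it assumes the alignment position is already known and handles substitutions only, whereas real nanopore reads also contain insertions and deletions. The function names and the example sequences are hypothetical.

```python
def encode_read(reference, read, pos):
    """Encode `read` as (position, length, substitutions) against the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the positions where the read differs from the reference.
    diffs = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, len(read), diffs

def decode_read(reference, encoded):
    """Rebuild the read exactly from the reference and the stored differences."""
    pos, length, diffs = encoded
    bases = list(reference[pos:pos + length])
    for i, b in diffs:
        bases[i] = b
    return "".join(bases)

reference = "ACGTACGTTAGC"
read = "GTACCTTA"  # matches reference[2:10] except for one substitution
encoded = encode_read(reference, read, pos=2)
decoded = decode_read(reference, encoded)
```

The decoder needs the same reference as the encoder, which corresponds to scenario (1) in the abstract. In scenario (2) the compressed file would additionally have to carry the parts of the reference that the reads actually use.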


Subjects
Data Compression; Nanopores; Software; High-Throughput Nucleotide Sequencing/methods; Algorithms; Data Compression/methods
4.
Bioinformatics ; 36(16): 4506-4507, 2020 Aug 15.
Article in English | MEDLINE | ID: mdl-32470109

ABSTRACT

MOTIVATION: The amount of genomic data generated globally is seeing explosive growth, leading to increasing needs for processing, storage and transmission resources, which motivates the development of efficient compression tools for these data. Work so far has focused mainly on the compression of data generated by short-read technologies. However, nanopore sequencing technologies are rapidly gaining popularity due to the advantages offered by the large increase in the average size of the produced reads, the reduction in their cost and the portability of the sequencing technology. We present ENANO (Encoder for NANOpore), a novel lossless compression algorithm especially designed for nanopore sequencing FASTQ files. RESULTS: The main focus of ENANO is on the compression of the quality scores, as they dominate the size of the compressed file. ENANO offers two modes, Maximum Compression and Fast (default), which trade off compression efficiency and speed. We tested ENANO, the current state-of-the-art compressor SPRING and the general compressor pigz on several publicly available nanopore datasets. The results show that the proposed algorithm consistently achieves the best compression performance (in both modes) on every considered nanopore dataset, with an average improvement over pigz and SPRING of >24.7% and 6.3%, respectively. In addition, in terms of encoding and decoding speeds, ENANO is 2.9× and 1.7× faster than SPRING, respectively, with memory consumption up to 0.2 GB. AVAILABILITY AND IMPLEMENTATION: ENANO is freely available for download at: https://github.com/guilledufort/EnanoFASTQ. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression; Nanopores; Algorithms; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA; Software