Results 1 - 20 of 36
1.
Front Genet ; 15: 1451730, 2024.
Article in English | MEDLINE | ID: mdl-39238787

ABSTRACT

Introduction: In the realm of next-generation sequencing datasets, various characteristics can be extracted through k-mer based analysis. Among these characteristics, genome size (GS) is one that can be estimated with relative ease, yet achieving satisfactory accuracy, especially in the context of heterozygosity, remains a challenge. Methods: In this study, we introduce a high-precision genome size estimator, GSET (Genome Size Estimation Tool), which is based on k-mer histogram correction. Results: We have evaluated GSET on both simulated and real datasets. The experimental results demonstrate that this tool can estimate genome size with greater precision, even surpassing the accuracy of state-of-the-art tools. Notably, GSET also performs satisfactorily on heterozygous datasets, where other tools struggle to produce usable results. Discussion: The processing model of GSET diverges from the popular data fitting models used by similar tools. Instead, it is derived from empirical data and incorporates a correction term to mitigate the impact of sequencing errors on genome size estimation. GSET is freely available for use and can be accessed at the following URL: https://github.com/Xingyu-Liao/GSET.
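The uncorrected k-mer estimate that tools in this space start from can be sketched in a few lines: genome size is roughly the total number of k-mer occurrences divided by the depth of the homozygous coverage peak. This is a minimal illustration of the classic approach, not GSET's corrected model; the function name, the error-depth cutoff, and the toy histogram are assumptions.

```python
# Classic (uncorrected) k-mer histogram genome size estimate:
# GS ~ total k-mer occurrences / depth of the homozygous coverage peak.

def estimate_genome_size(histogram, min_depth=2):
    """histogram: dict mapping k-mer depth -> number of distinct k-mers.
    Depths below min_depth are treated as error k-mers and skipped."""
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(kept, key=lambda d: kept[d])       # homozygous coverage peak
    total_kmers = sum(d * n for d, n in kept.items())   # total k-mer occurrences
    return total_kmers // peak_depth

# Toy histogram: error k-mers at depth 1, main coverage peak at depth 20.
hist = {1: 500_000, 19: 40_000, 20: 100_000, 21: 42_000}
size = estimate_genome_size(hist)
```

Heterozygosity splits the histogram into two peaks and sequencing errors inflate the low-depth bins, which is precisely where a simple estimator like this loses accuracy and where a histogram correction such as GSET's is aimed.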

2.
Article in English | MEDLINE | ID: mdl-38991976

ABSTRACT

Next-generation sequencing (NGS), represented by Illumina platforms, has been an essential cornerstone of basic and applied research. However, the sequencing error rate of 1 per 1,000 bp (10⁻³) represents a serious hurdle for research areas focusing on rare mutations, such as somatic mosaicism or microbe heterogeneity. By examining the high-fidelity sequencing methods developed in the past decade, we summarized three major factors underlying errors and the corresponding 12 strategies mitigating these errors. We then proposed a novel framework to classify 11 preexisting representative methods according to the corresponding combinatory strategies and identified three trends that emerged during methodological developments. We further extended this analysis to eight long-read sequencing methods, emphasizing error reduction strategies. Finally, we suggest two promising future directions that could achieve comparable or even higher accuracy with lower costs in both NGS and long-read sequencing.


Subject(s)
High-Throughput Nucleotide Sequencing , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/economics , Humans , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/economics , Mutation
3.
Virus Evol ; 10(1): veae013, 2024.
Article in English | MEDLINE | ID: mdl-38455683

ABSTRACT

High-coverage sequencing allows the study of variants occurring at low frequencies within samples, but is susceptible to false positives caused by sequencing error. Ion Torrent has a very low single nucleotide variant (SNV) error rate and has been employed for the majority of human papillomavirus (HPV) whole genome sequences. However, benchmarking of intrahost SNVs (iSNVs) has been challenging, partly due to limitations imposed by the HPV life cycle. We address this problem by deep sequencing three replicates for each of 31 samples of HPV type 18 (HPV18). Errors, defined as iSNVs observed in only one of three replicates, are dominated by C→T (G→A) changes, independently of trinucleotide context. True iSNVs, defined as those observed in all three replicates, instead show a more diverse SNV type distribution, with particularly elevated C→T rates in CCG context (CCG→CTG; CGG→CAG) and C→A rates in ACG context (ACG→AAG; CGT→CTT). Characterization of true iSNVs allowed us to develop two methods for detecting true variants: (1) VCFgenie, a dynamic binomial filtering tool which uses each variant's allele count and coverage instead of fixed frequency cut-offs; and (2) a machine learning binary classifier which trains eXtreme Gradient Boosting models on variant features such as quality and trinucleotide context. Each approach outperforms fixed-cut-off filtering of iSNVs, and performance is enhanced when both are used together. Our results provide improved methods for identifying true iSNVs in within-host applications across sequencing platforms, specifically using HPV18 as a case study.
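The dynamic binomial idea behind VCFgenie, testing each variant's allele count against the expected error at that site's own coverage rather than against a fixed frequency cut-off, can be sketched as follows. This is an illustration of the principle, not VCFgenie's actual interface; the per-base error rate and significance threshold are assumed values.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of the CDF."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

def passes_binomial_filter(alt_count, coverage, p_error=0.001, alpha=0.05):
    """Keep a variant only if its allele count is unlikely (p < alpha) to
    arise from sequencing error alone at this site's coverage."""
    return binom_sf(alt_count, coverage, p_error) < alpha

# The same 0.5% frequency passes at high coverage but not at low coverage,
# which a fixed frequency cut-off cannot express:
deep = passes_binomial_filter(alt_count=50, coverage=10_000)
shallow = passes_binomial_filter(alt_count=1, coverage=200)
```

The dynamic behaviour, a site-by-site decision driven by coverage, is what lets deep sites support calls far below any sensible global frequency threshold.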

4.
BMC Genomics ; 25(1): 45, 2024 Jan 09.
Article in English | MEDLINE | ID: mdl-38195441

ABSTRACT

BACKGROUND: Parameters adversely affecting the contiguity and accuracy of the assemblies from Illumina next-generation sequencing (NGS) are well described. However, past studies generally focused on their additive effects, overlooking their potential interactions possibly exacerbating one another's effects in a multiplicative manner. To investigate whether or not they act interactively on de novo genome assembly quality, we simulated sequencing data for 13 bacterial reference genomes, with varying levels of error rate, sequencing depth, PCR and optical duplicate ratios. RESULTS: We assessed the quality of assemblies from the simulated sequencing data with a number of contiguity and accuracy metrics, which we used to quantify both additive and multiplicative effects of the four parameters. We found that the tested parameters are engaged in complex interactions, exerting multiplicative, rather than additive, effects on assembly quality. Also, the ratio of non-repeated regions and GC% of the original genomes can shape how the four parameters affect assembly quality. CONCLUSIONS: We provide a framework for consideration in future studies using de novo genome assembly of bacterial genomes, e.g. in choosing the optimal sequencing depth, balancing between its positive effect on contiguity and negative effect on accuracy due to its interaction with error rate. Furthermore, the properties of the genomes to be sequenced also should be taken into account, as they might influence the effects of error sources themselves.


Subject(s)
Genome, Bacterial , Research Design , Benchmarking , High-Throughput Nucleotide Sequencing
5.
ACS Synth Biol ; 12(12): 3567-3577, 2023 Dec 15.
Article in English | MEDLINE | ID: mdl-37961855

ABSTRACT

A comprehensive error analysis of DNA-stored data during processing, such as DNA synthesis and sequencing, is crucial for reliable DNA data storage. Both synthesis and sequencing errors depend on the sequence and the transition of bases of nucleotides; ignoring either one of the error sources leads to technical challenges in minimizing the error rate. Here, we present a methodology and toolkit that utilizes an oligonucleotide library generated from a 10-base-shifted sequence array, which is individually labeled with unique molecular identifiers, to delineate and profile DNA synthesis and sequencing errors simultaneously. This methodology enables position- and sequence-independent error profiling of both DNA synthesis and sequencing. Using this toolkit, we report base transitional errors in both synthesis and sequencing in general DNA data storage as well as degenerate-base-augmented DNA data storage. The methodology and data presented will contribute to the development of DNA sequence designs with minimal error.


Subject(s)
DNA , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , DNA/genetics , DNA Replication , Nucleotides/genetics
6.
Biomolecules ; 13(6)2023 06 02.
Article in English | MEDLINE | ID: mdl-37371514

ABSTRACT

The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences.
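The spaced k-mer embedding the authors found most accurate can be illustrated with a small sketch: a binary mask marks which positions in each window contribute to the k-mer, and counting all gapped patterns yields the fixed-length vector a classifier needs. The mask and function name are illustrative assumptions, not the paper's implementation.

```python
from itertools import product

def spaced_kmer_vector(seq, mask="1101"):
    """Embed a DNA sequence as a fixed-length count vector of spaced k-mers.
    mask: '1' = position contributes to the k-mer, '0' = ignored ('don't care')."""
    keep = [i for i, m in enumerate(mask) if m == "1"]
    width = len(mask)
    patterns = ["".join(p) for p in product("ACGT", repeat=len(keep))]
    index = {p: i for i, p in enumerate(patterns)}
    vec = [0] * len(patterns)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        key = "".join(window[i] for i in keep)
        if key in index:        # windows with ambiguous bases (e.g. N) are skipped
            vec[index[key]] += 1
    return vec

v = spaced_kmer_vector("ACGTACGT")  # 4**3 = 64-dimensional count vector
```

Because the vector length depends only on the mask, sequences of different lengths (including error-incorporated variants of the same genome) map into the same feature space, which is the fixed-length property the abstract credits for the high accuracy.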


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Sequence Analysis, DNA/methods , Pandemics , High-Throughput Nucleotide Sequencing/methods , Algorithms , Machine Learning
7.
Biodivers Data J ; 11: e96480, 2023.
Article in English | MEDLINE | ID: mdl-38327328

ABSTRACT

Here, we introduce VLF, an R package to determine the distribution of very low frequency variants (VLFs) in nucleotide and amino acid sequences for the analysis of errors in DNA sequence records. The package allows users to assess VLFs in aligned and trimmed protein-coding sequences by automatically calculating the frequency of nucleotides or amino acids in each sequence position and outputting those that occur under a user-specified frequency (default of p = 0.001). These results can then be used to explore fundamental population genetic and phylogeographic patterns, mechanisms and processes at the microevolutionary level, such as nucleotide and amino acid sequence conservation. Our package extends earlier work pertaining to an implementation of VLF analysis in Microsoft Excel, which was found to be both computationally slow and error prone. We compare those results to our own herein. Results between the two implementations are found to be highly consistent for a large DNA barcode dataset of bird species. Differences in results are readily explained by both manual human error and inadequate Linnean taxonomy (specifically, species synonymy). Here, VLF is also applied to a subset of avian barcodes to assess the extent of biological artifacts at the species level for Canada goose (Branta canadensis), as well as within a large dataset of DNA barcodes for fishes of forensic and regulatory importance. The novelty of VLF and its benefit over the previous implementation include its high level of automation, speed, scalability and ease-of-use, each a desirable characteristic that will be extremely valuable as more sequence data are rapidly accumulated in popular reference databases, such as BOLD and GenBank.
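The core VLF computation reduces to per-column frequency counting; the sketch below flags residues whose within-column frequency falls below the threshold (the package defaults to p = 0.001). It is an illustrative re-implementation in Python, not the R package's code, and the toy threshold is chosen so the small example produces a hit.

```python
def very_low_freq_variants(alignment, threshold=0.001):
    """Return (sequence_index, position, residue) triples for residues whose
    frequency within their alignment column falls below `threshold`.
    alignment: list of equal-length nucleotide or amino acid sequences."""
    n = len(alignment)
    hits = []
    for pos in range(len(alignment[0])):
        column = [seq[pos] for seq in alignment]
        counts = {}
        for residue in column:
            counts[residue] = counts.get(residue, 0) + 1
        for idx, residue in enumerate(column):
            if counts[residue] / n < threshold:
                hits.append((idx, pos, residue))
    return hits

# Three aligned sequences; the T in the third sequence is the rare residue.
aln = ["ACGA", "ACGA", "ACTA"]
rare = very_low_freq_variants(aln, threshold=0.4)
```

On real barcode datasets with thousands of sequences per alignment, the default p = 0.001 isolates singleton residues that are more plausibly sequencing or transcription errors than true variation.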

8.
BMC Bioinformatics ; 22(1): 552, 2021 Nov 12.
Article in English | MEDLINE | ID: mdl-34772337

ABSTRACT

BACKGROUND: With the rapid development of long-read sequencing technologies, it is possible to reveal the full spectrum of genetic structural variation (SV). However, the high cost, finite read length, and high sequencing error of long-read data greatly limit the widespread adoption of SV calling. Therefore, it is urgent to establish guidance concerning sequencing coverage, read length, and error rate to maintain high SV yields while achieving the lowest cost. RESULTS: In this study, we generated a full range of simulated error-prone long-read datasets covering various sequencing settings and comprehensively evaluated the performance of SV calling with state-of-the-art long-read SV detection methods. The benchmark results demonstrate that almost all SV callers perform better when the long-read data reach 20× coverage, 20 kbp average read length, and error rates of approximately 10%–7.5% or below 1%. Furthermore, high sequencing coverage is the most influential factor in promoting SV calling, but it also directly determines the cost. CONCLUSIONS: Based on the comprehensive evaluation results, we provide important guidelines for selecting long-read sequencing settings for efficient SV calling. We believe these recommended settings will have significant guiding value in cutting-edge genomic studies and clinical practice.


Subject(s)
Benchmarking , Genomics , Diagnostic Tests, Routine , Genomic Structural Variation , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA
9.
mSystems ; 6(6): e0069721, 2021 Dec 21.
Article in English | MEDLINE | ID: mdl-34751586

ABSTRACT

16S rRNA gene sequencing is a common and cost-effective technique for characterization of microbial communities. Recent bioinformatics methods enable high-resolution detection of sequence variants of only one nucleotide difference. In this study, we utilized a very fast HashMap-based approach to detect sequence variants in six publicly available 16S rRNA gene data sets. We then use the normal distribution combined with locally estimated scatterplot smoothing (LOESS) regression to estimate background error rates as a function of sequencing depth for individual clusters of sequences. This method is computationally efficient and produces inference that yields sets of variants that are conservative and well supported by reference databases. We argue that this approach to inference is fast, simple, and scalable to large data sets and provides a high-resolution set of sequence variants which are less likely to be the result of sequencing error. IMPORTANCE Recent bioinformatics development has enabled the detection of sequence variants with a high resolution of only one single-nucleotide difference in 16S rRNA gene sequence data. Despite this progress, there are several limitations that can be associated with variant calling pipelines, such as producing a large number of low-abundance sequence variants which need to be filtered out with arbitrary thresholds in downstream analyses or having a slow runtime. In this report, we introduce a fast and scalable algorithm which infers sequence variants based on the estimation of a normally distributed background error as a function of sequencing depth. Our pipeline has attractive performance characteristics, can be used independently or in parallel with other variant callers, and provides explicit P values for each variant evaluating the hypothesis that a variant is caused by sequencing error.
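The inference step described above can be sketched with a normal approximation to the background error (the paper additionally smooths the error estimate across sequencing depths with LOESS, which this sketch omits). The error rate and thresholds are assumed values.

```python
from statistics import NormalDist

def variant_p_value(variant_count, depth, error_rate):
    """One-sided P value that `variant_count` reads at `depth` arose from a
    normally approximated background error with mean depth * error_rate."""
    mean = depth * error_rate
    sd = (depth * error_rate * (1 - error_rate)) ** 0.5  # binomial sd, normal approx.
    if sd == 0:
        return 0.0 if variant_count > mean else 1.0
    return 1.0 - NormalDist(mean, sd).cdf(variant_count)

p_real = variant_p_value(40, 10_000, 0.001)   # 40 reads vs. ~10 expected errors
p_noise = variant_p_value(11, 10_000, 0.001)  # barely above the error mean
```

Reporting an explicit P value per candidate variant, as here, is what lets downstream analyses avoid the arbitrary low-abundance cut-offs the abstract criticizes.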

10.
Genet Epidemiol ; 45(5): 537-548, 2021 07.
Article in English | MEDLINE | ID: mdl-33998042

ABSTRACT

This study sets out to establish the suitability of saliva-based whole-genome sequencing (WGS) through a comparison against blood-based WGS. To fully appraise the observed differences, we developed a novel technique of pseudo-replication. We also investigated the potential of characterizing individual salivary microbiomes from non-human DNA fragments found in saliva. We observed that the majority of discordant genotype calls between blood and saliva fell into known regions of the human genome that are typically sequenced with low confidence; and could be identified by quality control measures. Pseudo-replication demonstrated that the levels of discordance between blood- and saliva-derived WGS data were entirely similar to what one would expect between technical replicates if an individual's blood or saliva had been sequenced twice. Finally, we successfully sequenced salivary microbiomes in parallel to human genomes as demonstrated by a comparison against the Human Microbiome Project.


Subject(s)
Microbiota , Saliva , Genome, Human , Genotype , Humans , Microbiota/genetics , Whole Genome Sequencing
11.
BioData Min ; 14(1): 27, 2021 Apr 23.
Article in English | MEDLINE | ID: mdl-33892748

ABSTRACT

BACKGROUND: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since it allows us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. RESULTS: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method's versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. 
CONCLUSION: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
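The Mendelian-error signal that drives these estimates is simple to compute for a parent-parent-child trio: a biallelic site is an error candidate when no combination of one maternal and one paternal allele can produce the child's genotype. The sketch below illustrates the signal only, not the paper's full precision/recall estimator.

```python
def mendelian_error(child, mother, father):
    """True if the child's biallelic genotype is impossible given the parents.
    Genotypes are (allele, allele) tuples, e.g. (0, 1) for a 0/1 heterozygote."""
    return not any(
        sorted((m, f)) == sorted(child) for m in mother for f in father
    )

def trio_error_rate(sites):
    """Fraction of (child, mother, father) genotype triples violating inheritance."""
    return sum(mendelian_error(c, m, f) for c, m, f in sites) / len(sites)

sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent: child must be 0/1
    ((1, 1), (0, 0), (0, 1)),  # error: child cannot inherit a 1 from the mother
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 1), (0, 1), (0, 1)),  # consistent
]
rate = trio_error_rate(sites)
```

Only a fraction of genotyping errors surface as Mendelian violations (many are inheritance-consistent by chance), which is why the paper extrapolates from the observed violation rate to genome-wide per-sample estimates rather than using it directly.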

12.
Front Immunol ; 12: 778298, 2021.
Article in English | MEDLINE | ID: mdl-35003093

ABSTRACT

Antibody repertoire sequencing (Rep-seq) has been widely used to reveal repertoire dynamics and to interrogate antibodies of interest at single-nucleotide resolution. However, polymerase chain reaction (PCR) amplification introduces extensive artifacts, including chimeras and nucleotide errors, leading to false discovery of antibodies and incorrect assessment of somatic hypermutations (SHMs), which subsequently mislead downstream investigations. Here, we develop a novel approach named DUMPArts, which improves the accuracy of antibody repertoires by labeling each sample with dual barcodes and each molecule with dual unique molecular identifiers (UMIs) via minimal PCR amplification to remove artifacts. Tested on ultra-deep Rep-seq data, DUMPArts removed inter-sample chimeras, which cause artifactual shared clones and constitute approximately 15% of reads in the library, as well as intra-sample chimeras, which carry erroneous SHMs and constitute approximately 20% of reads, and corrected base errors and amplification biases by consensus building. The removal of these artifacts will provide an accurate assessment of antibody repertoires and benefit related studies, especially mAb discovery and antibody-guided vaccine design.


Subject(s)
Antibodies/analysis , High-Throughput Nucleotide Sequencing/methods , Polymerase Chain Reaction , Antibodies/genetics , Artifacts , Cells, Cultured , Gene Library , Healthy Volunteers , Humans , Leukocytes, Mononuclear , Primary Cell Culture , Vaccine Development/methods
13.
Hum Immunol ; 82(7): 488-495, 2021 Jul.
Article in English | MEDLINE | ID: mdl-32386782

ABSTRACT

Next-generation sequencing (NGS) has been widely adopted for clinical HLA typing and advanced immunogenetics research. Current methodologies still face challenges in resolving cis-trans ambiguity involving distant variant positions, and the turnaround time is affected by testing volume and batching. Nanopore sequencing may become a promising addition to the existing options for HLA typing. The technology delivered by the MinION sequencer of Oxford Nanopore Technologies (ONT) can record the ionic current changes during the translocation of DNA/RNA strands through transmembrane pores and translate the signals to sequence reads. It features simple and flexible library preparations, long sequencing reads, portable and affordable sequencing devices, and rapid, real-time sequencing. However, the error rate of the sequencing reads is high and remains a hurdle for its broad application. This review article will provide a brief overview of this technology and then focus on the opportunities and challenges of using nanopore sequencing for high-resolution HLA typing and immunogenetics research.


Subject(s)
Alleles , HLA Antigens/genetics , Histocompatibility Testing , Histocompatibility Testing/methods , Humans , Immunogenetics/methods , Nanopore Sequencing
14.
BMC Genomics ; 21(Suppl 10): 753, 2020 Nov 18.
Article in English | MEDLINE | ID: mdl-33208104

ABSTRACT

BACKGROUND: The emergence of third-generation sequencing technology, featuring longer read lengths, represents a great advance over next-generation sequencing and has greatly promoted biological research. However, third-generation sequencing data have a high sequencing error rate, which inevitably affects downstream analysis. Although sequencing error rates have improved in recent years, large amounts of data were produced at high error rates, and discarding them would be enormously wasteful. Error correction for third-generation sequencing data is therefore especially important. Existing error correction methods perform poorly at heterozygous sites, which are ubiquitous in diploid and polyploid organisms, so error correction algorithms for heterozygous loci are lacking, especially at low coverage. RESULTS: In this article, we propose an error correction method named QIHC. QIHC is a hybrid correction method that requires both next-generation and third-generation sequencing data. QIHC greatly enhances the sensitivity of distinguishing heterozygous sites from sequencing errors, which leads to high error correction accuracy. To achieve this, QIHC establishes a set of probabilistic models based on a Bayesian classifier to estimate the heterozygosity of a site, making a judgment by calculating posterior probabilities. The proposed method consists of three modules, which respectively generate a pseudo reference sequence, obtain the read alignments, and estimate the heterozygosity of sites and correct the reads harboring them. The last module is the core of QIHC and is designed to handle the calculations for the multiple possible cases at a heterozygous site. The other two modules enable reads to be mapped to the pseudo reference sequence, which avoids the inefficient multiple mappings adopted by existing error correction methods. 
CONCLUSIONS: To verify the performance of our method, we compared QIHC with Canu and Jabba in several respects. As QIHC is a hybrid correction method, we first conducted groups of experiments under different coverages of next-generation sequencing data; QIHC is far ahead of Jabba in accuracy. We then varied the coverage of the third-generation sequencing data and again compared the performance of Canu, Jabba, and QIHC. QIHC outperforms the other two methods in the accuracy of both correcting sequencing errors and identifying heterozygous sites, especially at low coverage. We also compared Canu and QIHC at different error rates of the third-generation sequencing data, and QIHC still performs better. Therefore, QIHC is superior to existing error correction methods when heterozygous sites are present.
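QIHC's judgment at a candidate site can be illustrated with a two-hypothesis Bayesian posterior: the alternate-allele fraction is about 0.5 if the site is heterozygous, versus roughly the sequencing error rate if it is homozygous. The error rate and prior below are assumed values, and QIHC's actual models cover more cases than this sketch.

```python
from math import comb

def posterior_het(ref_count, alt_count, error_rate=0.1, prior_het=0.001):
    """Posterior probability that a site is heterozygous, comparing
    Binomial(n, 0.5) (het) against Binomial(n, error_rate) (hom + errors)."""
    n = ref_count + alt_count
    lik_het = comb(n, alt_count) * 0.5 ** n
    lik_hom = comb(n, alt_count) * error_rate ** alt_count * (1 - error_rate) ** ref_count
    numerator = lik_het * prior_het
    return numerator / (numerator + lik_hom * (1 - prior_het))

balanced = posterior_het(10, 10)  # alternate allele in half the reads: likely het
lopsided = posterior_het(20, 1)   # a single alternate read looks like an error
```

Even with a small heterozygosity prior, a balanced allele split overwhelms the error hypothesis, while a lone alternate read does not, which is the distinction that lets a corrector avoid "fixing" genuine heterozygous variation.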


Subject(s)
High-Throughput Nucleotide Sequencing , Models, Statistical , Algorithms , Bayes Theorem , Sequence Analysis, DNA
15.
Viruses ; 12(10)2020 10 20.
Article in English | MEDLINE | ID: mdl-33092085

ABSTRACT

High-throughput sequencing platforms such as Illumina's are an efficient way to understand sequence variation within viral populations. However, challenges exist in distinguishing process-introduced error from biological variance, which significantly impacts our ability to identify sub-consensus single-nucleotide variants (SNVs). Here we have taken a systematic approach to evaluate laboratory and bioinformatic pipelines to accurately identify low-frequency SNVs in viral populations. Artificial DNA and RNA "populations" were created by introducing known SNVs at predetermined frequencies into template nucleic acid before being sequenced on an Illumina MiSeq platform. These were used to assess the effects of abundance and starting input material type, technical replicates, read length and quality, short-read aligner, and percentage frequency thresholds on the ability to accurately call variants. Analyses revealed that the abundance and type of input nucleic acid had the greatest impact on the accuracy of SNV calling as measured by a micro-averaged Matthews correlation coefficient score, with DNA and high RNA inputs (10⁷ copies) allowing variants to be called at a 0.2% frequency. Reduced RNA input (10⁵ copies) required more technical replicates to maintain accuracy, while low RNA inputs (10³ copies) suffered from consensus-level errors. Base errors at specific motifs found in all technical replicates were also identified; these can be excluded to further increase SNV calling accuracy. These findings indicate that samples with low RNA inputs should be excluded for SNV calling and reinforce the importance of optimising the technical and bioinformatic steps in pipelines used to accurately identify sequence variants.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Polymorphism, Single Nucleotide/genetics , Viruses/genetics , DNA, Viral , Genes, Viral , Genetic Variation , Genome, Viral , In Vitro Techniques/methods , Models, Theoretical , RNA, Viral
17.
Genome Biol Evol ; 12(4): 309-324, 2020 04 01.
Article in English | MEDLINE | ID: mdl-32163141

ABSTRACT

Lichens are valuable models in symbiosis research and promising sources of biosynthetic genes for biotechnological applications. Most lichenized fungi grow slowly, resist aposymbiotic cultivation, and are poor candidates for experimentation. Obtaining contiguous, high-quality genomes for such symbiotic communities is technically challenging. Here, we present the first assembly of a lichen holo-genome from metagenomic whole-genome shotgun data comprising both PacBio long reads and Illumina short reads. The nuclear genomes of the two primary components of the lichen symbiosis-the fungus Umbilicaria pustulata (33 Mb) and the green alga Trebouxia sp. (53 Mb)-were assembled at contiguities comparable to single-species assemblies. The analysis of the read coverage pattern revealed a relative abundance of fungal to algal nuclei of ∼20:1. Gap-free, circular sequences for all organellar genomes were obtained. The bacterial community is dominated by Acidobacteriaceae and encompasses strains closely related to bacteria isolated from other lichens. Gene set analyses showed no evidence of horizontal gene transfer from algae or bacteria into the fungal genome. Our data suggest a lineage-specific loss of a putative gibberellin-20-oxidase in the fungus, a gene fusion in the fungal mitochondrion, and a relocation of an algal chloroplast gene to the algal nucleus. Major technical obstacles during reconstruction of the holo-genome were coverage differences among individual genomes surpassing three orders of magnitude. Moreover, we show that GC-rich inverted repeats paired with nonrandom sequencing error in PacBio data can result in missing gene predictions. This likely poses a general problem for genome assemblies based on long reads.


Subject(s)
Ascomycota/genetics , Genome, Fungal , Lichens/genetics , Metagenome , Symbiosis , Ascomycota/growth & development , Lichens/growth & development , Phylogeny
18.
Genes (Basel) ; 11(1)2020 01 02.
Article in English | MEDLINE | ID: mdl-31906474

ABSTRACT

A standard practice in palaeogenome analysis is the conversion of mapped short read data into pseudohaploid sequences, frequently by selecting a single high-quality nucleotide at random from the stack of mapped reads. This controls for biases due to differential sequencing coverage, but it does not control for differential rates and types of sequencing error, which are frequently large and variable in datasets obtained from ancient samples. These errors have the potential to distort phylogenetic and population clustering analyses, and to mislead tests of admixture using D statistics. We introduce Consensify, a method for generating pseudohaploid sequences, which controls for biases resulting from differential sequencing coverage while greatly reducing error rates. The error correction is derived directly from the data itself, without the requirement for additional genomic resources or simplifying assumptions such as contemporaneous sampling. For phylogenetic and population clustering analysis, we find that Consensify is less affected by artefacts than methods based on single read sampling. For D statistics, Consensify is more resistant to false positives and appears to be less affected by biases resulting from different laboratory protocols than other frequently used methods. Although Consensify is developed with palaeogenomic data in mind, it is applicable to any low- to medium-coverage short-read dataset. We predict that Consensify will be a useful tool for future studies of palaeogenomes.
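Consensify's central move, calling a base from the consensus of a few sampled reads rather than from a single randomly sampled read, can be sketched as below (the published method samples up to three reads and requires at least two to agree). The function name and no-call symbol are illustrative.

```python
import random
from collections import Counter

def pseudohaploid_consensus(read_bases, min_support=2):
    """Sample up to three reads at a position and call the majority base,
    requiring at least `min_support` agreeing reads; otherwise no call ('N').
    Single-read sampling, by contrast, propagates any error on the chosen read."""
    sample = random.sample(read_bases, min(3, len(read_bases)))
    base, count = Counter(sample).most_common(1)[0]
    return base if count >= min_support else "N"

call = pseudohaploid_consensus(["A", "A", "A", "G"])  # the lone error read is outvoted
no_call = pseudohaploid_consensus(["A", "C", "G"])    # no two reads agree: no call
```

Requiring agreement trades a little coverage (positions without two concordant reads become no-calls) for a large reduction in the per-site error rate, which is the trade-off the abstract describes.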


Subject(s)
DNA, Ancient/analysis , Sequence Analysis, DNA/methods , Algorithms , Base Sequence/genetics , Chromosome Mapping/methods , Cluster Analysis , Genome/genetics , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Phylogeny
19.
Front Oncol ; 9: 851, 2019.
Article in English | MEDLINE | ID: mdl-31552176

ABSTRACT

The insufficient standardization of diagnostic next-generation sequencing (NGS) still limits its implementation in clinical practice, with the correct detection of mutations at low variant allele frequencies (VAF) facing particular challenges. We address here the standardization of sequencing coverage depth in order to minimize the probability of false positive and false negative results, the latter being underestimated in clinical NGS. There is currently no consensus on the minimum coverage depth, and so each laboratory has to set its own parameters. To assist laboratories with the determination of the minimum coverage parameters, we provide here a user-friendly coverage calculator. Using the sequencing error only, we recommend a minimum depth of coverage of 1,650 together with a threshold of at least 30 mutated reads for a targeted NGS mutation analysis of ≥3% VAF, based on the binomial probability distribution. Moreover, our calculator also allows adding assay-specific errors occurring during DNA processing and library preparation, thus accounting for the overall error of a specific NGS assay. The estimation of correct coverage depth is recommended as a starting point when assessing thresholds of an NGS assay. Our study also points to the need for guidance regarding the minimum technical requirements, which based on our experience should include the limit of detection (LOD), overall NGS assay error, input, source and quality of DNA, coverage depth, number of variant supporting reads, and total number of target reads covering the variant region. Further studies are needed to define the minimum technical requirements and their reporting in diagnostic NGS.
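The binomial reasoning behind such a coverage calculator can be sketched as follows: find the smallest depth at which a true variant at the target VAF produces at least the required number of supporting reads with high probability. This sketch covers only the false-negative side, so it will not reproduce the 1,650 figure above, which also accounts for sequencing error; the detection probability and search step are assumptions.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of the CDF."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

def min_coverage(vaf, min_alt_reads, detection_prob=0.95, step=10):
    """Smallest depth (searched in increments of `step`) at which a variant at
    frequency `vaf` yields at least `min_alt_reads` supporting reads with
    probability `detection_prob`."""
    depth = min_alt_reads
    while prob_at_least(min_alt_reads, depth, vaf) < detection_prob:
        depth += step
    return depth

cov = min_coverage(vaf=0.03, min_alt_reads=30)
```

Tightening `detection_prob` or adding an assay-specific error term on the false-positive side pushes the required depth upward, which is the direction the published 1,650 recommendation reflects.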

20.
BMC Bioinformatics ; 20(1): 352, 2019 Jun 21.
Article in English | MEDLINE | ID: mdl-31226925

ABSTRACT

BACKGROUND: Third-generation sequencing platforms, such as PacBio sequencing, have developed rapidly in recent years. PacBio sequencing generates much longer reads than second-generation (next-generation sequencing, NGS) technologies and has unique sequencing error patterns. An effective read simulator is essential to evaluate and promote the development of new bioinformatics tools for PacBio sequencing data analysis. RESULTS: We developed a new PacBio Sequencing Simulator (PaSS). It can learn sequence patterns from currently available PacBio sequencing data. In addition to the distribution of read lengths and error rates, we included a context-specific sequencing error model. Compared to existing PacBio sequencing simulators such as PBSIM, LongISLND, and NPBSS, PaSS performed better in many respects. Assembly tests also suggest that reads simulated by PaSS are the most similar to experimental sequencing data. CONCLUSION: PaSS is an effective read simulator for PacBio sequencing and will facilitate the evaluation and development of new analysis tools for third-generation sequencing data.
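At its simplest, a read simulator of the kind evaluated here samples a reference window and injects errors; the sketch below uses a uniform substitution rate, whereas PaSS learns context-specific, indel-heavy PacBio error patterns from real data. Names and the default rate are illustrative.

```python
import random

def simulate_read(reference, start, length, sub_rate=0.1):
    """Extract a read from `reference` and inject random substitution errors.
    (Real PacBio error profiles are context-specific and dominated by indels.)"""
    bases = "ACGT"
    read = []
    for base in reference[start:start + length]:
        if random.random() < sub_rate:
            base = random.choice([b for b in bases if b != base])
        read.append(base)
    return "".join(read)

read = simulate_read("ACGTACGTACGTACGT", start=2, length=8)
```

Evaluating a variant caller or assembler on reads simulated this way gives a known ground truth, which is exactly what makes simulators useful benchmarks even though their error models are simplifications.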


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA , Software , Animals , Arabidopsis/genetics , Caenorhabditis elegans/genetics , Computer Simulation , Escherichia coli/genetics