Search | BVS Regional Portal

Efficient phylogenetic tree inference for massive taxonomic datasets: harnessing the power of a server to analyze 1 million taxa.

Piñeiro, César; Pichel, Juan C.

Gigascience ; 13, 2024 Jan 02.

Article in English | MEDLINE | ID: mdl-39115958

ABSTRACT

BACKGROUND: Phylogenies play a crucial role in biological research. Unfortunately, the search for the optimal phylogenetic tree incurs significant computational costs, and most of the existing state-of-the-art tools cannot deal with extremely large datasets in reasonable times. RESULTS: In this work, we introduce the new VeryFastTree code (version 4.0), which is able to construct a tree on 1 server using single-precision arithmetic from a massive 1 million alignment dataset in only 36 hours, which is 3 times and 3.2 times faster than its previous version and FastTree-2, respectively. This new version further boosts performance by parallelizing all tree traversal operations during the tree construction process, including subtree pruning and regrafting moves. Additionally, it introduces significant new features such as support for new and compressed file formats, enhanced compatibility across a broader range of operating systems, and the integration of disk computing functionality. The latter feature is particularly advantageous for users without access to high-end servers, as it allows them to manage very large datasets, albeit with an increase in computing time. CONCLUSIONS: Experimental results establish VeryFastTree as the fastest tool in the state-of-the-art for maximum likelihood phylogeny estimation. It is publicly available at https://github.com/citiususc/veryfasttree. In addition, VeryFastTree is included as a package in Bioconda, MacPorts, and all Debian-based Linux distributions.

Subject(s)

Phylogeny , Software , Algorithms , Computational Biology/methods , Classification/methods , Databases, Genetic

A machine learning approach to model the impact of line edge roughness on gate-all-around nanowire FETs while reducing the carbon footprint.

García-Loureiro, Antonio; Seoane, Natalia; Fernández, Julián G; Comesaña, Enrique; Pichel, Juan C.

PLoS One ; 18(7): e0288964, 2023.

Article in English | MEDLINE | ID: mdl-37486944

ABSTRACT

The performance and reliability of semiconductor devices scaled down to the sub-nanometer regime are being seriously affected by process-induced variability. To properly assess the impact of the different sources of fluctuations, such as line edge roughness (LER), statistical analyses involving large samples of device configurations are needed. The computational cost of such studies can be very high if 3D advanced simulation tools (TCAD) that include quantum effects are used. In this work, we present a machine learning approach to model the impact of LER on two gate-all-around nanowire FETs that is able to dramatically decrease the computational effort, thus reducing the carbon footprint of the study, while obtaining great accuracy. Finally, we demonstrate that transfer learning techniques can decrease the computing cost even further, bringing the carbon footprint of the study down to just 0.18 g of CO2 (whereas a single device TCAD study can produce up to 2.6 kg of CO2), while obtaining coefficient of determination values larger than 0.985 when using only 10% of the input samples.

Subject(s)

Carbon Footprint , Nanowires , Carbon Dioxide , Reproducibility of Results , Machine Learning

BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale.

Piñeiro, César; Pichel, Juan C.

Gigascience ; 12, 2022 Dec 28.

Article in English | MEDLINE | ID: mdl-37522758

ABSTRACT

BACKGROUND: High-throughput sequencing technologies have led to an unprecedented explosion in the amounts of sequencing data available, which are typically stored using FASTA and FASTQ files. We can find in the literature several tools to process and manipulate those types of files with the aim of transforming sequence data into biological knowledge. However, none of them are well suited to efficiently processing very large files, likely in the order of terabytes in the following years, since they are based on sequential processing. Only some routines of the well-known seqkit tool are partly parallelized. In any case, its scalability is limited to using a few threads on a single computing node. RESULTS: Our approach, BigSeqKit, takes advantage of a high-performance computing-Big Data framework to parallelize and optimize the commands included in seqkit with the aim of speeding up the manipulation of FASTA/FASTQ files. In this way, in most cases, it is from tens to hundreds of times faster than several state-of-the-art tools. At the same time, our toolkit is easy to use and install on any kind of hardware platform (local server or cluster), and its routines can be used as a bioinformatics library or from the command line. CONCLUSIONS: BigSeqKit is a very complete and ultra-fast toolkit to process and manipulate large FASTA and FASTQ files. It is publicly available at https://github.com/citiususc/BigSeqKit.

Subject(s)

Big Data , Computational Biology , Gene Library , High-Throughput Nucleotide Sequencing , Knowledge

A Big Data Platform for Real Time Analysis of Signs of Depression in Social Media.

Martínez-Castaño, Rodrigo; Pichel, Juan C; Losada, David E.

Int J Environ Res Public Health ; 17(13), 2020 Jul 01.

Article in English | MEDLINE | ID: mdl-32630341

ABSTRACT

In this paper we propose a scalable platform for real-time processing of Social Media data. The platform ingests huge amounts of contents, such as Social Media posts or comments, and can support Public Health surveillance tasks. The processing and analytical needs of multiple screening tasks can easily be handled by incorporating user-defined execution graphs. The design is modular and supports different processing elements, such as crawlers to extract relevant contents or classifiers to categorise Social Media. We describe here an implementation of a use case built on the platform that monitors Social Media users and detects early signs of depression.

Subject(s)

Depression/epidemiology , Social Media , Big Data

Very Fast Tree: speeding up the estimation of phylogenies for large alignments through parallelization and vectorization strategies.

Piñeiro, César; Abuín, José M; Pichel, Juan C.

Bioinformatics ; 36(17): 4658-4659, 2020 Nov 01.

Article in English | MEDLINE | ID: mdl-32573652

ABSTRACT

MOTIVATION: FastTree-2 is one of the most successful tools for inferring large phylogenies. With speed at the core of its design, there are still important issues in the FastTree-2 implementation that harm its performance and scalability. To deal with these limitations, we introduce VeryFastTree, a highly tuned implementation of the FastTree-2 tool that takes advantage of parallelization and vectorization strategies to boost performance. RESULTS: VeryFastTree is able to construct a tree on a standard server using double-precision arithmetic from an ultra-large 330k alignment in only 4.5 h, which is 7.8× and 3.5× faster than the sequential and best parallel FastTree-2 times, respectively. AVAILABILITY AND IMPLEMENTATION: VeryFastTree is available at the GitHub repository: https://github.com/citiususc/veryfasttree. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Software , Trees , Algorithms , Computers , Phylogeny , Sequence Alignment

A big data approach to metagenomics for all-food-sequencing.

Kobus, Robin; Abuín, José M; Müller, André; Hellmann, Sören Lukas; Pichel, Juan C; Pena, Tomás F; Hildebrandt, Andreas; Hankeln, Thomas; Schmidt, Bertil.

BMC Bioinformatics ; 21(1): 102, 2020 Mar 12.

Article in English | MEDLINE | ID: mdl-32164527

ABSTRACT

BACKGROUND: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches. RESULTS: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). CONCLUSIONS: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).

Subject(s)

Big Data , Food Analysis/methods , High-Throughput Nucleotide Sequencing/methods , Metagenomics/methods , Whole Genome Sequencing/methods , Biosurveillance , Genome, Bacterial , Metagenome , Microbiota/genetics , Software

PASTASpark: multiple sequence alignment meets Big Data.

Abuín, José M; Pena, Tomás F; Pichel, Juan C.

Bioinformatics ; 33(18): 2948-2950, 2017 Sep 15.

Article in English | MEDLINE | ID: mdl-28582480

ABSTRACT

MOTIVATION: One basic step in many bioinformatics analyses is the multiple sequence alignment. One of the state-of-the-art tools to perform multiple sequence alignment is PASTA (Practical Alignments using SATé and TrAnsitivity). PASTA supports multithreading but it is limited to processing datasets on shared memory systems. In this work we introduce PASTASpark, a tool that uses the Big Data engine Apache Spark to boost the performance of the alignment phase of PASTA, which is the most expensive task in terms of time consumption. RESULTS: Speedups up to 10× with respect to single-threaded PASTA were observed, which allows an ultra-large dataset of 200 000 sequences to be processed within the 24-h limit. AVAILABILITY AND IMPLEMENTATION: PASTASpark is an Open Source tool available at https://github.com/citiususc/pastaspark. CONTACT: josemanuel.abuin@usc.es. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms

SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data.

Abuín, José M; Pichel, Juan C; Pena, Tomás F; Amigo, Jorge.

PLoS One ; 11(5): e0155461, 2016.

Article in English | MEDLINE | ID: mdl-27182962

ABSTRACT

Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This fact has a huge impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. In this way, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state-of-the-art aligners take advantage of parallelization strategies. However, the existing solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology such as Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which assures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios showing noticeable results in terms of performance and scalability. A comparison to other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, with a GPL3 license.

Subject(s)

Computational Biology/methods , Genomics/methods , High-Throughput Nucleotide Sequencing , Software , Humans , Reproducibility of Results , Sequence Analysis, DNA/methods , Web Browser , Workflow

BigBWA: approaching the Burrows-Wheeler aligner to Big Data technologies.

Abuín, José M; Pichel, Juan C; Pena, Tomás F; Amigo, Jorge.

Bioinformatics ; 31(24): 4003-5, 2015 Dec 15.

Article in English | MEDLINE | ID: mdl-26323715

ABSTRACT

BigBWA is a new tool that uses the Big Data technology Hadoop to boost the performance of the Burrows-Wheeler aligner (BWA). Important reductions in the execution times were observed when using this tool. In addition, BigBWA is fault tolerant and it does not require any modification of the original BWA source code. AVAILABILITY AND IMPLEMENTATION: BigBWA is available at the project GitHub repository: https://github.com/citiususc/BigBWA.

Subject(s)

Sequence Alignment/methods , Software , Algorithms , Genomics
