Búsqueda | Portal Regional de la BVS

Fast, parallel, and cache-friendly suffix array construction.

Khan, Jamshed; Rubel, Tobias; Molloy, Erin; Dhulipala, Laxman; Patro, Rob.

Algorithms Mol Biol ; 19(1): 16, 2024 Apr 28.

Artículo en Inglés | MEDLINE | ID: mdl-38679714

RESUMEN

PURPOSE: String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize. METHODS: In this paper we present CAPS-SA, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort and utilizing an LCP-informed mergesort. Due to its design, CAPS-SA has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. RESULTS: We show that despite its simple design, CAPS-SA outperforms existing state-of-the-art parallel SA and LCP-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context SA and show that CAPS-SA can easily be extended to exploit this structure to obtain further speedups. We make our code publicly available at https://github.com/jamshed/CaPS-SA .

Dollo-CDP: a polynomial-time algorithm for the clade-constrained large Dollo parsimony problem.

Dai, Junyan; Rubel, Tobias; Han, Yunheng; Molloy, Erin K.

Algorithms Mol Biol ; 19(1): 2, 2024 Jan 08.

Artículo en Inglés | MEDLINE | ID: mdl-38191515

RESUMEN

The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set. This technique has yet to be used for parsimony problems where the input are binary characters, sometimes with missing values. Here, we introduce the clade-constrained character parsimony problem and present an algorithm that solves this problem for the Dollo criterion score in [Formula: see text] time, where n is the number of leaves, k is the number of characters, and [Formula: see text] is the set of clades used as constraints. Dollo parsimony, which requires traits/mutations to be gained at most once but allows them to be lost any number of times, is widely used for tumor phylogenetics as well as species phylogenetics, for example analyses of low-homoplasy retroelement insertions across the vertebrate tree of life. This motivated us to implement our algorithm in a software package, called Dollo-CDP, and evaluate its utility for analyzing retroelement insertion presence / absence patterns for bats, birds, toothed whales as well as simulated data. Our results show that Dollo-CDP can improve upon heuristic search from a single starting tree, often recovering a better scoring tree. Moreover, Dollo-CDP scales to data sets with much larger numbers of taxa than branch-and-bound while still having an optimality guarantee, albeit a more restricted one. Lastly, we show that our algorithm for Dollo parsimony can easily be adapted to Camin-Sokal parsimony but not Fitch parsimony.

Dropping diversity of products of large US firms: Models and measures.

Nambiar, Ananthan; Rubel, Tobias; McCaull, James; deVries, Jon; Bedau, Mark.

PLoS One ; 17(3): e0264330, 2022.

Artículo en Inglés | MEDLINE | ID: mdl-35294464

RESUMEN

It is widely assumed that in our lifetimes the products available in the global economy have become more diverse. This assumption is difficult to investigate directly, however, because it is difficult to collect the necessary data about every product in an economy each year. We solve this problem by mining publicly available textual descriptions of the products of every large US firms each year from 1997 to 2017. Although many aspects of economic productivity have been steadily rising during this period, our text-based measurements show that the diversity of the products of at least large US firms has steadily declined. This downward trend is visible using a variety of product diversity metrics, including some that depend on a measurement of the similarity of the products of every single pair of firms. The current state of the art in comprehensive and detailed firm-similarity measurements is a Boolean word vector model due to Hoberg and Phillips. We measure diversity using firm-similarities from this Boolean model and two more sophisticated variants, and we consistently observe a significant dropping trend in product diversity. These results make it possible to frame and start to test specific hypotheses for explaining the dropping product diversity trend.

Asunto(s)

Eficiencia

Reconciling Signaling Pathway Databases with Network Topologies.

Rubel, Tobias; Singh, Pramesh; Ritz, Anna.

Pac Symp Biocomput ; 27: 211-222, 2022.

Artículo en Inglés | MEDLINE | ID: mdl-34890150

RESUMEN

A major goal of molecular systems biology is to understand the coordinated function of genes or proteins in response to cellular signals and to understand these dynamics in the context of disease. Signaling pathway databases such as KEGG, NetPath, NCI-PID, and Panther describe the molecular interactions involved in different cellular responses. While the same pathway may be present in different databases, prior work has shown that the particular proteins and interactions differ across database annotations. However, to our knowledge no one has attempted to quantify their structural differences. It is important to characterize artifacts or other biases within pathway databases, which can provide a more informed interpretation for downstream analyses. In this work we consider signaling pathways as graphs and we use topological measures to study their structure. We find that topological characterization using graphlets (small, connected subgraphs) distinguishes signaling pathways from appropriate null models of interaction networks. Next, we quantify topological similarity across pathway databases. Our analysis reveals that the pathways harbor database-specific characteristics implying that even though these databases describe the same pathways, they tend to be systematically different from one another. We show that pathway-specific topology can be uncovered after accounting for database-specific structure. This work presents the first step towards elucidating common pathway structure beyond their specific database annotations.Data Availability: https://github.com/Reed-CompBio/pathway-reconciliation.

Asunto(s)

Biología Computacional , Transducción de Señal , Bases de Datos Factuales , Humanos , Proteínas , Biología de Sistemas

Graphery: interactive tutorials for biological network algorithms.

Zeng, Heyuan; Zhang, Jinbiao; Preising, Gabriel A; Rubel, Tobias; Singh, Pramesh; Ritz, Anna.

Nucleic Acids Res ; 49(W1): W257-W262, 2021 07 02.

Artículo en Inglés | MEDLINE | ID: mdl-34037782

RESUMEN

Networks have been an excellent framework for modeling complex biological information, but the methodological details of network-based tools are often described for a technical audience. We have developed Graphery, an interactive tutorial webserver that illustrates foundational graph concepts frequently used in network-based methods. Each tutorial describes a graph concept along with executable Python code that can be interactively run on a graph. Users navigate each tutorial using their choice of real-world biological networks that highlight the diverse applications of network algorithms. Graphery also allows users to modify the code within each tutorial or write new programs, which all can be executed without requiring an account. Graphery accepts ideas for new tutorials and datasets that will be shaped by both computational and biological researchers, growing into a community-contributed learning platform. Graphery is available at https://graphery.reedcompbio.org/.

Asunto(s)

Algoritmos , Modelos Biológicos , Programas Informáticos , Gráficos por Computador , Mapeo de Interacción de Proteínas , Transducción de Señal

RESUMEN

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA