Search | VHL Regional Portal

1.

Revisiting Viewing Graph Solvability: an Effective Approach Based on Cycle Consistency.

Arrigoni, Federica; Fusiello, Andrea; Rizzi, Romeo; Ricci, Elisa; Pajdla, Tomas.

IEEE Trans Pattern Anal Mach Intell ; PP, 2022 Oct 10.

Article in English | MEDLINE | ID: mdl-36215371

ABSTRACT

In structure from motion, the viewing graph is a graph where the vertices correspond to cameras (or images) and the edges represent the fundamental matrices. We provide a new formulation and an algorithm for determining whether a viewing graph is solvable, i.e., uniquely determines a set of projective cameras. The known theoretical conditions either do not fully characterize the solvability of all viewing graphs, or are extremely difficult to compute because they involve solving a system of polynomial equations with a large number of unknowns. The main result of this paper is a method to reduce the number of unknowns by exploiting cycle consistency. We advance the understanding of solvability by (i) finishing the classification of all minimal graphs up to 9 nodes, (ii) extending the practical verification of solvability to minimal graphs with up to 90 nodes, (iii) finally answering an open research question by showing that finite solvability is not equivalent to solvability, and (iv) formally drawing the connection with the calibrated case (i.e., parallel rigidity). Finally, we present an experiment on real data that shows that unsolvable graphs may appear in practice.

2.

Safety in Multi-Assembly via Paths Appearing in All Path Covers of a DAG.

Caceres, Manuel; Mumey, Brendan; Husic, Edin; Rizzi, Romeo; Cairo, Massimo; Sahlin, Kristoffer; Tomescu, Alexandru I.

IEEE/ACM Trans Comput Biol Bioinform ; 19(6): 3673-3684, 2022.

Article in English | MEDLINE | ID: mdl-34847041

ABSTRACT

A multi-assembly problem asks to reconstruct multiple genomic sequences from mixed reads sequenced from all of them. Standard formulations of such problems model a solution as a path cover in a directed acyclic graph, namely a set of paths that together cover all vertices of the graph. Since multi-assembly problems admit multiple solutions in practice, we consider an approach commonly used in standard genome assembly: output only partial solutions (contigs, or safe paths) that appear in all path cover solutions. We study constrained path covers, a restriction on the path cover solution that incorporates practical constraints arising in multi-assembly problems. We give efficient algorithms finding all maximal safe paths for constrained path covers. We compute the safe paths of splicing graphs constructed from transcript annotations of different species. Our algorithms run in less than 15 seconds per species and report RNA contigs that are over 99% precise and are up to 8 times longer than unitigs. Moreover, RNA contigs cover over 70% of the transcripts and their coding sequences in most cases. With their increased length compared to unitigs, high precision, and fast construction time, maximal safe paths can provide a better base set of sequences for transcript assembly programs.

Subject(s)

Algorithms, Genomics, Genome, Base Sequence, RNA

3.

MIPUP: minimum perfect unmixed phylogenies for multi-sampled tumors via branchings and ILP.

Husic, Edin; Li, Xinyue; Hujdurovic, Ademir; Mehine, Miika; Rizzi, Romeo; Mäkinen, Veli; Milanic, Martin; Tomescu, Alexandru I.

Bioinformatics ; 35(5): 769-777, 2019 03 01.

Article in English | MEDLINE | ID: mdl-30101335

ABSTRACT

MOTIVATION: Discovering the evolution of a tumor may help identify driver mutations and provide a more comprehensive view on the history of the tumor. Recent studies have tackled this problem using multiple samples sequenced from a tumor, and due to clinical implications, this has attracted great interest. However, such samples usually mix several distinct tumor subclones, which confounds the discovery of the tumor phylogeny. RESULTS: We study a natural problem formulation that requires decomposing the tumor samples into several subclones with the objective of forming a minimum perfect phylogeny. We propose an Integer Linear Programming formulation for it, and implement it in a method called MIPUP. We tested the ability of MIPUP and of four popular tools (LICHeE, AncesTree, CITUP and Treeomics) to reconstruct the tumor phylogeny. On simulated data, MIPUP shows up to a 34% improvement under the ancestor-descendant relations metric. On four real datasets, MIPUP's reconstructions proved to be generally more faithful than those of LICHeE. AVAILABILITY AND IMPLEMENTATION: MIPUP is available at https://github.com/zhero9/MIPUP as open source. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Neoplasms, Humans, Mutation, Neoplasms/genetics, Phylogeny, Linear Programming, Software

4.

Hardness of Covering Alignment: Phase Transition in Post-Sequence Genomics.

Rizzi, Romeo; Cairo, Massimo; Makinen, Veli; Tomescu, Alexandru I; Valenzuela, Daniel.

IEEE/ACM Trans Comput Biol Bioinform ; 16(1): 23-30, 2019.

Article in English | MEDLINE | ID: mdl-29994032

ABSTRACT

Covering alignment problems arise from recent developments in genomics; so-called pan-genome graphs are replacing reference genomes, and advances in haplotyping enable the full content of diploid genomes to be used as the basis of sequence analysis. In this paper, we show that the computational complexity will change for natural extensions of alignments to pan-genome representations and to diploid genomes. More broadly, our approach can also be seen as a minimal extension of sequence alignment to labeled directed acyclic graphs (labeled DAGs). Namely, we show that finding a covering alignment of two labeled DAGs is NP-hard even on binary alphabets. A covering alignment asks for two paths R1 (red) and G1 (green) in DAG D1 and two paths R2 (red) and G2 (green) in DAG D2 that cover the nodes of the graphs and maximize the sum of the global alignment scores: as(sp(R1),sp(R2))+as(sp(G1),sp(G2)), where sp(P) is the concatenation of labels on the path P. Pairwise alignment of haplotype sequences forming a diploid chromosome can be converted to a two-path coverable labeled DAG, and then the covering alignment models the similarity of two diploids over arbitrary recombinations. We also give a reduction in the other direction, to show that such a recombination-oblivious diploid alignment is NP-hard on alphabets of size 3.

Subject(s)

Genomics/methods, Sequence Alignment/methods, Algorithms, Diploidy, DNA Sequence Analysis/methods

5.

Explaining a Weighted DAG with Few Paths for Solving Genome-Guided Multi-Assembly.

Tomescu, Alexandru I; Gagie, Travis; Popa, Alexandru; Rizzi, Romeo; Kuosmanen, Anna; Mäkinen, Veli.

IEEE/ACM Trans Comput Biol Bioinform ; 12(6): 1345-54, 2015.

Article in English | MEDLINE | ID: mdl-26671806

ABSTRACT

RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. This is achieved by constructing a weighted DAG whose vertices stand for exons, and whose arcs stand for split alignments of the RNA-Seq reads to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the weights of the graph under various fitting functions, such as least sum of squared residuals. In (Tomescu et al. BMC Bioinformatics, 2013) we studied this genome-guided multi-assembly problem when the number of allowed solution paths was linear in the number of arcs. In this paper, we further refine this problem by asking for a bounded number k of solution paths, which is the setting of most practical interest. We formulate this problem in very broad terms, and show that for many choices of the fitting function it becomes NP-hard. Nevertheless, we identify a natural graph parameter of a DAG G, which we call arc-width and denote ⟨G⟩, and give a dynamic programming algorithm running in time $O(W^k \langle G \rangle^k (\langle G \rangle + k)\, n)$, where n is the number of vertices and W is the maximum weight of G. This implies that the problem is fixed-parameter tractable (FPT) in the parameters W, ⟨G⟩, and k. We also show that the arc-width of DAGs constructed from simulated and real RNA-Seq reads is small in practice. Finally, we study the approximability of this problem, and, in particular, give a fully polynomial-time approximation scheme (FPTAS) for the case when the fitting function penalizes the maximum ratio between the weights of the arcs and their predicted coverage.

Subject(s)

Algorithms, Chromosome Mapping/methods, Genome/genetics, RNA/genetics, Sequence Alignment/methods, RNA Sequence Analysis/methods, Base Sequence, Molecular Sequence Data

6.

On the complexity of Minimum Path Cover with Subpath Constraints for multi-assembly.

Rizzi, Romeo; Tomescu, Alexandru I; Mäkinen, Veli.

BMC Bioinformatics ; 15 Suppl 9: S5, 2014.

Article in English | MEDLINE | ID: mdl-25252805

ABSTRACT

BACKGROUND: Multi-assembly problems have gathered much attention in recent years, as Next-Generation Sequencing technologies have started being applied to mixed settings, such as reads from the transcriptome (RNA-Seq), or from viral quasi-species. One classical model that has resurfaced in many multi-assembly methods (e.g. in Cufflinks, ShoRAH, BRANCH, CLASS) is the Minimum Path Cover (MPC) Problem, which asks for the minimum number of directed paths that cover all the nodes of a directed acyclic graph. The MPC Problem is highly popular because the acyclicity of the graph ensures its polynomial-time solvability. RESULTS: In this paper, we consider two generalizations of it dealing with integrating constraints arising from long reads or paired-end reads; these extensions have also been considered by two recent methods, but not fully solved. More specifically, we study the two problems where a set of subpaths, or pairs of subpaths, of the graph must also be entirely covered by some path in the MPC. We show that in the case of long reads (subpaths), the generalized problem can be solved in polynomial time by a reduction to the classical MPC Problem. We also consider the weighted case, and show that it can be solved in polynomial time by a reduction to a min-cost circulation problem. As a side result, we also improve the time complexity of the classical minimum weight MPC Problem. In the case of paired-end reads (pairs of subpaths), the generalized problem becomes NP-hard, but we show that it is fixed-parameter tractable (FPT) in the total number of constraints. This computational dichotomy between long reads and paired-end reads is also a general insight into multi-assembly problems.
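The classical MPC Problem mentioned in this abstract can be illustrated with a small sketch. It assumes that paths in the cover may share nodes, and it uses the textbook reduction (transitive closure followed by bipartite matching), not the algorithms of the paper. The function name `min_path_cover_size` and the integer node encoding are hypothetical choices made for this illustration.

```python
from collections import defaultdict


def min_path_cover_size(n, arcs):
    """Minimum number of paths covering nodes 0..n-1 of a DAG.

    Assumption: paths may share nodes. Textbook reduction, not the
    paper's method: the answer is n minus a maximum matching in the
    bipartite graph built from the transitive closure.
    """
    succ = defaultdict(set)
    for u, v in arcs:
        succ[u].add(v)

    # reach[u] = nodes reachable from u by a non-empty path (input must be acyclic).
    reach = {}

    def closure(u):
        if u not in reach:
            reach[u] = set()
            for v in succ[u]:
                reach[u].add(v)
                reach[u] |= closure(v)
        return reach[u]

    for u in range(n):
        closure(u)

    # Kuhn's augmenting-path algorithm on (left copy u) -> (right copy v in reach[u]).
    match_right = {}

    def try_augment(u, seen):
        for v in reach[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_right or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    matching = sum(try_augment(u, set()) for u in range(n))
    return n - matching


# Arcs 0->1, 4->1, 1->2, 1->3: the paths 0-1-2 and 4-1-3 cover every node.
print(min_path_cover_size(5, [(0, 1), (1, 2), (1, 3), (4, 1)]))
```

The recursion is adequate for small illustrative graphs only; larger inputs would need an iterative closure or a flow-based formulation.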

Subject(s)

High-Throughput Nucleotide Sequencing/methods, RNA Sequence Analysis/methods, Transcriptome, Algorithms, High-Throughput Nucleotide Sequencing/economics, RNA Sequence Analysis/economics

7.

A novel min-cost flow method for estimating transcript expression with RNA-Seq.

Tomescu, Alexandru I; Kuosmanen, Anna; Rizzi, Romeo; Mäkinen, Veli.

BMC Bioinformatics ; 14 Suppl 5: S15, 2013.

Article in English | MEDLINE | ID: mdl-23734627

ABSTRACT

BACKGROUND: Through transcription and alternative splicing, a gene can be transcribed into different RNA sequences (isoforms), depending on the individual, on the tissue the cell is in, or in response to some stimuli. Recent RNA-Seq technology allows for new high-throughput ways for isoform identification and quantification based on short reads, and various methods have been put forward for this non-trivial problem. RESULTS: In this paper we propose a novel, radically different method based on minimum-cost network flows. This has a two-fold advantage: on the one hand, it translates the problem into an established one in the field of network flows, which can be solved in polynomial time, with different existing solvers; on the other hand, it is general enough to encompass many of the previous proposals under the least sum of squares model. Our method works as follows: in order to find the transcripts which best explain, under a given fitness model, a splicing graph resulting from an RNA-Seq experiment, we find a min-cost flow in an offset flow network, under an equivalent cost model. Under very weak assumptions on the fitness model, the optimal flow can be computed in polynomial time. Parsimoniously splitting the flow back into few path transcripts can be done with any of the heuristics and approximations available from the theory of network flows. In the present implementation, we choose the simple strategy of repeatedly removing the heaviest path. CONCLUSIONS: We proposed a new, very general method based on network flows for a multi-assembly problem arising from isoform identification and quantification with RNA-Seq. Experimental results on prediction accuracy show that our method is very competitive with popular tools such as Cufflinks and IsoLasso. Our tool, called Traph (Transcripts in gRAPHs), is available at: http://www.cs.helsinki.fi/gsa/traph/.
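The strategy of repeatedly removing the heaviest path can be sketched as follows. This is only an illustration under stated assumptions: "heaviest" is read here as the source-to-sink path with the largest bottleneck flow, the flow is given as a dictionary from arcs to values, and a topological order of the nodes is supplied by the caller. The names `heaviest_path` and `decompose` are hypothetical and are not taken from Traph.

```python
def heaviest_path(flow, source, sink, order):
    """Return (bottleneck, path) for the source-sink path whose smallest
    arc flow is largest. `order` is a topological order of the nodes."""
    best = {v: 0 for v in order}
    best[source] = float("inf")
    pred = {}
    for u in order:
        for (a, b), f in flow.items():
            if a == u and f > 0:
                cand = min(best[u], f)
                if cand > best[b]:
                    best[b] = cand
                    pred[b] = u
    if best[sink] == 0:
        return 0, []
    path = [sink]
    while path[-1] != source:
        path.append(pred[path[-1]])
    return best[sink], path[::-1]


def decompose(flow, source, sink, order):
    """Greedily split a flow into weighted paths (a heuristic, not optimal)."""
    flow = dict(flow)  # work on a copy so the caller's flow is untouched
    paths = []
    while True:
        weight, path = heaviest_path(flow, source, sink, order)
        if weight == 0:
            break
        for u, v in zip(path, path[1:]):
            flow[(u, v)] -= weight
        paths.append((path, weight))
    return paths


# Toy flow of value 8 on a four-node DAG (assumed example data).
toy_flow = {("s", "a"): 5, ("s", "b"): 3, ("a", "t"): 4, ("a", "b"): 1, ("b", "t"): 4}
for path, weight in decompose(toy_flow, "s", "t", ["s", "a", "b", "t"]):
    print(weight, "-".join(path))
```

Each extracted path plays the role of one candidate transcript, with the removed weight as its estimated expression level.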

Subject(s)

Gene Expression Profiling/methods, RNA Isoforms/metabolism, RNA Sequence Analysis/methods, Algorithms, Alternative Splicing, Humans, Statistical Models, Software

8.

Hierarchical clustering using the arithmetic-harmonic cut: complexity and experiments.

Rizzi, Romeo; Mahata, Pritha; Mathieson, Luke; Moscato, Pablo.

PLoS One ; 5(12): e14067, 2010 Dec 02.

Article in English | MEDLINE | ID: mdl-21151943

ABSTRACT

Clustering, particularly hierarchical clustering, is an important method for understanding and analysing data across a wide variety of knowledge domains with notable utility in systems where the data can be classified in an evolutionary context. This paper introduces a new hierarchical clustering problem defined by a novel objective function we call the arithmetic-harmonic cut. We show that the problem of finding such a cut is NP-hard and APX-hard but is fixed-parameter tractable, which indicates that although the problem is unlikely to have a polynomial time algorithm (even for approximation), exact parameterized and local search based techniques may produce workable algorithms. To this end, we implement a memetic algorithm for the problem and demonstrate the effectiveness of the arithmetic-harmonic cut on a number of datasets including a cancer type dataset and a coronavirus dataset. We show favorable performance compared to currently used hierarchical clustering techniques such as k-Means, Graclus and Normalized-Cut. The arithmetic-harmonic cut metric overcomes difficulties that other hierarchical methods have in representing both intercluster differences and intracluster similarities.

Subject(s)

Computational Biology/methods, Neoplastic Gene Expression Regulation, Algorithms, Tumor Cell Line, Cluster Analysis, Colonic Neoplasms/metabolism, Gene Expression Profiling, Humans, Leukemia/metabolism, Melanoma/metabolism, Statistical Models, Theoretical Models

9.

Pure parsimony xor haplotyping.

Bonizzoni, Paola; Della Vedova, Gianluca; Dondi, Riccardo; Pirola, Yuri; Rizzi, Romeo.

IEEE/ACM Trans Comput Biol Bioinform ; 7(4): 598-610, 2010.

Article in English | MEDLINE | ID: mdl-20498511

ABSTRACT

The haplotype resolution from xor-genotype data has been recently formulated as a new model for genetic studies. The xor-genotype data is a cheaply obtainable type of data distinguishing heterozygous from homozygous sites without identifying the homozygous alleles. In this paper, we propose a formulation based on a well-known model used in haplotype inference: pure parsimony. We exhibit exact solutions of the problem by providing polynomial time algorithms for some restricted cases and a fixed-parameter algorithm for the general case. These results are based on some interesting combinatorial properties of a graph representation of the solutions. Furthermore, we show that the problem has a polynomial time k-approximation, where k is the maximum number of xor-genotypes containing a given single nucleotide polymorphism (SNP). Finally, we propose a heuristic and produce an experimental analysis showing that it scales to real-world large instances taken from the HapMap project.
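To make the pure parsimony objective concrete, the sketch below encodes haplotypes and xor-genotypes as integers whose bits are sites, so that a pair of haplotypes explains a xor-genotype when their bitwise xor equals it. The brute-force search is exponential and is only meant for tiny instances; it is not one of the algorithms of the paper. The function names, the integer encoding, and the choice to let a haplotype pair with itself (explaining an all-homozygous genotype) are assumptions of this illustration.

```python
from itertools import combinations, combinations_with_replacement


def explains(haplotypes, xor_genotypes):
    """True if every xor-genotype is the bitwise xor of two haplotypes
    from the set (assumption: a haplotype may be paired with itself)."""
    pair_xors = {a ^ b for a, b in combinations_with_replacement(haplotypes, 2)}
    return all(x in pair_xors for x in xor_genotypes)


def smallest_explaining_set(xor_genotypes, num_sites):
    """Exhaustive search for a minimum-size haplotype set.

    Tries all sets of size 1, 2, ... over the 2**num_sites possible
    haplotypes, so it is usable only for very small inputs.
    """
    all_haplotypes = range(2 ** num_sites)
    for size in range(1, 2 ** num_sites + 1):
        for candidate in combinations(all_haplotypes, size):
            if explains(candidate, xor_genotypes):
                return candidate
    return tuple(all_haplotypes)  # not reached: the full set explains everything


# Three xor-genotypes over three sites; a bit set to 1 marks a heterozygous site.
genotypes = [0b011, 0b101, 0b110]
solution = smallest_explaining_set(genotypes, 3)
print([format(h, "03b") for h in solution])
```

For this toy input two haplotypes cannot suffice, because a pair yields only one nonzero xor, so any optimal solution has three haplotypes.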

Subject(s)

Computational Biology/methods, Haplotypes, Algorithms, Genotype, Heterozygote, Single Nucleotide Polymorphism

10.

Haplotyping for disease association: a combinatorial approach.

Lancia, Giuseppe; Ravi, R; Rizzi, Romeo.

IEEE/ACM Trans Comput Biol Bioinform ; 5(2): 245-51, 2008.

Article in English | MEDLINE | ID: mdl-18451433

ABSTRACT

We consider a combinatorial problem derived from haplotyping a population with respect to a genetic disease, either recessive or dominant. Given a set of individuals, partitioned into healthy and diseased, and the corresponding sets of genotypes, we want to infer "bad" and "good" haplotypes to account for these genotypes and for the disease. Assume, for example, that the disease is recessive. Then, the resolving haplotypes must consist of bad and good haplotypes, so that (i) each genotype belonging to a diseased individual is explained by a pair of bad haplotypes and (ii) each genotype belonging to a healthy individual is explained by a pair of haplotypes of which at least one is good. We prove that the associated decision problem is NP-complete. However, we also prove that there is a simple solution, provided the data satisfy a very weak requirement.

Subject(s)

Inborn Genetic Diseases/genetics, Haplotypes/genetics, Genetic Models, Computational Biology, Female, Genetic Predisposition to Disease, Genotype, Humans, Male, Mathematics, Single Nucleotide Polymorphism

11.

Comparing genomes with duplications: a computational complexity point of view.

Blin, Guillaume; Chauve, Cedric; Fertin, Guillaume; Rizzi, Romeo; Vialette, Stéphane.

IEEE/ACM Trans Comput Biol Bioinform ; 4(4): 523-34, 2007.

Article in English | MEDLINE | ID: mdl-17975264

ABSTRACT

In this paper, we are interested in the computational complexity of computing (dis)similarity measures between two genomes when they contain duplicated genes or genomic markers, a problem that happens frequently when comparing whole nuclear genomes. Recently, several methods ([1], [2]) have been proposed that are based on two steps to compute a given (dis)similarity measure M between two genomes G_1 and G_2: first, one establishes a one-to-one correspondence between genes of G_1 and genes of G_2; second, once this correspondence is established, it defines explicitly a permutation and it is then possible to quantify their similarity using classical measures defined for permutations, like the number of breakpoints. Hence these methods rely on two elements: a way to establish a one-to-one correspondence between genes of a pair of genomes, and a (dis)similarity measure for permutations. The problem is then, given a (dis)similarity measure for permutations, to compute a correspondence that defines an optimal permutation for this measure. We are interested here in two models to compute a one-to-one correspondence: the exemplar model, where all but one copy are deleted in both genomes for each gene family, and the matching model, that computes a maximal correspondence for each gene family. We show that for these two models, and for three (dis)similarity measures on permutations, namely the number of common intervals, the maximum adjacency disruption (MAD) number and the summed adjacency disruption (SAD) number, the problem of computing an optimal correspondence is NP-complete, and even APX-hard for the MAD number and SAD number.

Subject(s)

Computational Biology/methods, Gene Duplication, Genome, Algorithms, Factual Databases, Gene Deletion, Genetic Markers, Genomics, Genetic Models, Statistical Models, Theoretical Models

12.

The approximability of the String Barcoding problem.

Lancia, Giuseppe; Rizzi, Romeo.

Algorithms Mol Biol ; 1: 12, 2006 Aug 08.

Article in English | MEDLINE | ID: mdl-16895600

ABSTRACT

The String Barcoding (SBC) problem, introduced by Rash and Gusfield (RECOMB, 2002), consists in finding a minimum set of substrings that can be used to distinguish between all members of a set of given strings. In a computational biology context, the given strings represent a set of known viruses, while the substrings can be used as probes for a hybridization experiment via microarray. Eventually, one aims at the classification of new strings (unknown viruses) through the result of the hybridization experiment. In this paper we show that SBC is as hard to approximate as Set Cover. Furthermore, we show that the constrained version of SBC (with probes of bounded length) is also hard to approximate. These negative results are tight.
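Given the connection to Set Cover stated in this abstract, a natural baseline is the standard greedy Set Cover heuristic applied to pairs of strings. The sketch below is such a baseline under stated assumptions, not a method from the paper: a probe distinguishes two strings when it occurs in exactly one of them, and the function name `greedy_barcode` and the optional `max_len` bound on probe length are hypothetical.

```python
from itertools import combinations


def greedy_barcode(strings, max_len=None):
    """Greedy Set Cover style heuristic for String Barcoding.

    Repeatedly picks the substring that separates the most still-confused
    pairs of strings. The result is a valid probe set, not necessarily a
    minimum one.
    """
    # Candidate probes: every substring of every input (optionally length-bounded).
    candidates = sorted({
        s[i:j]
        for s in strings
        for i in range(len(s))
        for j in range(i + 1, len(s) + 1)
        if max_len is None or j - i <= max_len
    })
    undistinguished = set(combinations(range(len(strings)), 2))
    chosen = []
    while undistinguished:
        def gain(probe):
            return sum((probe in strings[a]) != (probe in strings[b])
                       for a, b in undistinguished)

        best = max(candidates, key=gain)
        if gain(best) == 0:
            raise ValueError("some strings cannot be told apart with these probes")
        chosen.append(best)
        undistinguished = {(a, b) for a, b in undistinguished
                           if (best in strings[a]) == (best in strings[b])}
    return chosen


viruses = ["acgt", "acga", "ccgt", "tttt"]  # toy sequences, assumed for illustration
probes = greedy_barcode(viruses)
# The barcode of a string is its presence/absence vector over the chosen probes.
print({v: [int(p in v) for p in probes] for v in viruses})
```

Distinct input strings then receive distinct barcodes, which is the property a hybridization experiment would rely on.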

13.

More reliable protein NMR peak assignment via improved 2-interval scheduling.

Chen, Zhi-Zhong; Lin, Guohui; Rizzi, Romeo; Wen, Jianjun; Xu, Dong; Xu, Ying; Jiang, Tao.

J Comput Biol ; 12(2): 129-46, 2005 Mar.

Article in English | MEDLINE | ID: mdl-15767773

ABSTRACT

Protein NMR peak assignment refers to the process of assigning a group of "spin systems" obtained experimentally to a protein sequence of amino acids. The automation of this process is still an unsolved and challenging problem in NMR protein structure determination. Recently, protein NMR peak assignment has been formulated as an interval scheduling problem (ISP), where a protein sequence P of amino acids is viewed as a discrete time interval I (the amino acids on P correspond one-to-one to the time units of I), each subset S of spin systems that are known to originate from consecutive amino acids from P is viewed as a "job" j_S, the preference of assigning S to a subsequence P' of consecutive amino acids on P is viewed as the profit of executing job j_S in the subinterval of I corresponding to P', and the goal is to maximize the total profit of executing the jobs (on a single machine) during I. The interval scheduling problem is MAX SNP-hard in general; but in the real practice of protein NMR peak assignment, each job j_S usually requires at most 10 consecutive time units, and typically the jobs that require one or two consecutive time units are the most difficult to assign/schedule. In order to solve these most difficult assignments, we present an efficient 13/7-approximation algorithm for the special case of the interval scheduling problem where each job takes one or two consecutive time units. Combining this algorithm with a greedy filtering strategy for handling long jobs (i.e., jobs that need more than two consecutive time units), we obtain a new efficient heuristic for protein NMR peak assignment. Our experimental study shows that the new heuristic produces the best peak assignment in most of the cases, compared with the NMR peak assignment algorithms in the recent literature. The above algorithm is also the first approximation algorithm for a nontrivial case of the well-known interval scheduling problem that breaks the ratio 2 barrier.

Subject(s)

Computational Biology/statistics & numerical data, Magnetic Resonance Spectroscopy/statistics & numerical data, Proteins/chemistry, Algorithms, Animals, Humans
