ABSTRACT
The transposition distance problem is a classical problem in genome rearrangements: it asks for the minimum number of transpositions needed to transform one linear chromosome into another, where the two chromosomes are represented by the permutations π and σ, respectively. This article focuses on the equivalent problem of sorting by transpositions (SBT), where σ is the identity permutation ι. Specifically, we investigate palisades, a family of permutations that are "hard" to sort, as they require many transpositions beyond the celebrated lower bound devised by Bafna and Pevzner. By determining the transposition distance of palisades, we were able to provide the exact transposition diameter for 3-permutations (TD3), a special subset of the symmetric group S_n that is essential for the study of approximate solutions for SBT using the simplification technique. The exact value of TD3 has remained unknown since Elias and Hartman showed an upper bound for it. Another consequence of determining the transposition distance of palisades is that, when the Bafna and Pevzner lower bound is used, it is impossible to guarantee approximation ratios lower than 1.375 for SBT. This finding has significant implications for the study of SBT, which has been the subject of intense research efforts for the past 25 years.
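For readers new to SBT, the following minimal sketch illustrates what a transposition does and what the transposition distance measures. It is not the method of the article: it is a plain breadth-first search, usable only for very small permutations, and the function names `transpositions` and `transposition_distance` are hypothetical.

```python
from collections import deque
from itertools import combinations


def transpositions(perm):
    """Yield every permutation reachable from perm by one transposition.

    A transposition picks cut points i < j < k and swaps the two
    adjacent blocks perm[i:j] and perm[j:k].
    """
    n = len(perm)
    for i, j, k in combinations(range(n + 1), 3):
        yield perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def transposition_distance(perm):
    """Exact distance to the identity by BFS (exponential; small n only)."""
    start = tuple(perm)
    target = tuple(sorted(start))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            return dist[current]
        for nxt in transpositions(current):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    raise ValueError("identity not reachable")


print(transposition_distance([2, 3, 1]))  # 1
print(transposition_distance([3, 2, 1]))  # 2
```

Because the search space grows factorially, such a search only serves to check small cases by hand; it says nothing about the bounds discussed in the abstract.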
Subjects
Algorithms, Genome, Gene Rearrangement, Genetic Models
ABSTRACT
The most common way to calculate the rearrangement distance between two genomes is to use the length of a shortest sequence of rearrangements that transforms one of the two given genomes into the other, where the genomes are represented as permutations using only their gene order, under the assumption that both genomes have the same gene content. With the advance of research in genome rearrangements, new works extended the classical models either by considering genomes with different gene content (unbalanced genomes) or by adding more genomic characteristics to the mathematical representation of the genomes, such as the distribution of intergenic region sizes. In this work, we study the Reversal, Transposition, and Indel (Insertion and Deletion) Distance using intergenic information, which allows comparing unbalanced genomes because indels are included in the rearrangement model (i.e., the set of rearrangements allowed when computing the distance). For the particular case of transpositions and indels on unbalanced genomes, we present a 4-approximation algorithm, improving on a previous 4.5-approximation. This algorithm is then extended to deal with gene orientation while maintaining the 4-approximation factor for the Reversal, Transposition, and Indel Distance on unbalanced genomes. Furthermore, we evaluate the proposed algorithms using experiments on simulated data.
Subjects
Gene Rearrangement, Genetic Models, Genome/genetics, Genomics, INDEL Mutation, Algorithms
ABSTRACT
BACKGROUND: SORTING BY TRANSPOSITIONS (SBT) is a classical problem in genome rearrangements. In 2012, SBT was proven to be [Formula: see text]-hard, and the best approximation algorithm, with a 1.375 ratio, was proposed in 2006 by Elias and Hartman (EH algorithm). Their algorithm employs simplification, a technique used to transform an input permutation [Formula: see text] into a simple permutation [Formula: see text], presumably easier to handle. The permutation [Formula: see text] is obtained by inserting new symbols into [Formula: see text] in such a way that the lower bound of the transposition distance of [Formula: see text] is preserved for [Formula: see text]. The simplification is guaranteed to preserve the lower bound, not the transposition distance. A sequence of operations sorting [Formula: see text] can be mimicked to sort [Formula: see text]. RESULTS AND CONCLUSIONS: First, using an algebraic approach, we propose a new upper bound for the transposition distance, which holds for all [Formula: see text]. Next, motivated by a problem identified in the EH algorithm, which, depending on how the input permutation is simplified, causes it to require one extra transposition above the 1.375-approximation ratio, we propose a new approximation algorithm for SBT that ensures the 1.375-approximation ratio for all [Formula: see text]. We implemented our algorithm and EH's. In implementing the EH algorithm, two other issues were identified and had to be fixed. We tested both algorithms against all permutations of size n, [Formula: see text]. The results show that the EH algorithm exceeds the approximation ratio of 1.375 for permutations with a size greater than 7. The percentages of computed distances that equal the transposition distance, as obtained by the implemented algorithms, are also compared with those of other algorithms available in the literature. Finally, we investigate the performance of both implementations on longer permutations of maximum length 500.
From the experiments, we conclude that the maximum and average distances computed by our algorithm are slightly better than those computed by the EH algorithm, and that the running times of both algorithms are similar, despite the higher time complexity of our algorithm.
ABSTRACT
In the comparative genomics field, one way to infer the evolutionary distance between two organisms of related species is by finding the minimum number of large-scale mutations, called genome rearrangements, that transform one genome into the other. This number is referred to as the rearrangement distance. Since problems in this area emerged in the mid-1990s, several genome rearrangements have been proposed. Rearrangements that do not alter the genome content are called conservative, and this group includes the following: the reversal, which inverts a segment of the genome; the transposition, which exchanges two consecutive segments; and the double cut and join, which cuts two different pairs of adjacent blocks and joins them differently. Seminal works compared genomes sharing the same set of conserved blocks, but more recently researchers have started looking at genomes with unequal gene content by allowing nonconservative rearrangements such as insertion and deletion (jointly called indel). Computing the transposition distance and computing the transposition and indel distance are both NP-hard problems. We investigate the transposition and indel distance and present a structure called the labeled cycle graph, which represents an instance of rearrangement distance problems for genomes with unequal gene content. This structure is used to devise a lower bound and a 2-approximation algorithm for the transposition and indel distance.
Subjects
Genome, INDEL Mutation, Algorithms, Gene Rearrangement, Genomics, Genetic Models
ABSTRACT
Problems in the genome rearrangement field are often formulated in terms of pairwise genome comparison: given two genomes [Formula: see text] and [Formula: see text], find the minimum number of genome rearrangements that may have occurred during the evolutionary process. This broad definition leaves open at least two important considerations: first, which features are extracted from genomes to create a useful mathematical model, and second, which types of genome rearrangement events should be represented. Regarding the first consideration, seminal works in the genome rearrangement field used only gene order to represent genomes as permutations of integers, neglecting many important aspects such as gene duplication, intergenic regions, and complex interactions between genes. Regarding the second consideration, some rearrangement events, such as reversals and transpositions, are widely studied. In this paper, we address the first consideration and create a model that takes into account gene order and the number of nucleotides in intergenic regions. In addition, we consider reversals, transpositions, and indels (insertions and deletions) of genomic material. We present a 4-approximation algorithm for reversals and indels, a [Formula: see text]-approximation algorithm for transpositions and indels, and a 6-approximation for reversals, transpositions, and indels.
Subjects
Genome, Genetic Models, Algorithms, Intergenic DNA/genetics, Gene Rearrangement, Genomics
ABSTRACT
The rearrangement distance is a well-known problem in the field of comparative genomics. Given two genomes, the rearrangement distance is the minimum number of rearrangements, drawn from a set of allowed rearrangements (the rearrangement model), that transforms one genome into the other. In rearrangement distance problems, a genome is modeled as a string, where each element represents a conserved region within the two genomes. When the orientation of the genes is known, it is represented by (plus or minus) signs assigned to the elements of the string. Two of the most studied rearrangements are reversals, which invert a segment of the genome, and transpositions, which exchange the relative positions of two adjacent segments of the genome. The first works in genome rearrangements assumed that the genomes being compared had the same genetic material and that rearrangement events were restricted to reversals, transpositions, or both. El-Mabrouk extended the reversal model on signed strings to include the operations of insertion and deletion of segments in the genome, which allowed the comparison of genomes with different genetic material. Other studies also addressed this problem and, recently, it was proved to be solvable in polynomial time by Willing et al. For unsigned strings, results are still lacking. In this study we prove that computing the rearrangement distance for the following models is NP-hard: reversals and indels on unsigned strings; transpositions and indels on unsigned strings; and reversals, transpositions, and indels on signed and unsigned strings. Along with the NP-hardness proofs, we present a 2-approximation algorithm for reversals on unsigned strings and 3-approximation algorithms for the other models.
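The four operations named in this abstract can be illustrated with a small sketch. It assumes a genome is a Python list of nonzero integers (the sign encodes orientation) and uses half-open index ranges; the function names and the indexing convention are hypothetical choices for illustration, not the notation of the paper.

```python
def reversal(genome, i, j, signed=True):
    """Invert the segment genome[i:j]; on signed strings the signs flip too."""
    segment = list(reversed(genome[i:j]))
    if signed:
        segment = [-x for x in segment]
    return genome[:i] + segment + genome[j:]


def transposition(genome, i, j, k):
    """Exchange the adjacent segments genome[i:j] and genome[j:k]."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def insertion(genome, i, segment):
    """Insert a segment of new material before position i."""
    return genome[:i] + list(segment) + genome[i:]


def deletion(genome, i, j):
    """Delete the segment genome[i:j]."""
    return genome[:i] + genome[j:]


g = [1, -3, -2, 5]
g = reversal(g, 1, 3)      # [1, 2, 3, 5]
g = insertion(g, 3, [4])   # [1, 2, 3, 4, 5]
print(g)
```

Each function returns a new list, so a candidate sequence of operations can be replayed step by step and its length counted.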
Subjects
Gene Rearrangement/genetics, Genome/genetics, INDEL Mutation/genetics, Algorithms, Genomics/methods, Genetic Models
ABSTRACT
One of the main problems in Computational Biology is to find the evolutionary distance among species. In most approaches, such distance only involves rearrangements, which are mutations that alter large pieces of the species' genome. When we represent genomes as permutations, the problem of transforming one genome into another is equivalent to the problem of Sorting Permutations by Rearrangement Operations. The traditional approach assumes that all rearrangements are equally likely to occur, so the goal is to find a minimum-length sequence of operations that sorts the permutation. However, studies have shown that some rearrangements are more likely to happen than others, so a weighted approach is more realistic. In a weighted approach, the goal is to find a sequence that sorts the permutation with minimum total cost. This work introduces a new type of cost function, which is related to the amount of fragmentation caused by a rearrangement. We present some results about the lower and upper bounds for the fragmentation-weighted problems and the relation between the unweighted and the fragmentation-weighted approach. Our main results are 2-approximation algorithms for five versions of this problem involving reversals and transpositions. We also give bounds for the diameters concerning these problems and provide an improved approximation factor for simple permutations considering transpositions.
Subjects
Algorithms, Computational Biology/methods, Genome, Genomics/methods, Gene Rearrangement, Mutation, Probability
ABSTRACT
In comparative genomics, rearrangements are mutations that affect a stretch of DNA sequences. Reversals and transpositions are well-known rearrangements, and each has a vast literature. The reversal and transposition distance, that is, the minimum number of reversals and transpositions needed to transform one genome into another, is a relevant evolutionary distance. The problem of computing this distance when genomes are represented by permutations was proposed more than 20 years ago and received the name of sorting by reversals and transpositions problem. It has been the focus of a number of studies, but its computational complexity has remained open until now. We settle this question and prove that the problem is NP-hard regardless of whether genomes are represented by signed or unsigned permutations. In addition, we prove that a usual generalization of this problem, which assigns weights wρ for reversals and wτ for transpositions, is also NP-hard as long as wτ/wρ ≤ 1.5 for both signed and unsigned permutations.
Subjects
Base Sequence/genetics, Computational Biology/methods, Genomics/methods, Algorithms, Gene Rearrangement, Genome/genetics, Mutation/genetics
ABSTRACT
Genome rearrangements are global mutations that change large stretches of DNA sequence throughout genomes. They are rare but accumulate during the evolutionary process, leading to organisms with similar genetic material in different places and orientations within the genome. Sorting by Genome Rearrangements problems seek minimum-length sequences of rearrangements that transform one genome into the other. These problems admit weighted variants that assign a weight to each event, where the goal is to find a minimum-weight sequence. We study the Sorting by Weighted Reversals and Transpositions problem on signed permutations. In this study, we use weight 2 for reversals and 3 for transpositions and consider theoretical and practical aspects in our analysis. We present two algorithms with approximation factors of 5/3 and 3/2. We also developed a generic approximation algorithm to deal with different weights for reversals and transpositions, and we show the approximation factor reached in each scenario.
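To make the weighted objective concrete, the sketch below computes the exact minimum sorting weight of a tiny signed permutation with Dijkstra's algorithm, using the weights stated in the abstract (2 per reversal, 3 per transposition). This exhaustive search is an illustration only and is unrelated to the approximation algorithms of the paper; all names (`W_REVERSAL`, `W_TRANSPOSITION`, `neighbors`, `min_sorting_weight`) are hypothetical.

```python
import heapq
from itertools import combinations

W_REVERSAL = 2       # weight of one reversal, as stated in the abstract
W_TRANSPOSITION = 3  # weight of one transposition, as stated in the abstract


def neighbors(perm):
    """Yield (permutation, weight) pairs reachable by a single operation."""
    n = len(perm)
    # Signed reversal: reverse perm[i:j] and flip every sign in it.
    for i, j in combinations(range(n + 1), 2):
        segment = tuple(-x for x in reversed(perm[i:j]))
        yield perm[:i] + segment + perm[j:], W_REVERSAL
    # Transposition: swap the adjacent blocks perm[i:j] and perm[j:k].
    for i, j, k in combinations(range(n + 1), 3):
        yield perm[:i] + perm[j:k] + perm[i:j] + perm[k:], W_TRANSPOSITION


def min_sorting_weight(perm):
    """Minimum total weight to reach (+1, ..., +n); small n only."""
    start = tuple(perm)
    target = tuple(range(1, len(start) + 1))
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        cost, current = heapq.heappop(heap)
        if current == target:
            return cost
        if cost > best[current]:
            continue  # stale heap entry
        for nxt, weight in neighbors(current):
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    raise ValueError("input is not a signed permutation of 1..n")


print(min_sorting_weight([-2, -1]))  # 2: one reversal of the whole permutation
print(min_sorting_weight([2, 1]))    # 3: one transposition beats three reversals
```

The second example shows why the weights matter: sorting (+2, +1) by reversals alone needs three of them (weight 6), so a single transposition of weight 3 is cheaper.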
Subjects
Gene Rearrangement/genetics, Algorithms, Genome/genetics, Genomics/methods, Genetic Models, Mutation/genetics
ABSTRACT
Some interesting combinatorial problems have been motivated by genome rearrangements, which are mutations that affect large portions of a genome. When we represent genomes as permutations, the goal is to transform a given permutation into the identity permutation with the minimum number of rearrangements. Rearrangements that affect segments from the beginning (respectively end) of the permutation are called prefix (respectively suffix) rearrangements. This paper presents results for rearrangement problems that involve prefix and suffix versions of reversals and transpositions, considering unsigned and signed permutations. We give 2-approximation and ([Formula: see text])-approximation algorithms for these problems, where [Formula: see text] is a constant divided by the number of breakpoints (pairs of consecutive elements that should not be consecutive in the identity permutation) in the input permutation. We also give bounds for the diameters concerning these problems and provide ways of improving the practical results of our algorithms.
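The notions of breakpoint and prefix rearrangement can be sketched as follows for unsigned permutations. The sketch assumes the permutation is extended with 0 on the left and n + 1 on the right before counting; conventions vary between models (prefix problems in particular often treat the left end differently), so this choice and all function names are assumptions made for illustration.

```python
def breakpoints(perm, reversal_model=False):
    """Count breakpoints of an unsigned permutation of 1..n.

    Assumption: the permutation is extended with 0 and n + 1.
    With reversal_model=False a pair (a, b) is an adjacency only if
    b == a + 1 (transposition-style); with reversal_model=True it is an
    adjacency if |a - b| == 1 (reversal-style).
    """
    extended = [0] + list(perm) + [len(perm) + 1]
    count = 0
    for a, b in zip(extended, extended[1:]):
        adjacent = abs(a - b) == 1 if reversal_model else b == a + 1
        if not adjacent:
            count += 1
    return count


def prefix_reversal(perm, j):
    """Reverse the prefix perm[:j]."""
    return perm[:j][::-1] + perm[j:]


def suffix_reversal(perm, i):
    """Reverse the suffix perm[i:]."""
    return perm[:i] + perm[i:][::-1]


def prefix_transposition(perm, j, k):
    """Swap the prefix perm[:j] with the adjacent block perm[j:k]."""
    return perm[j:k] + perm[:j] + perm[k:]


print(breakpoints([3, 2, 1]))                       # 4
print(breakpoints([3, 2, 1], reversal_model=True))  # 2
print(prefix_reversal([3, 2, 1, 4], 3))             # [1, 2, 3, 4]
```

The identity permutation has zero breakpoints under either convention, which is why breakpoint counts are a natural yardstick for how far a permutation is from being sorted.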
Subjects
Algorithms, Computational Biology/methods, Genome, Genetic Models, Mutation
ABSTRACT
Sorting by Transpositions is an NP-hard problem for which several polynomial-time approximation algorithms have been developed. Hartman and Shamir (2006) developed a 1.5-approximation [Formula: see text] algorithm, whose running time was improved to O(n log n) by Feng and Zhu (2007) with a data structure they defined, the permutation tree. Elias and Hartman (2006) developed a 1.375-approximation O(n^2) algorithm, and Firoz et al. (2011) claimed an improvement to the running time, from O(n^2) to O(n log n), by using the permutation tree. We provide counter-examples to the correctness of Firoz et al.'s strategy, showing that it is not possible to reach a component by sufficient extensions using the method proposed by them. In addition, we propose a 1.375-approximation algorithm, modifying Elias and Hartman's approach with the use of permutation trees and achieving O(n log n) time.
Subjects
DNA Sequence Analysis, Algorithms, Computational Biology, Gene Rearrangement, Genetic Models
ABSTRACT
In this paper, we present a general heuristic for several problems in the genome rearrangement field. Our heuristic does not solve any problem directly; rather, it is used to improve the solutions provided by any non-optimal algorithm that solves them. Therefore, we have implemented several algorithms described in the literature and several algorithms developed by ourselves. In total, we implemented 23 algorithms for 9 well-known problems in the genome rearrangement field. A total of 13 algorithms were implemented for problems that use the notions of prefix and suffix operations. In addition, we worked on 5 algorithms for the classic problem of sorting by transpositions, and we conclude the experiments by presenting results for 3 approximation algorithms for the sorting by reversals and transpositions problem and 2 approximation algorithms for the sorting by reversals problem. Another algorithm with a better approximation ratio exists for the last of these problems, but it is purely theoretical, with no practical implementation. The algorithms we implemented, combined with our heuristic, lead to the best practical results in each case. In particular, we were able to improve results on the sorting by transpositions problem, which is a very special case because many efforts have been made to design algorithms with good results in practice, and some of these algorithms provide results that equal the optimum solutions in many cases. Our source code and benchmarks are freely available upon request from the authors, making it easier to compare new approaches against our results.