RESUMO
The most common way to calculate the rearrangement distance between two genomes is to use the size of a minimum length sequence of rearrangements that transforms one of the two given genomes into the other, where the genomes are represented as permutations using only their gene order, based on the assumption that genomes have the same gene content. With the advance of research in genome rearrangements, new works extended the classical models by either considering genomes with different gene content (unbalanced genomes) or including more genomic characteristics to the mathematical representation of the genomes, such as the distribution of intergenic regions sizes. In this study, we study the Reversal, Transposition, and Indel (Insertion and Deletion) Distance using intergenic information, which allows comparing unbalanced genomes, because indels are included in the rearrangement model (i.e., the set of possible rearrangements allowed when we compute the distance). For the particular case of transpositions and indels on unbalanced genomes, we present a 4-approximation algorithm, improving a previous 4.5 approximation. This algorithm is extended so as to deal with gene orientation and to maintain the 4-approximation factor for the Reversal, Transposition, and Indel Distance on unbalanced genomes. Furthermore, we evaluate the proposed algorithms using experiments on simulated data.
Assuntos
Rearranjo Gênico , Modelos Genéticos , Genoma/genética , Genômica , Mutação INDEL , AlgoritmosRESUMO
Genome rearrangement events are widely used to estimate a minimum-size sequence of mutations capable of transforming a genome into another. The length of this sequence is called distance, and determining it is the main goal in genome rearrangement distance problems. Problems in the genome rearrangement field differ regarding the set of rearrangement events allowed and the genome representation. In this work, we consider the scenario where the genomes share the same set of genes, gene orientation is known or unknown, and intergenic regions (structures between a pair of genes and at the extremities of the genome) are taken into account. We use two models, the first model allows only conservative events (reversals and moves), and the second model includes non-conservative events (insertions and deletions) in the intergenic regions. We show that both models result in NP-hard problems no matter if gene orientation is known or unknown. When the information regarding the orientation of genes is available, we present for both models an approximation algorithm with a factor of 2. For the scenario where this information is unavailable, we propose a 4-approximation algorithm for both models.
Assuntos
Rearranjo Gênico , Modelos Genéticos , DNA Intergênico/genética , Genoma , Mutação , AlgoritmosRESUMO
Sports sciences are increasingly data-intensive nowadays since computational tools can extract information from large amounts of data and derive insights from athlete performances during the competition. This paper addresses a performance prediction problem in soccer, a popular collective sport modality played by two teams competing against each other in the same field. In a soccer game, teams score points by placing the ball into the opponent's goal and the winner is the team with the highest count of goals. Retaining possession of the ball is one key to success, but it is not enough since a team needs to score to achieve victory, which requires an offensive toward the opponent's goal. The focus of this work is to determine if analyzing the first five seconds after the control of the ball is taken by one of the teams provides enough information to determine whether the ball will reach the final quarter of the soccer field, therefore creating a goal-scoring chance. By doing so, we can further investigate which conditions increase strategic leverage. Our approach comprises modeling players' interactions as graph structures and extracting metrics from these structures. These metrics, when combined, form time series that we encode in two-dimensional representations of visual rhythms, allowing feature extraction through deep convolutional networks, coupled with a classifier to predict the outcome (whether the final quarter of the field is reached). The results indicate that offensive play near the adversary penalty area can be predicted by looking at the first five seconds. Finally, the explainability of our models reveals the main metrics along with its contributions for the final inference result, which corroborates other studies found in the literature for soccer match analysis.
Assuntos
Desempenho Atlético , Futebol , Humanos , Logro , Fatores de TempoRESUMO
In the comparative genomics field, one way to infer the evolutionary distance between two organisms of related species is by finding the minimum number of large-scale mutations, called genome rearrangements, that transform one genome into the other. This number is referred to as the rearrangement distance. Since problems in this area emerged in the mid-1990s, several genome rearrangements have been proposed. Rearrangements that do not alter the genome content are called conservative, and in this group we have the following: the reversal, which inverts a segment of the genome; the transposition, which exchanges two consecutive segments; and the double cut and join, which cuts two different pairs of adjacent blocks and joins them differently. Seminal works compared genomes sharing the same set of conserved blocks, but nowadays, researchers started looking at genomes with unequal gene content, by allowing the use of nonconservative rearrangements such as insertion and deletion (jointly called indel). The transposition distance and the transposition and indel distance are both NP-hard. We investigate the transposition and indel distance and present a structure called labeled cycle graph, representing an instance of rearrangement distance problems for genomes with unequal gene content. This structure is used to devise a lower bound and a 2-approximation algorithm for the transposition and indel distance.
Assuntos
Genoma , Mutação INDEL , Algoritmos , Rearranjo Gênico , Genômica , Modelos GenéticosRESUMO
BACKGROUND: In the comparative genomics field, one of the goals is to estimate a sequence of genetic changes capable of transforming a genome into another. Genome rearrangement events are mutations that can alter the genetic content or the arrangement of elements from the genome. Reversal and transposition are two of the most studied genome rearrangement events. A reversal inverts a segment of a genome while a transposition swaps two consecutive segments. Initial studies in the area considered only the order of the genes. Recent works have incorporated other genetic information in the model. In particular, the information regarding the size of intergenic regions, which are structures between each pair of genes and in the extremities of a linear genome. RESULTS AND CONCLUSIONS: In this work, we investigate the SORTING BY INTERGENIC REVERSALS AND TRANSPOSITIONS problem on genomes sharing the same set of genes, considering the cases where the orientation of genes is known and unknown. Besides, we explored a variant of the problem, which generalizes the transposition event. As a result, we present an approximation algorithm that guarantees an approximation factor of 4 for both cases considering the reversal and transposition (classic definition) events, an improvement from the 4.5-approximation previously known for the scenario where the orientation of the genes is unknown. We also present a 3-approximation algorithm by incorporating the generalized transposition event, and we propose a greedy strategy to improve the performance of the algorithms. We performed practical tests adopting simulated data which indicated that the algorithms, in both cases, tend to perform better when compared with the best-known algorithms for the problem. Lastly, we conducted experiments using real genomes to demonstrate the applicability of the algorithms.
RESUMO
Problems in the genome rearrangement field are often formulated in terms of pairwise genome comparison: given two genomes [Formula: see text] and [Formula: see text], find the minimum number of genome rearrangements that may have occurred during the evolutionary process. This broad definition lacks at least two important considerations: the first being which features are extracted from genomes to create a useful mathematical model, and the second being which types of genome rearrangement events should be represented. Regarding the first consideration, seminal works in the genome rearrangement field solely used gene order to represent genomes as permutations of integer numbers, neglecting many important aspects like gene duplication, intergenic regions, and complex interactions between genes. Regarding the second consideration, some rearrangement events are widely studied such as reversals and transpositions. In this paper, we shed light on the first consideration and created a model that takes into account gene order and the number of nucleotides in intergenic regions. In addition, we consider events of reversals, transpositions, and indels (insertions and deletions) of genomic material. We present a 4-approximation algorithm for reversals and indels, a [Formula: see text]-approximation algorithm for transpositions and indels, and a 6-approximation for reversals, transpositions, and indels.
Assuntos
Genoma , Modelos Genéticos , Algoritmos , DNA Intergênico/genética , Rearranjo Gênico , GenômicaRESUMO
In the field of comparative genomics, one way of comparing two genomes is through the analysis of how they distinguish themselves based on a set of mutations called rearrangement events. When considering that genomes undergo different types of rearrangements, it can be assumed that some events are more common than others. To model this assumption, one can assign different weights to different events, where common events tend to cost less than others. However, this approach, called weighted, does not guarantee that the rearrangement assumed to be the most frequent will be also the most frequently returned by proposed algorithms. To overcome this issue, we investigate a new problem where we seek the shortest sequence of rearrangement events able to transform one genome into the other, with a restriction regarding the proportion between the events returned. Here, we consider two rearrangement events: reversal, that inverts the order and the orientation of the genes inside a segment of the genome, and transposition, that moves a segment of the genome to another position. We show the complexity of this problem for any desired proportion considering scenarios where the orientation of the genes is known or unknown. We also develop an approximation algorithm with a constant approximation factor for each scenario and, in particular, we describe an improved (asymptotic) approximation algorithm for the case where the gene orientation is known. At last, we present the experimental tests comparing the proposed algorithms with others from the literature for the version of the problem without the proportion restriction.
Assuntos
Genoma , Genômica , Algoritmos , Rearranjo Gênico , Modelos Genéticos , MutaçãoRESUMO
The rearrangement distance is a well-known problem in the field of comparative genomics. Given two genomes, the rearrangement distance is the minimum number of rearrangements in a set of allowed rearrangements (rearrangement model), which transforms one genome into the other. In rearrangement distance problems, a genome is modeled as a string, where each element represents a conserved region within the two genomes. When the orientation of the genes is known, it is represented by (plus or minus) signs assigned to the elements of the string. Two of the most studied rearrangements are reversals, which invert a segment of the genome, and transpositions, which exchange the relative positions of two adjacent segments of the genome. The first works in genome rearrangements considered that the genomes being compared had the same genetic material and that rearrangement events were restricted to reversals, transpositions, or both. El-Mabrouk extended the reversal model on signed strings to include the operations of insertion and deletion of segments in the genome, which allowed the comparison of genomes with different genetic material. Other studies also addressed this problem and, recently, this problem was proved to be solvable in polynomial time by Willing et al. For unsigned strings, we still observe a lack of results. That said, in this study we prove that computing the rearrangement distance for the following models is NP-Hard: reversals and indels on unsigned strings; transpositions and indels on unsigned strings; and reversals, transpositions, and indels on signed and unsigned strings. Along with the NP-hardness proofs, we present a 2-approximation algorithm for reversals on unsigned strings and 3-approximation algorithms for the other models.
Assuntos
Rearranjo Gênico/genética , Genoma/genética , Mutação INDEL/genética , Algoritmos , Genômica/métodos , Modelos GenéticosRESUMO
During the evolutionary process, genomes are affected by various genome rearrangements, that is, events that modify large stretches of the genetic material. In the literature, a large number of models have been proposed to estimate the number of events that occurred during evolution; most of them represent a genome as an ordered sequence of genes, and, in particular, disregard the genetic material between consecutive genes. However, recent studies showed that taking into account the genetic material between consecutive genes can enhance evolutionary distance estimations. Reversal and transposition are genome rearrangements that have been widely studied in the literature. A reversal inverts a (contiguous) segment of the genome, while a transposition swaps the positions of two consecutive segments. Genomes also undergo nonconservative events (events that alter the amount of genetic material) such as insertions and deletions, in which genetic material from intergenic regions of the genome is inserted or deleted, respectively. In this article, we study a genome rearrangement model that considers both gene order and sizes of intergenic regions. We investigate the reversal distance, and also the reversal and transposition distance between two genomes in two scenarios: with and without nonconservative events. We show that these problems are NP-hard and we present constant ratio approximation algorithms for all of them. More precisely, we provide a 4-approximation algorithm for the reversal distance, both in the conservative and nonconservative versions. For the reversal and transposition distance, we provide a 4.5-approximation algorithm, both in the conservative and nonconservative versions. We also perform experimental tests to verify the behavior of our algorithms, as well as to compare the practical and theoretical results. We finally extend our study to scenarios in which events have different costs, and we present constant ratio approximation algorithms for each scenario.
RESUMO
BACKGROUND: The evolutionary distance between two genomes can be estimated by computing a minimum length sequence of operations, called genome rearrangements, that transform one genome into another. Usually, a genome is modeled as an ordered sequence of genes, and most of the studies in the genome rearrangement literature consist in shaping biological scenarios into mathematical models. For instance, allowing different genome rearrangements operations at the same time, adding constraints to these rearrangements (e.g., each rearrangement can affect at most a given number of genes), considering that a rearrangement implies a cost depending on its length rather than a unit cost, etc. Most of the works, however, have overlooked some important features inside genomes, such as the presence of sequences of nucleotides between genes, called intergenic regions. RESULTS AND CONCLUSIONS: In this work, we investigate the problem of computing the distance between two genomes, taking into account both gene order and intergenic sizes. The genome rearrangement operations we consider here are constrained types of reversals and transpositions, called super short reversals (SSRs) and super short transpositions (SSTs), which affect up to two (consecutive) genes. We denote by super short operations (SSOs) any SSR or SST. We show 3-approximation algorithms when the orientation of the genes is not considered when we allow SSRs, SSTs, or SSOs, and 5-approximation algorithms when considering the orientation for either SSRs or SSOs. We also show that these algorithms improve their approximation factors when the input permutation has a higher number of inversions, where the approximation factor decreases from 3 to either 2 or 1.5, and from 5 to either 3 or 2.
RESUMO
In comparative genomics, rearrangements are mutations that affect a stretch of DNA sequences. Reversals and transpositions are well-known rearrangements, and each has a vast literature. The reversal and transposition distance, that is, the minimum number of reversals and transpositions needed to transform one genome into another is a relevant evolutionary distance. The problem of computing this distance when genomes are represented by permutations was proposed >20 years ago and received the name of sorting by reversals and transpositions problem. It has been the focus of a number of studies, but the computational complexity has remained open until now. We hereby solve this question and prove that it is NP-hard no matter whether genomes are represented by signed or unsigned permutations. In addition, we prove that a usual generalization of this problem, which assigns weights wρ for reversals and wτ for transpositions, is also NP-hard as long as wτ/wρ ≤ 1.5 for both signed and unsigned permutations.
Assuntos
Sequência de Bases/genética , Biologia Computacional/métodos , Genômica/métodos , Algoritmos , Rearranjo Gênico , Genoma/genética , Mutação/genéticaRESUMO
Genome rearrangements are global mutations that change large stretches of DNA sequence throughout genomes. They are rare but accumulate during the evolutionary process leading to organisms with similar genetic material in different places and orientations within the genome. Sorting by Genome Rearrangements problems seek for minimum-length sequences of rearrangements that transform one genome into the other. These problems accept alternative versions that assign weights for each event, and the goal is to find a minimum-weight sequence. We study the Sorting by Weighted Reversals and Transpositions problem on signed permutations. In this study, we use weight 2 for reversals and 3 for transpositions and consider theoretical and practical aspects in our analysis. We present two algorithms with approximation factors of 5/3 and 3/2. We also developed a generic approximation algorithm to deal with different weights for reversals and transpositions, and we show the approximation factor reached in each scenario.
Assuntos
Rearranjo Gênico/genética , Algoritmos , Genoma/genética , Genômica/métodos , Modelos Genéticos , Mutação/genéticaRESUMO
BACKGROUND: One way to estimate the evolutionary distance between two given genomes is to determine the minimum number of large-scale mutations, or genome rearrangements, that are necessary to transform one into the other. In this context, genomes can be represented as ordered sequences of genes, each gene being represented by a signed integer. If no gene is repeated, genomes are thus modeled as signed permutations of the form π=(π1π2 πn) , and in that case we can consider without loss of generality that one of them is the identity permutation ιn=(12 n) , and that we just need to sort the other (i.e., transform it into ιn ). The most studied genome rearrangement events are reversals, where a segment of the genome is reversed and reincorporated at the same location; and transpositions, where two consecutive segments are exchanged. Many variants, e.g., combining different types of (possibly constrained) rearrangements, have been proposed in the literature. One of them considers that the number of genes involved, in a reversal or a transposition, is never greater than two, which is known as the problem of sorting by super short operations (or SSOs). RESULTS AND CONCLUSIONS: All problems considering SSOs in permutations have been shown to be in P , except for one, namely sorting signed circular permutations by super short reversals and super short transpositions. Here we fill this gap by introducing a new graph structure called cyclic permutation graph and providing a series of intermediate results, which allows us to design a polynomial algorithm for sorting signed circular permutations by super short reversals and super short transpositions.
RESUMO
In this paper, we present a general heuristic for several problems in the genome rearrangement field. Our heuristic does not solve any problem directly, it is rather used to improve the solutions provided by any non-optimal algorithm that solve them. Therefore, we have implemented several algorithms described in the literature and several algorithms developed by ourselves. As a whole, we implemented 23 algorithms for 9 well known problems in the genome rearrangement field. A total of 13 algorithms were implemented for problems that use the notions of prefix and suffix operations. In addition, we worked on 5 algorithms for the classic problem of sorting by transposition and we conclude the experiments by presenting results for 3 approximation algorithms for the sorting by reversals and transpositions problem and 2 approximation algorithms for the sorting by reversals problem. Another algorithm with better approximation ratio can be found for the last genome rearrangement problem, but it is purely theoretical with no practical implementation. The algorithms we implemented in addition to our heuristic lead to the best practical results in each case. In particular, we were able to improve results on the sorting by transpositions problem, which is a very special case because many efforts have been made to generate algorithms with good results in practice and some of these algorithms provide results that equal the optimum solutions in many cases. Our source codes and benchmarks are freely available upon request from the authors so that it will be easier to compare new approaches against our results.
Assuntos
Algoritmos , Rearranjo Gênico , Modelos Genéticos , Biologia Computacional , Elementos de DNA Transponíveis/genética , Bases de Dados Genéticas , Mutação , Inversão de SequênciaRESUMO
Transpositions are large-scale mutational events that occur when a block of genes moves from a region of a chromosome to another region within the same chromosome. The transposition distance problem is the minimum number of transpositions required to transform one genome into another. Recently, Bulteau et al. [Bulteau L, Fertin G, Rusu U, Automata, Languages and Programming, Vol. 6755 of Lecture Notes in Computer Science, pp. 654-665, Springer Berlin, Heidelberg, 2011] proved that finding the transposition distance is a NP-Hard problem. Some approximation algorithm for this problem have been presented to date [Bafna V, Pevzner PA, SIAM J Discr Math11(2):224-240, 1998; Elias I, Hartman T, IEEE/ACM Trans Comput Biol Bioinform3(4):369-379, 2006; Mira CVG, Dias Z, Santos HP, Pinto GA, Walter ME, Proc 3rd Brazilian Symp Bioinformatics (BSB'2008), pp. 115-126, Santo André, Brazil, 2008; Walter MEMT, Dias Z, Meidanis J, Proc String Processing and Information Retrieval (SPIRE'2000), pp. 199-208, Coruña, Spain, 2000]. Here we focus on developing heuristics to provide an improved approximated solution. Our approach outperforms other algorithms on small sized permutations. We also show that our algorithm keeps the good performance on longer permutations.
Assuntos
Algoritmos , Elementos de DNA Transponíveis/genética , Modelos Genéticos , Biologia Computacional , Rearranjo Gênico , Genômica/estatística & dados numéricos , Modelos EstatísticosRESUMO
BACKGROUND: Decreasing costs of DNA sequencing have made prokaryotic draft genome sequences increasingly common. A contig scaffold is an ordering of contigs in the correct orientation. A scaffold can help genome comparisons and guide gap closure efforts. One popular technique for obtaining contig scaffolds is to map contigs onto a reference genome. However, rearrangements that may exist between the query and reference genomes may result in incorrect scaffolds, if these rearrangements are not taken into account. Large-scale inversions are common rearrangement events in prokaryotic genomes. Even in draft genomes it is possible to detect the presence of inversions given sufficient sequencing coverage and a sufficiently close reference genome. RESULTS: We present a linear-time algorithm that can generate a set of contig scaffolds for a draft genome sequence represented in contigs given a reference genome. The algorithm is aimed at prokaryotic genomes and relies on the presence of matching sequence patterns between the query and reference genomes that can be interpreted as the result of large-scale inversions; we call these patterns inversion signatures. Our algorithm is capable of correctly generating a scaffold if at least one member of every inversion signature pair is present in contigs and no inversion signatures have been overwritten in evolution. The algorithm is also capable of generating scaffolds in the presence of any kind of inversion, even though in this general case there is no guarantee that all scaffolds in the scaffold set will be correct. We compare the performance of sis, the program that implements the algorithm, to seven other scaffold-generating programs. The results of our tests show that sis has overall better performance. CONCLUSIONS: sis is a new easy-to-use tool to generate contig scaffolds, available both as stand-alone and as a web server. The good performance of sis in our tests adds evidence that large-scale inversions are widespread in prokaryotic genomes.