Results 1 - 20 of 20
1.
PLoS One ; 7(3): e32235, 2012.
Article in English | MEDLINE | ID: mdl-22461885

ABSTRACT

A variety of functionally important protein properties, such as secondary structure, transmembrane topology and solvent accessibility, can be encoded as a labeling of amino acids. Indeed, the prediction of such properties from the primary amino acid sequence is one of the core projects of computational biology. Accordingly, a panoply of approaches have been developed for predicting such properties; however, most such approaches focus on solving a single task at a time. Motivated by recent, successful work in natural language processing, we propose to use multitask learning to train a single, joint model that exploits the dependencies among these various labeling tasks. We describe a deep neural network architecture that, given a protein sequence, outputs a host of predicted local properties, including secondary structure, solvent accessibility, transmembrane topology, signal peptides and DNA-binding residues. The network is trained jointly on all these tasks in a supervised fashion, augmented with a novel form of semi-supervised learning in which the model is trained to distinguish between local patterns from natural and synthetic protein sequences. The task-independent architecture of the network obviates the need for task-specific feature engineering. We demonstrate that, for all of the tasks that we considered, our approach leads to statistically significant improvements in performance, relative to a single task neural network approach, and that the resulting model achieves state-of-the-art performance.


Subject(s)
Computational Biology/methods , Computer Neural Networks , Protein Secondary Structure , Proteins/chemistry , Algorithms , Binding Sites , Membrane Proteins/chemistry , Reproducibility of Results
2.
Mol Cell Proteomics ; 11(2): M111.012161, 2012 Feb.
Article in English | MEDLINE | ID: mdl-22052992

ABSTRACT

The goal of many shotgun proteomics experiments is to determine the protein complement of a complex biological mixture. For many mixtures, most methodological approaches fall significantly short of this goal. Existing solutions to this problem typically subdivide the task into two stages: first identifying a collection of peptides with a low false discovery rate and then inferring from the peptides a corresponding set of proteins. In contrast, we formulate the protein identification problem as a single optimization problem, which we solve using machine learning methods. This approach is motivated by the observation that the peptide and protein level tasks are cooperative, and the solution to each can be improved by using information about the solution to the other. The resulting algorithm directly controls the relevant error rate, can incorporate a wide variety of evidence and, for complex samples, provides 18-34% more protein identifications than the current state of the art approaches.


Subject(s)
Artificial Intelligence , Complex Mixtures/analysis , Statistical Models , Proteins/analysis , Proteomics , Tandem Mass Spectrometry/methods , Algorithms , Amniotic Fluid/chemistry , Amniotic Fluid/metabolism , Caenorhabditis elegans Proteins/metabolism , Protein Databases , Humans , Laryngopharyngeal Reflux , Peptide Fragments/analysis , Saccharomyces cerevisiae Proteins/metabolism , Software
3.
PLoS Comput Biol ; 7(1): e1001047, 2011 Jan 27.
Article in English | MEDLINE | ID: mdl-21298082

ABSTRACT

Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods (i.e., measures of similarity between query and target sequences) provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional "semantic space." Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.


Subject(s)
Biological Evolution , Proteins/genetics , Algorithms , Proteins/chemistry , DNA Sequence Analysis
4.
Bioinformatics ; 26(18): i645-52, 2010 Sep 15.
Article in English | MEDLINE | ID: mdl-20823334

ABSTRACT

MOTIVATION: Protein-protein interactions (PPIs) are critical for virtually every biological function. Recently, researchers have suggested using supervised learning for the task of classifying pairs of proteins as interacting or not. However, the performance of this approach is largely restricted by the availability of truly interacting protein pairs (labeled). Meanwhile, there exist a considerable number of protein pairs for which an association appears between the two partners, but for which there is not enough experimental evidence to support a direct interaction (partially labeled). RESULTS: We propose a semi-supervised multi-task framework for predicting PPIs from not only labeled, but also partially labeled reference sets. The basic idea is to perform multi-task learning on a supervised classification task and a semi-supervised auxiliary task. The supervised classifier trains a multi-layer perceptron network for PPI predictions from labeled examples. The semi-supervised auxiliary task shares network layers of the supervised classifier and trains with partially labeled examples. Semi-supervision could be utilized in multiple ways. We tried three approaches in this article: (i) classification (to distinguish partial positives from negatives); (ii) ranking (to rate partial positives as more likely than negatives); (iii) embedding (to give data clusters similar labels). We applied this framework to improve the identification of interacting pairs between HIV-1 and human proteins. Our method improved upon the state-of-the-art method for this task, indicating the benefits of semi-supervised multi-task learning using auxiliary information. AVAILABILITY: http://www.cs.cmu.edu/~qyj/HIVsemi.


Subject(s)
Artificial Intelligence , Computational Biology/methods , HIV-1/physiology , Human Immunodeficiency Virus Proteins/metabolism , Protein Interaction Mapping/methods , Proteins/metabolism , Algorithms , Statistical Data Interpretation , Humans , Statistical Models
5.
Methods Mol Biol ; 609: 223-39, 2010.
Article in English | MEDLINE | ID: mdl-20221922

ABSTRACT

The Support Vector Machine (SVM) is a widely used classifier in bioinformatics. Obtaining the best results with SVMs requires an understanding of their workings and the various ways a user can influence their accuracy. We provide the user with a basic understanding of the theory behind SVMs and focus on their use in practice. We describe the effect of the SVM parameters on the resulting classifier, how to select good values for those parameters, data normalization, factors that affect training time, and software for training SVMs.


Subject(s)
Artificial Intelligence , Computational Biology , Data Mining , Factual Databases , Algorithms , Linear Models , Statistical Models , Nonlinear Dynamics , Normal Distribution , Software
6.
PLoS One ; 4(7): e6393, 2009 Jul 28.
Article in English | MEDLINE | ID: mdl-19636432

ABSTRACT

To reduce the increasing amount of time spent on literature search in the life sciences, several methods for automated knowledge extraction have been developed. Co-occurrence based approaches can deal with large text corpora like MEDLINE in an acceptable time but are not able to extract any specific type of semantic relation. Semantic relation extraction methods based on syntax trees, on the other hand, are computationally expensive and the interpretation of the generated trees is difficult. Several natural language processing (NLP) approaches for the biomedical domain exist focusing specifically on the detection of a limited set of relation types. For systems biology, generic approaches for the detection of a multitude of relation types which in addition are able to process large text corpora are needed but the number of systems meeting both requirements is very limited. We introduce the use of SENNA ("Semantic Extraction using a Neural Network Architecture"), a fast and accurate neural network based Semantic Role Labeling (SRL) program, for the large scale extraction of semantic relations from the biomedical literature. A comparison of processing times of SENNA and other SRL systems or syntactical parsers used in the biomedical domain revealed that SENNA is the fastest Proposition Bank (PropBank) conforming SRL program currently available. 89 million biomedical sentences were tagged with SENNA on a 100 node cluster within three days. The accuracy of the presented relation extraction approach was evaluated on two test sets of annotated sentences resulting in precision/recall values of 0.71/0.43. We show that the accuracy as well as processing speed of the proposed semantic relation extraction approach is sufficient for its large scale application on biomedical text. The proposed approach is highly generalizable regarding the supported relation types and appears to be especially suited for general-purpose, broad-scale text mining systems. 
The presented approach bridges the gap between fast, co-occurrence-based approaches lacking semantic relations and highly specialized and computationally demanding NLP approaches.


Subject(s)
Abstracting and Indexing , Computer Neural Networks , Algorithms
7.
J Proteome Res ; 8(7): 3737-45, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19385687

ABSTRACT

Shotgun proteomics coupled with database search software allows the identification of a large number of peptides in a single experiment. However, some existing search algorithms, such as SEQUEST, use score functions that are designed primarily to identify the best peptide for a given spectrum. Consequently, when comparing identifications across spectra, the SEQUEST score function Xcorr fails to discriminate accurately between correct and incorrect peptide identifications. Several machine learning methods have been proposed to address the resulting classification task of distinguishing between correct and incorrect peptide-spectrum matches (PSMs). A recent example is Percolator, which uses semisupervised learning and a decoy database search strategy to learn to distinguish between correct and incorrect PSMs identified by a database search algorithm. The current work describes three improvements to Percolator. (1) Percolator's heuristic optimization is replaced with a clear objective function, with intuitive reasons behind its choice. (2) Tractable nonlinear models are used instead of linear models, leading to improved accuracy over the original Percolator. (3) A method, Q-ranker, for directly optimizing the number of identified spectra at a specified q value is proposed, which achieves further gains.
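The q value mentioned in point (3) can be illustrated with a minimal sketch of the usual target-decoy estimate. This is not the Q-ranker optimization itself; the function names `target_decoy_q_values` and `count_accepted_targets` are hypothetical, and the sketch assumes one score per PSM, a simple decoys/targets FDR estimate, and no tied scores.

```python
def target_decoy_q_values(scores, is_decoy):
    """Estimate a q-value for every PSM from target/decoy labels.

    Assumption: FDR at a score threshold is estimated as
    (#decoys accepted) / (#targets accepted). The q-value of a PSM is the
    smallest estimated FDR among all thresholds that still accept it.
    """
    # Rank PSMs from best to worst score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # Running minimum from the worst-scoring PSM upward makes q monotone.
    for k in range(len(fdrs) - 2, -1, -1):
        fdrs[k] = min(fdrs[k], fdrs[k + 1])
    q = [0.0] * len(scores)
    for rank, i in enumerate(order):
        q[i] = min(fdrs[rank], 1.0)
    return q


def count_accepted_targets(scores, is_decoy, q_threshold):
    """Number of target PSMs accepted at the given q-value threshold."""
    q = target_decoy_q_values(scores, is_decoy)
    return sum(1 for qi, d in zip(q, is_decoy) if not d and qi <= q_threshold)
```

The quantity returned by `count_accepted_targets` is the kind of count that the abstract describes optimizing directly, here only measured for a fixed scoring function.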


Subject(s)
Mass Spectrometry/methods , Peptides/chemistry , Proteomics/methods , Algorithms , Animals , Artificial Intelligence , Chymotrypsin/chemistry , Computational Biology/methods , Protein Databases , Statistical Models , Protein Sequence Analysis/methods , Software , Trypsin/chemistry
8.
Bioinformatics ; 25(1): 121-2, 2009 Jan 01.
Article in English | MEDLINE | ID: mdl-18990723

ABSTRACT

We present a large-scale implementation of the Rankprop protein homology ranking algorithm in the form of an openly accessible web server. We use the NRDB40 PSI-BLAST all-versus-all protein similarity network of 1.1 million proteins to construct the graph for the Rankprop algorithm, whereas previously, results were only reported for a database of 108 000 proteins. We also describe two algorithmic improvements to the original algorithm, including propagation from multiple homologs of the query and better normalization of ranking scores, that lead to higher accuracy and to scores with a probabilistic interpretation. AVAILABILITY: The Rankprop web server and source code are available at http://rankprop.gs.washington.edu


Subject(s)
Algorithms , Computational Biology/methods , Internet , Protein Structural Homology , Protein Databases , ROC Curve
9.
BMC Bioinformatics ; 9: 389, 2008 Sep 22.
Article in English | MEDLINE | ID: mdl-18808707

ABSTRACT

BACKGROUND: Predicting a protein's structural or functional class from its amino acid sequence or structure is a fundamental problem in computational biology. Recently, there has been considerable interest in using discriminative learning algorithms, in particular support vector machines (SVMs), for classification of proteins. However, because sufficiently many positive examples are required to train such classifiers, all SVM-based methods are hampered by limited coverage. RESULTS: In this study, we develop a hybrid machine learning approach for classifying proteins, and we apply the method to the problem of assigning proteins to structural categories based on their sequences or their 3D structures. The method combines a full-coverage but lower accuracy nearest neighbor method with higher accuracy but reduced coverage multiclass SVMs to produce a full coverage classifier with overall improved accuracy. The hybrid approach is based on the simple idea of "punting" from one method to another using a learned threshold. CONCLUSION: In cross-validated experiments on the SCOP hierarchy, the hybrid methods consistently outperform the individual component methods at all levels of coverage. Code and data sets are available at http://noble.gs.washington.edu/proj/sabretooth.
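The "punting" idea can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the decision rule, so the sketch assumes the SVM prediction is kept when its best score reaches a threshold and the nearest-neighbor label is used otherwise, with the threshold picked by accuracy on held-out examples. The names `hybrid_predict` and `learn_threshold` are hypothetical.

```python
def hybrid_predict(svm_scores, nn_label, threshold):
    """Return (label, source) for one protein.

    svm_scores: dict mapping class label -> SVM score (assumed comparable).
    nn_label:   label proposed by the full-coverage nearest-neighbor method.
    """
    best_label = max(svm_scores, key=svm_scores.get)
    if svm_scores[best_label] >= threshold:
        return best_label, "svm"
    # SVM is not confident enough: punt to the nearest-neighbor method.
    return nn_label, "nearest_neighbor"


def learn_threshold(examples, candidates):
    """Pick the candidate threshold with the best held-out accuracy.

    examples: list of (svm_scores, nn_label, true_label) triples.
    """
    def accuracy(t):
        hits = sum(hybrid_predict(s, n, t)[0] == y for s, n, y in examples)
        return hits / len(examples)

    return max(candidates, key=accuracy)
```

Because every input receives a label from one of the two methods, the combined classifier keeps full coverage while using the SVM wherever it is confident.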


Subject(s)
Algorithms , Proteins/chemistry , Proteins/ultrastructure , Sequence Alignment/methods , Protein Sequence Analysis/methods , Amino Acid Sequence , Molecular Sequence Data , Proteins/classification , Proteins/metabolism , Structure-Activity Relationship , Systems Integration
10.
Nat Methods ; 4(11): 923-5, 2007 Nov.
Article in English | MEDLINE | ID: mdl-17952086

ABSTRACT

Shotgun proteomics uses liquid chromatography-tandem mass spectrometry to identify proteins in complex biological samples. We describe an algorithm, called Percolator, for improving the rate of confident peptide identifications from a collection of tandem mass spectra. Percolator uses semi-supervised machine learning to discriminate between correct and decoy spectrum identifications, correctly assigning peptides to 17% more spectra from a tryptic Saccharomyces cerevisiae dataset, and up to 77% more spectra from non-tryptic digests, relative to a fully supervised approach.


Subject(s)
Artificial Intelligence , Peptide Fragments/analysis , Proteomics/methods , Tandem Mass Spectrometry/methods , Algorithms , Chymotrypsin/analysis , Chymotrypsin/chemistry , Protein Databases , Pancreatic Elastase/analysis , Pancreatic Elastase/chemistry , Proteome/analysis , Proteome/chemistry , Saccharomyces cerevisiae Proteins/analysis , Saccharomyces cerevisiae Proteins/chemistry , Software , Trypsin/analysis , Trypsin/chemistry
11.
BMC Bioinformatics ; 8 Suppl 4: S2, 2007 May 22.
Article in English | MEDLINE | ID: mdl-17570145

ABSTRACT

BACKGROUND: Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers. However, while some of these approaches exhibit state-of-the-art performance at the binary protein classification problem, i.e. discriminating between a particular protein class and all other classes, few of these studies have addressed the real problem of multi-class superfamily or fold recognition. Moreover, there are only limited software tools and systems for SVM-based protein classification available to the bioinformatics community. RESULTS: We present a new multi-class SVM-based protein fold and superfamily recognition system and web server called SVM-Fold, which can be found at http://svm-fold.c2b2.columbia.edu. Our system uses an efficient implementation of a state-of-the-art string kernel for sequence profiles, called the profile kernel, where the underlying feature representation is a histogram of inexact matching k-mer frequencies. We also employ a novel machine learning approach to solve the difficult multi-class problem of classifying a sequence of amino acids into one of many known protein structural classes. Binary one-vs-the-rest SVM classifiers that are trained to recognize individual structural classes yield prediction scores that are not comparable, so that standard "one-vs-all" classification fails to perform well. Moreover, SVMs for classes at different levels of the protein structural hierarchy may make useful predictions, but one-vs-all does not try to combine these multiple predictions. To deal with these problems, our method learns relative weights between one-vs-the-rest classifiers and encodes information about the protein structural hierarchy for multi-class prediction. 
In large-scale benchmark results based on the SCOP database, our code weighting approach significantly improves on the standard one-vs-all method for both the superfamily and fold prediction in the remote homology setting and on the fold recognition problem. Moreover, our code weight learning algorithm strongly outperforms nearest-neighbor methods based on PSI-BLAST in terms of prediction accuracy on every structure classification problem we consider. CONCLUSION: By combining state-of-the-art SVM kernel methods with a novel multi-class algorithm, the SVM-Fold system delivers efficient and accurate protein fold and superfamily recognition.


Subject(s)
Algorithms , Artificial Intelligence , Automated Pattern Recognition/methods , Proteins/chemistry , Sequence Alignment/methods , Protein Sequence Analysis/methods , Software , Amino Acid Sequence , Discriminant Analysis , Internet , Molecular Sequence Data , Protein Folding , Proteins/classification , Amino Acid Sequence Homology
12.
BMC Bioinformatics ; 7 Suppl 1: S10, 2006 Mar 20.
Article in English | MEDLINE | ID: mdl-16723003

ABSTRACT

BACKGROUND: Biologists regularly search DNA or protein databases for sequences that share an evolutionary or functional relationship with a given query sequence. Traditional search methods, such as BLAST and PSI-BLAST, focus on detecting statistically significant pairwise sequence alignments and often miss more subtle sequence similarity. Recent work in the machine learning community has shown that exploiting the global structure of the network defined by these pairwise similarities can help detect more remote relationships than a purely local measure. METHODS: We review RankProp, a ranking algorithm that exploits the global network structure of similarity relationships among proteins in a database by performing a diffusion operation on a protein similarity network with weighted edges. The original RankProp algorithm is unsupervised. Here, we describe a semi-supervised version of the algorithm that uses labeled examples. Three possible ways of incorporating label information are considered: (i) as a validation set for model selection, (ii) to learn a new network, by choosing which transfer function to use for a given query, and (iii) to estimate edge weights, which measure the probability of inferring structural similarity. RESULTS: Benchmarked on a human-curated database of protein structures, the original RankProp algorithm provides significant improvement over local network search algorithms such as PSI-BLAST. Furthermore, we show here that labeled data can be used to learn a network without any need for estimating parameters of the transfer function, and that diffusion on this learned network produces better results than the original RankProp algorithm with a fixed network. CONCLUSION: In order to gain maximal information from a network, labeled and unlabeled data should be used to extract both local and global structure.


Subject(s)
Computational Biology/methods , Proteomics/methods , Algorithms , DNA/chemistry , Humans , Computer Neural Networks , Protein Conformation , Proteins/chemistry , Reproducibility of Results , Protein Sequence Analysis , Software
13.
FEBS J ; 272(20): 5119-28, 2005 Oct.
Article in English | MEDLINE | ID: mdl-16218946

ABSTRACT

Perhaps the most widely used applications of bioinformatics are tools such as PSI-BLAST for searching sequence databases. We describe a recently developed protein database search algorithm called RankProp. RankProp relies upon a precomputed network of pairwise protein similarities. The algorithm performs a diffusion operation from a specified query protein across the protein similarity network. The resulting activation scores, assigned to each database protein, encode information about the global structure of the protein similarity network. This type of algorithm has a rich history in associationist psychology, artificial intelligence and web search. We describe the RankProp algorithm and its relatives, and we provide evidence that the algorithm successfully improves upon the rankings produced by PSI-BLAST.
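A diffusion of activation scores over a similarity network can be sketched generically as below. This is an illustration of the idea, not the published algorithm: the update rule, the damping factor `alpha`, the fixed iteration count, and the function name `diffuse_scores` are all assumptions, and the edge weights are assumed to be small enough that the iteration converges.

```python
def diffuse_scores(similarity, query, alpha=0.5, iterations=20):
    """Spread activation from a query protein across a similarity network.

    similarity: dict node -> dict neighbor -> edge weight (assumed symmetric).
    Returns a dict node -> activation score; higher means ranked closer.
    """
    nodes = list(similarity)
    # Direct similarity to the query acts as a constant input at every step.
    base = {n: similarity[query].get(n, 0.0) for n in nodes}
    scores = {n: 0.0 for n in nodes}
    scores[query] = 1.0
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            if n == query:
                new_scores[n] = 1.0  # the query stays fully activated
                continue
            # Activation flowing in from every non-query neighbor.
            incoming = sum(
                similarity[m].get(n, 0.0) * scores[m]
                for m in nodes
                if m != query
            )
            new_scores[n] = base[n] + alpha * incoming
        scores = new_scores
    return scores
```

In this sketch a protein with no direct edge to the query can still receive a nonzero score through an intermediate protein, which is the sense in which the scores reflect global network structure rather than only pairwise similarity.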


Subject(s)
Algorithms , Computational Biology/methods , Sequence Alignment/methods , Bacterial Proteins/genetics , Protein Databases , Internet , Microbial Photoreceptors/genetics , Protein Tertiary Structure/genetics , Proteins/genetics , ROC Curve
14.
Bioinformatics ; 21(19): 3711-8, 2005 Oct 01.
Article in English | MEDLINE | ID: mdl-16076885

ABSTRACT

MOTIVATION: Sequence similarity often suggests evolutionary relationships between protein sequences that can be important for inferring similarity of structure or function. The most widely-used pairwise sequence comparison algorithms for homology detection, such as BLAST and PSI-BLAST, often fail to detect less conserved remotely-related targets. RESULTS: In this paper, we propose a new general graph-based propagation algorithm called MotifProp to detect more subtle similarity relationships than pairwise comparison methods. MotifProp is based on a protein-motif network, in which edges connect proteins and the k-mer based motif features that they contain. We show that our new motif-based propagation algorithm can improve the ranking results over a base algorithm, such as PSI-BLAST, that is used to initialize the ranking. Despite the complex structure of the protein-motif network, MotifProp can be easily interpreted using the top-ranked motifs and motif-rich regions induced by the propagation, both of which are helpful for discovering conserved structural components in remote homologies.


Subject(s)
Algorithms , Amino Acid Motifs , Chemical Models , Molecular Models , Proteins/chemistry , Sequence Alignment/methods , Protein Sequence Analysis/methods , Amino Acid Sequence , Computer Simulation , Molecular Sequence Data , Automated Pattern Recognition/methods , Proteins/analysis , Amino Acid Sequence Homology
15.
Bioinformatics ; 21(15): 3241-7, 2005 Aug 01.
Article in English | MEDLINE | ID: mdl-15905279

ABSTRACT

MOTIVATION: Building an accurate protein classification system depends critically upon choosing a good representation of the input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data (examples with known 3D structures, organized into structural classes), whereas in practice, unlabeled data are far more plentiful. RESULTS: In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while at the same time achieving far greater computational efficiency. AVAILABILITY: Source code is available at www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot. The Spider matlab package is available at www.kyb.tuebingen.mpg.de/bs/people/spider. SUPPLEMENTARY INFORMATION: www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot.


Subject(s)
Algorithms , Artificial Intelligence , Cluster Analysis , Automated Pattern Recognition/methods , Proteins/chemistry , Proteins/classification , Sequence Alignment/methods , Protein Sequence Analysis/methods , Software , Proteins/analysis
16.
IEEE Trans Biomed Eng ; 51(6): 1003-10, 2004 Jun.
Article in English | MEDLINE | ID: mdl-15188871

ABSTRACT

When designing a brain-computer interface (BCI) system, one can choose from a variety of features that may be useful for classifying brain activity during a mental task. For the special case of classifying electroencephalogram (EEG) signals, we propose the use of the state-of-the-art feature selection algorithms Recursive Feature Elimination and Zero-Norm Optimization, which are based on the training of support vector machines (SVMs). These algorithms can provide more accurate solutions than standard filter methods for feature selection. We adapt the methods for the purpose of selecting EEG channels. For a motor imagery paradigm, we show that the number of channels used can be reduced significantly without increasing the classification error. The resulting best channels agree well with the expected underlying cortical activity patterns during the mental tasks. Furthermore, we show how time-dependent, task-specific information can be visualized.


Subject(s)
Algorithms , Artificial Intelligence , Cerebral Cortex/physiology , Electroencephalography/methods , Motor Evoked Potentials/physiology , User-Computer Interface , Cluster Analysis , Hand/physiology , Humans , Male , Automated Pattern Recognition , Reproducibility of Results , Sensitivity and Specificity
17.
Proc Natl Acad Sci U S A ; 101(17): 6559-63, 2004 Apr 27.
Article in English | MEDLINE | ID: mdl-15087500

ABSTRACT

Biologists regularly search databases of DNA or protein sequences for evolutionary or functional relationships to a given query sequence. We describe a ranking algorithm that exploits the entire network structure of similarity relationships among proteins in a sequence database by performing a diffusion operation on a precomputed, weighted network. The resulting ranking algorithm, evaluated by using a human-curated database of protein structures, is efficient and provides significantly better rankings than a local network search algorithm such as PSI-BLAST.


Subject(s)
Proteins/chemistry , Algorithms , Nucleic Acid Databases , Protein Databases , Protein Conformation , Proteins/genetics
18.
Bioinformatics ; 20(4): 467-76, 2004 Mar 01.
Article in English | MEDLINE | ID: mdl-14990442

ABSTRACT

MOTIVATION: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. RESULTS: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies.
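The notion of "shared occurrences of fixed-length patterns, allowing for mutations" can be made concrete with a naive sketch. It deliberately does not use the mismatch tree described in the abstract: it enumerates the mismatch neighborhood of every k-mer directly, which is much slower but easier to read. The function names, the default values `k=3` and `m=1`, and the unnormalized dot product are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations, product


def mismatch_neighbors(kmer, m, alphabet):
    """All strings of the same length within Hamming distance m of kmer."""
    neighbors = {kmer}
    for d in range(1, m + 1):
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                neighbors.add("".join(chars))
    return neighbors


def mismatch_features(seq, k, m, alphabet):
    """Feature map: each k-mer in seq adds 1 to every pattern in its neighborhood."""
    features = Counter()
    for i in range(len(seq) - k + 1):
        for neighbor in mismatch_neighbors(seq[i:i + k], m, alphabet):
            features[neighbor] += 1
    return features


def mismatch_kernel(x, y, k=3, m=1, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Unnormalized kernel value: dot product of the two feature maps."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(count * fy[kmer] for kmer, count in fx.items())
```

With `m=0` the sketch reduces to counting exactly shared k-mers; raising `m` lets k-mers that differ in a few positions contribute as well.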


Subject(s)
Artificial Intelligence , Proteins/chemistry , Proteins/classification , Sequence Alignment/methods , Protein Sequence Analysis/methods , Algorithms , Amino Acid Sequence , Molecular Sequence Data , Nuclear Proteins/chemistry , Nuclear Proteins/classification , Automated Pattern Recognition , Phosphoprotein Phosphatases , Amino Acid Sequence Homology
19.
Bioinformatics ; 19(6): 764-71, 2003 Apr 12.
Article in English | MEDLINE | ID: mdl-12691989

ABSTRACT

MOTIVATION: In drug discovery a key task is to identify characteristics that separate active (binding) compounds from inactive (non-binding) ones. An automated prediction system can help reduce resources necessary to carry out this task. RESULTS: Two methods for prediction of molecular bioactivity for drug design are introduced and shown to perform well in a data set previously studied as part of the KDD (Knowledge Discovery and Data Mining) Cup 2001. The data is characterized by very few positive examples, a very large number of features (describing three-dimensional properties of the molecules) and rather different distributions between training and test data. Two techniques are introduced specifically to tackle these problems: a feature selection method for unbalanced data and a classifier which adapts to the distribution of the unlabeled test data (a so-called transductive method). We show both techniques improve identification performance and in conjunction provide an improvement over using only one of the techniques. Our results suggest the importance of taking into account the characteristics in this data which may also be relevant in other problems of a similar type.


Subject(s)
Algorithms , Artificial Intelligence , Drug Design , Biological Models , Protein Interaction Mapping/methods , Drug Receptors/metabolism , Binding Sites , Protein Databases , Macromolecular Substances , Chemical Models , Statistical Models , Automated Pattern Recognition , Principal Component Analysis , Protein Binding , Proteins/chemistry , Proteins/metabolism , Reproducibility of Results , Sensitivity and Specificity
20.
J Comput Biol ; 9(2): 401-11, 2002.
Article in English | MEDLINE | ID: mdl-12015889

ABSTRACT

In our attempts to understand cellular function at the molecular level, we must be able to synthesize information from disparate types of genomic data. We consider the problem of inferring gene functional classifications from a heterogeneous data set consisting of DNA microarray expression measurements and phylogenetic profiles from whole-genome sequence comparisons. We demonstrate the application of the support vector machine (SVM) learning algorithm to this functional inference task. Our results suggest the importance of exploiting prior information about the heterogeneity of the data. In particular, we propose an SVM kernel function that is explicitly heterogeneous. In addition, we describe feature scaling methods for further exploiting prior knowledge of heterogeneity by giving each data type different weights.
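One common way to make a kernel respect two data types is to compute a separate kernel on each type and add the results, with a weight per type. The abstract does not state the form of its kernel, so the sketch below is an assumption throughout: the polynomial base kernel, its degree, the weights, and the name `heterogeneous_kernel` are illustrative only.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(u, v, degree=3):
    """Polynomial kernel on a single data type (an assumed base kernel)."""
    return (dot(u, v) + 1.0) ** degree


def heterogeneous_kernel(x, y, weights=(1.0, 1.0), degree=3):
    """Weighted sum of per-data-type kernels.

    x, y: tuples (expression_vector, phylogenetic_profile) for two genes.
    Features from different data types never interact inside a kernel term.
    """
    return sum(
        w * poly_kernel(xv, yv, degree)
        for w, xv, yv in zip(weights, x, y)
    )
```

A sum of valid kernels with nonnegative weights is itself a valid kernel, so the result can be passed to an SVM unchanged; the weights play the role of giving each data type a different influence.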


Subject(s)
Artificial Intelligence , Gene Expression Profiling/statistics & numerical data , Phylogeny , Algorithms , Computational Biology , Genetic Databases , Fungal Genes , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae Proteins/genetics