ABSTRACT
Gene regulatory networks are graph models representing cellular transcription events. These networks are far from complete because experimental validation and curation of the interactions are time- and resource-intensive. Previous assessments have shown the modest performance of the available network inference methods based on gene expression data. Here, we study several caveats in the inference of regulatory networks and in the assessment of methods: the quality of the input data and of the gold standard, and the assessment approach itself, with a focus on the global structure of the network. We used synthetic and biological data for the predictions and experimentally validated biological networks as the gold standard (ground truth). Standard performance metrics and graph structural properties suggest that methods inferring co-expression networks should no longer be assessed on equal terms with those inferring regulatory interactions. While methods inferring regulatory interactions perform better in global regulatory network inference than co-expression-based methods, the latter are better suited to inferring function-specific regulons and co-regulation networks. When merging expression data, the gain in dataset size should outweigh the added noise, and graph structure should be considered when integrating the inferences. We conclude with guidelines for taking advantage of inference methods and their assessment based on the applications and available expression datasets.
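As a rough illustration of how an inferred network can be scored against a gold standard with standard performance metrics, the following minimal sketch computes precision, recall and F1 over edge sets. The function name `score_edges`, the `directed` switch and the toy gene names are assumptions made for this example and do not come from the study; the `directed=False` option only illustrates why undirected co-expression edges may look worse when scored as directed regulatory interactions.

```python
def score_edges(predicted, gold, directed=True):
    """Return (precision, recall, f1) for predicted edges against a gold standard.

    Edges are (regulator, target) pairs. With directed=False, orientation is
    ignored, which is how an undirected co-expression edge could be compared.
    """
    if directed:
        predicted, gold = set(predicted), set(gold)
    else:
        predicted = {frozenset(edge) for edge in predicted}
        gold = {frozenset(edge) for edge in gold}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy gold standard and two toy predictions (hypothetical gene names).
gold = {("tfA", "g1"), ("tfA", "g2"), ("tfB", "g3")}
regulatory = {("tfA", "g1"), ("tfB", "g3"), ("tfB", "g1")}
coexpression = {("g1", "tfA"), ("g2", "g1")}

print(score_edges(regulatory, gold))
print(score_edges(coexpression, gold))                  # scored as directed
print(score_edges(coexpression, gold, directed=False))  # orientation ignored
```

In this toy case the co-expression prediction recovers no edge when orientation is required and one edge when it is ignored, which is one reason the two families of methods may call for different assessment.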
ABSTRACT
Context: Inferring gene regulatory networks (GRN) from high-throughput gene expression data is a challenging task for which different strategies have been developed. Nevertheless, no method wins in every setting, and each method has its advantages, intrinsic biases, and application domains. Thus, in order to analyze a dataset, users should be able to test different techniques and choose the most appropriate one. This step can be particularly difficult and time-consuming, since most methods' implementations are made available independently, possibly in different programming languages. The implementation of an open-source library containing different inference methods within a common framework is expected to be a valuable toolkit for the systems biology community. Results: In this work, we introduce GReNaDIne (Gene Regulatory Network Data-driven Inference), a Python package that implements 18 machine learning data-driven gene regulatory network inference methods. It also includes eight generalist preprocessing techniques, suitable for both RNA-seq and microarray dataset analysis, as well as four normalization techniques dedicated to RNA-seq. In addition, this package makes it possible to combine the results of different inference tools to form robust and efficient ensembles. This package has been successfully assessed on the DREAM5 challenge benchmark dataset. The open-source GReNaDIne Python package is made freely available in a dedicated GitLab repository, as well as in the official third-party software repository, the Python Package Index (PyPI). The latest documentation on the GReNaDIne library is also available at Read the Docs, an open-source software documentation hosting platform. Contribution: The GReNaDIne tool represents a technological contribution to the field of systems biology. This package can be used to infer gene regulatory networks from high-throughput gene expression data using different algorithms within the same framework.
To analyze their datasets, users can apply a battery of preprocessing and postprocessing tools, choose the most suitable inference method from the GReNaDIne library, and even combine the output of different methods to obtain more robust results. The results format provided by GReNaDIne is compatible with well-known complementary refinement tools such as pySCENIC.
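Combining the output of several inference methods can be sketched, under stated assumptions, as an average-rank ensemble. The sketch below is generic and does not use or reproduce the GReNaDIne API; the function name `average_rank_ensemble`, the dictionary-based score tables and the toy scores are all hypothetical. Edges missing from a method's output are assumed to take a rank one past that method's last ranked edge.

```python
def average_rank_ensemble(score_tables):
    """Combine edge scores from several methods by averaging their ranks.

    score_tables: list of dicts mapping (regulator, target) -> score,
    where a higher score means a more confident edge.
    Returns a list of (edge, mean_rank), best (lowest mean rank) first.
    """
    edges = set().union(*score_tables)
    totals = {edge: 0.0 for edge in edges}
    for table in score_tables:
        ranked = sorted(table, key=table.get, reverse=True)
        rank_of = {edge: position + 1 for position, edge in enumerate(ranked)}
        worst_rank = len(table) + 1  # assumed rank for edges a method omits
        for edge in edges:
            totals[edge] += rank_of.get(edge, worst_rank)
    n_methods = len(score_tables)
    combined = [(edge, totals[edge] / n_methods) for edge in edges]
    # Sort by mean rank, breaking ties by edge name for reproducibility.
    return sorted(combined, key=lambda item: (item[1], item[0]))


# Toy outputs of two hypothetical inference methods.
method_a = {("tfA", "g1"): 0.9, ("tfA", "g2"): 0.5, ("tfB", "g3"): 0.1}
method_b = {("tfA", "g1"): 0.7, ("tfB", "g3"): 0.6}

ensemble = average_rank_ensemble([method_a, method_b])
for edge, mean_rank in ensemble:
    print(edge, mean_rank)
```

Averaging ranks rather than raw scores avoids having to put scores from different methods on a common scale, which is one common motivation for this kind of ensemble.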
Subject(s)
Computational Biology, Gene Regulatory Networks, Computational Biology/methods, Saint Vincent and the Grenadines, Software, Gene Expression
ABSTRACT
Corynebacterium glutamicum is a Gram-positive bacterium found in soil, where changing conditions demand plasticity of the regulatory machinery. The study of this machinery at the global scale has been hampered by the lack of data integration. Here, we report three regulatory network models for C. glutamicum: strong (3040 interactions), constructed solely with regulations previously supported by directed experiments; all evidence (4665 interactions), containing the strong network, regulations previously supported by nondirected experiments, and protein-protein interactions with a direct effect on gene transcription; and sRNA (5222 interactions), containing the all evidence network and sRNA-mediated regulations. Compared to the previous version (2018), the strong and all evidence networks increased by 75 and 1225 interactions, respectively. We analyzed the system-level components of the three networks to identify how they differ and compared their structures against those of the networks of more than 40 species. The inclusion of the sRNA-mediated regulations changed the proportions of the system-level components and increased the number of modules but decreased their size. The C. glutamicum regulatory structure contrasted with other bacterial regulatory networks. Finally, we used the strong networks of three model organisms to provide insights and future directions for the characterization of the C. glutamicum regulatory network.
ABSTRACT
The effects of lead poisoning are wide-ranging and include nervous system impairment, particularly during development, leading to neural damage. Lead interaction with calcium- and zinc-containing metalloproteins broadly affects cellular metabolism, since these proteins are related to intracellular ion balance, activation of signal transduction cascades, and gene expression regulation. Although lead is recognized as a neurotoxin, there are gaps in knowledge about its global effect on the transcription of entire cellular systems in neural cells. To investigate the effects of lead poisoning from a systemic perspective, we applied the transcriptogram methodology to an RNA-seq dataset of human embryonic-derived neural progenitor cells (ES-NP cells) treated with 30 µM lead acetate for 26 days. We observed early downregulation of several cellular systems involved in cell differentiation, such as cytoskeleton organization, RNA, and protein biosynthesis. The downregulated cellular systems presented large and tightly connected networks. For long treatment times (12 to 26 days), we observed a massive impairment of the cell transcription profile. Taking the enriched terms together, we observed interference in all layers of gene expression regulation, from chromatin remodeling to vesicle transport. Considering that ES-NP cells are progenitor cells that can give rise to other neural cell types, our results suggest that lead-induced gene expression disturbance might impair the cells' ability to differentiate, thereby influencing ES-NP cells' fate.
ABSTRACT
Gene network (GN) inference from temporal gene expression data is a crucial and challenging problem in systems biology. Expression data sets usually consist of dozens of temporal samples, while networks consist of thousands of genes, thus rendering many inference methods infeasible in practice. To improve the scalability of GN inference methods, we propose a novel framework called GeNICE, based on probabilistic GNs; the main novelty is the introduction of a clustering procedure to group genes with related expression profiles and to provide an approximate solution with reduced computational complexity. We use the defined clusters to perform an exhaustive search to retrieve the best predictor gene subsets for each target gene, according to multivariate criterion functions. GeNICE greatly reduces the search space because predictor candidates are restricted to one gene per cluster. Finally, a multivariate analysis is performed for each defined predictor subset to retrieve minimal subsets and to simplify the network. In our experiments with in silico generated data sets, GeNICE achieved a substantial reduction in computational time compared to solutions without the clustering step, while preserving the gene expression prediction accuracy even when the number of clusters is small (about 50) relative to the number of genes (on the order of thousands). For a Plasmodium falciparum microarray data set, the prediction accuracy achieved by GeNICE was roughly 97%, while the respective topologies involving glycolytic and apicoplast seed genes had very large intramodularity, very small interconnection between modules, and some module hub genes, reflecting small-world and scale-free topological properties, as expected.
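The idea of restricting predictor candidates to one gene per cluster before an exhaustive subset search can be sketched as follows. This is a simplified illustration under several assumptions and is not the GeNICE implementation: expression is assumed to be binarized, the greedy Hamming-distance clustering is a stand-in for the clustering procedure, conditional entropy is one possible choice of multivariate criterion function, and all function and gene names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def cluster_genes(profiles, max_distance):
    """Greedy stand-in clustering: a gene joins the first cluster whose
    representative (first member) differs in at most max_distance samples."""
    clusters = []
    for gene, profile in profiles.items():
        for members in clusters:
            representative = profiles[members[0]]
            if sum(a != b for a, b in zip(profile, representative)) <= max_distance:
                members.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters


def conditional_entropy(predictor_states, target_states):
    """Estimate H(target | predictors) in bits from paired observations."""
    joint = Counter(zip(predictor_states, target_states))
    marginal = Counter(predictor_states)
    n = len(target_states)
    return sum(c / n * log2(marginal[x] / c) for (x, _), c in joint.items())


def best_predictors(profiles, target, candidates, max_size=2):
    """Exhaustively score candidate subsets; predictors at time t are used
    to explain the target at time t + 1. Smaller subsets win ties."""
    future = profiles[target][1:]
    best_score, best_subset = float("inf"), ()
    for size in range(1, max_size + 1):
        for subset in combinations(candidates, size):
            states = list(zip(*(profiles[g][:-1] for g in subset)))
            score = conditional_entropy(states, future)
            if score < best_score:
                best_score, best_subset = score, subset
    return best_score, best_subset


# Toy binarized time series (hypothetical); "t" copies "g1" with a one-step lag.
profiles = {
    "g1": [0, 1, 0, 1, 1, 0, 0, 1],
    "g2": [0, 1, 0, 1, 1, 0, 1, 1],  # near-duplicate of g1
    "g3": [1, 1, 0, 0, 1, 0, 1, 0],
    "t":  [0, 0, 1, 0, 1, 1, 0, 0],
}
clusters = cluster_genes(profiles, max_distance=1)
candidates = [members[0] for members in clusters if members[0] != "t"]
score, subset = best_predictors(profiles, "t", candidates)
print(clusters, candidates, score, subset)
```

With one representative per cluster, the number of subsets to score grows with the number of clusters rather than the number of genes, which is where the reduction in search space comes from.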