Results 1 - 20 of 42
1.
Funct Integr Genomics ; 24(5): 139, 2024 Aug 19.
Article in English | MEDLINE | ID: mdl-39158621

ABSTRACT

Recent advancements in biomedical technologies and the proliferation of high-dimensional Next Generation Sequencing (NGS) datasets have led to significant growth in the bulk and density of data. High-dimensional NGS data, characterized by a large number of genomics, transcriptomics, proteomics, and metagenomics features relative to the number of biological samples, pose significant challenges for data analysis, including increased computational burden, potential overfitting, and difficulty in interpreting results. Feature selection and feature extraction are two pivotal techniques employed to address these challenges by reducing the dimensionality of the data, thereby enhancing model performance, interpretability, and computational efficiency. Both can be categorized into statistical and machine learning methods. The present study conducts a comprehensive and comparative review of various statistical, machine learning, and deep learning-based feature selection and extraction techniques specifically tailored for the interpretation of human NGS and microarray data. A thorough literature search was performed to gather information on these techniques, focusing on array-based and NGS data analysis. The techniques surveyed here, including deep learning architectures, machine learning algorithms, and statistical methods, have been explored for microarray, bulk RNA-Seq, and single-cell RNA-Seq (scRNA-Seq) datasets. The study provides an overview of these techniques, highlighting their applications, advantages, and limitations in the context of high-dimensional NGS data. This review helps readers apply feature selection and feature extraction techniques to enhance the performance of predictive models, uncover underlying biological patterns, and gain deeper insights into massive and complex NGS and microarray data.


Subject(s)
High-Throughput Nucleotide Sequencing; Machine Learning; Humans; High-Throughput Nucleotide Sequencing/methods; Deep Learning
2.
Comput Struct Biotechnol J ; 21: 5382-5393, 2023.
Article in English | MEDLINE | ID: mdl-38022693

ABSTRACT

Analysis and interpretation of high-throughput transcriptional and chromatin accessibility data at single-cell (sc) resolution are still open challenges in the biomedical field. The existence of countless bioinformatics tools for the different analytical steps increases the complexity of data interpretation and the difficulty of deriving biological insights. In this article, we present SCALA, a bioinformatics tool for analysis and visualization of single-cell RNA sequencing (scRNA-seq) and single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) datasets, enabling either independent or integrative analysis of the two modalities. SCALA combines standard types of analysis by integrating multiple software packages, ranging from quality control to the identification of distinct cell populations and cell states. Additional analysis options enable functional enrichment, cellular trajectory inference, ligand-receptor analysis, and regulatory network reconstruction. SCALA is fully parameterizable, presenting data in tabular format and producing publication-ready visualizations. The different available analysis modules can aid biomedical researchers in exploring, analyzing, and visualizing their data without any prior experience in coding. We demonstrate the functionality of SCALA through two use-cases related to TNF-driven arthritic mice, handling both scRNA-seq and scATAC-seq datasets. SCALA is developed in R, Shiny and JavaScript and is mainly available as a standalone version, while an online service of more limited capacity can be found at http://scala.pavlopouloslab.info or https://scala.fleming.gr.

3.
Stud Health Technol Inform ; 305: 194-197, 2023 Jun 29.
Article in English | MEDLINE | ID: mdl-37386994

ABSTRACT

The paper presents the current state of the FHIR Genomics resource, an assessment of FAIR data usage, and possible future directions. FHIR Genomics forges a path towards data interoperability. By integrating the FAIR principles with the FHIR resources, we can achieve greater standardization across healthcare data collection and smoother data exchange. Using the FHIR Genomics resource as an example, we want to pave the way towards the integration of genomic data into an Obstetrics-Gynecology Information system, as a future direction, in order to identify possible disease predisposition in the fetus.


Subject(s)
Gynecology; Obstetrics; Female; Pregnancy; Humans; Genomics; Data Collection; Fetus
4.
Front Immunol ; 14: 1146826, 2023.
Article in English | MEDLINE | ID: mdl-37180102

ABSTRACT

The human leukocyte antigen (HLA) locus plays a central role in adaptive immune function and has significant clinical implications for tissue transplant compatibility and allelic disease associations. Studies using bulk-cell RNA sequencing have demonstrated that HLA transcription may be regulated in an allele-specific manner, and single-cell RNA sequencing (scRNA-seq) has the potential to better characterize these expression patterns. However, quantification of allele-specific expression (ASE) for HLA loci requires sample-specific reference genotyping due to extensive polymorphism. While genotype prediction from bulk RNA sequencing is well described, the feasibility of predicting HLA genotypes directly from single-cell data is unknown. Here we evaluate and expand upon several computational HLA genotyping tools by comparing predictions from human single-cell data to gold-standard molecular genotyping. The highest 2-field accuracy averaged across all loci was 76% by arcasHLA and increased to 86% using a composite model of multiple genotyping tools. We also developed a highly accurate model (AUC 0.93) for predicting HLA-DRB345 copy number in order to improve genotyping accuracy of the HLA-DRB locus. Genotyping accuracy improved with read depth and was reproducible at repeat sampling. Using a meta-analytic approach, we also show that HLA genotypes from PHLAT and OptiType can generate ASE ratios that are highly correlated (R² = 0.8 and 0.94, respectively) with those derived from gold-standard genotyping.


Subject(s)
HLA Antigens; Transcriptome; Humans; Sequence Analysis, DNA; HLA Antigens/genetics; Histocompatibility Antigens Class I/genetics; Genotype; Histocompatibility Antigens Class II/genetics
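
The "2-field accuracy" reported in the abstract above compares alleles after truncating them to their first two fields (for example, A*02:01:01:01 becomes A*02:01). The sketch below illustrates one plausible way to score a prediction at that resolution. It is not the authors' evaluation code: the function names and the per-allele matching scheme are assumptions made for illustration, and the input is assumed to use the standard `gene*field:field` naming.

```python
def two_field(allele: str) -> str:
    """Truncate an HLA allele name to two fields, e.g. 'A*02:01:01:01' -> 'A*02:01'."""
    gene, _, fields = allele.partition("*")
    return gene + "*" + ":".join(fields.split(":")[:2])


def two_field_accuracy(predicted: list[str], truth: list[str]) -> float:
    """Fraction of reference alleles matched by a prediction at 2-field resolution.

    Hypothetical scoring scheme: each predicted allele may match at most one
    reference allele, so homozygous calls are not double counted.
    """
    remaining = [two_field(a) for a in predicted]
    hits = 0
    for allele in truth:
        key = two_field(allele)
        if key in remaining:
            remaining.remove(key)  # consume the matched prediction
            hits += 1
    return hits / len(truth)


if __name__ == "__main__":
    predicted = ["A*02:01:01", "A*24:02"]
    truth = ["A*02:01", "A*03:01"]
    print(two_field_accuracy(predicted, truth))
```

Averaging this value across loci and samples would give a summary comparable in spirit to the figures quoted in the abstract, although the paper's exact definition may differ.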
5.
Mol Ecol Resour ; 2022 Dec 02.
Article in English | MEDLINE | ID: mdl-36458971

ABSTRACT

Polyploids are cells or organisms with a genome consisting of more than two sets of homologous chromosomes. Polyploid plants have important traits that facilitate speciation and are thus often model systems for evolutionary, molecular ecology and agricultural studies. However, due to their unusual mode of inheritance and double-reduction, diploid models of population genetic analysis cannot properly be applied to autopolyploids. To overcome this problem, we developed a software package entitled vcfpop to perform a variety of population genetic analyses for autopolyploids, such as parentage analysis, analysis of molecular variance, principal coordinates analysis, hierarchical clustering analysis and Bayesian clustering. We used three data sets to evaluate the capability of vcfpop to analyse large data sets on a desktop computer. This software is freely available at http://github.com/huangkang1987/vcfpop.

6.
Front Bioinform ; 2: 842051, 2022.
Article in English | MEDLINE | ID: mdl-36304305

ABSTRACT

In eukaryotic cells, miRNAs regulate a plethora of cellular functions, ranging from cellular metabolism and development to the regulation of biological networks and pathways, under both homeostatic and pathological states such as cancer. Despite their immense importance as key regulators of cellular processes, accurate and reliable estimation of miRNAs using Next Generation Sequencing is challenging, largely due to the limited availability of robust computational tools, methods, and pipelines. Here, we introduce miRPipe, an end-to-end computational framework for the identification, characterization, and expression estimation of small RNAs, including known and novel miRNAs and previously annotated piRNAs, from small-RNA sequencing profiles. Our workflow detects unique novel miRNAs by incorporating the sequence information of seed and non-seed regions, together with clustering analysis. This approach allows reliable and reproducible detection of unique novel miRNAs and functionally identical miRNAs (paralogues). We validated the performance of miRPipe against the available state-of-the-art pipelines using both synthetic datasets generated with the newly developed miRSim tool and three cancer datasets (chronic lymphocytic leukemia, lung cancer, and breast cancer). On the synthetic dataset, miRPipe outperforms the existing state-of-the-art pipelines (accuracy: 95.23% and F1-score: 94.17%). Analysis of all three cancer datasets shows that miRPipe extracts a greater number of known dysregulated miRNAs or piRNAs than the existing pipelines.

7.
Math Biosci Eng ; 19(8): 8505-8536, 2022 06 10.
Article in English | MEDLINE | ID: mdl-35801475

ABSTRACT

Single-cell sequencing technologies have revolutionized molecular and cellular biology and stimulated the development of computational tools to analyze the data generated from these technology platforms. However, despite the recent explosion of computational analysis tools, relatively few mathematical models have been developed to utilize these data. Here we compare and contrast two cell state geometries for building mathematical models of cell state-transitions with single-cell RNA-sequencing data, with hematopoiesis as a model system: (i) using partial differential equations on a graph representing intermediate cell states between known cell types, and (ii) using the equations on a multi-dimensional continuous cell state-space. As an application of our approach, we demonstrate how the calibrated models may be used to mathematically perturb normal hematopoiesis to simulate, predict, and study the emergence of novel cell states during the pathogenesis of acute myeloid leukemia. We particularly focus on comparing the strengths and weaknesses of the graph model and the multi-dimensional model.


Subject(s)
Models, Biological; Models, Theoretical; Sequence Analysis, RNA
8.
Front Genet ; 13: 1084974, 2022.
Article in English | MEDLINE | ID: mdl-36733945

ABSTRACT

Copy number variation (CNV) is one of the main structural variations in the human genome and accounts for a considerable proportion of variations. As CNVs can directly or indirectly cause cancer, mental illness, and genetic disease in humans, their effective detection is of great interest in the fields of oncogene discovery, clinical decision-making, bioinformatics, and drug discovery. The advent of next-generation sequencing data makes CNV detection possible, and a large number of CNV detection tools are based on next-generation sequencing data. Due to the complexity (e.g., bias, noise, alignment errors) of next-generation sequencing data and CNV structures, the accuracy of existing methods in detecting CNVs remains low. In this work, we design a new CNV detection approach, called shortest path-based copy number variation (SPCNV), to improve the detection accuracy of CNVs. SPCNV calculates the k nearest neighbors of each read depth and defines the shortest path, shortest path relation, and shortest path cost sets, from which it further calculates the mean shortest path cost of each read depth and its k nearest neighbors. We use the ratio between the mean shortest path cost of each read depth and the mean of the mean shortest path costs of its k nearest neighbors to construct a relative shortest path score formula that assigns a score to each read depth. Based on the score profile, a boxplot is then applied to predict CNVs. The performance of the proposed method is verified by simulation data experiments and compared against several popular methods of the same type. Experimental results show that the proposed method achieves the best balance between recall and precision in each set of simulated samples. To further verify the performance of the proposed method in real application scenarios, we then select real sample data from the 1000 Genomes Project to conduct experiments. The proposed method achieves the best F1-scores in almost all samples. Therefore, the proposed method can be used as a more reliable tool for the routine detection of CNVs.
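
The final step described in the abstract above, applying a boxplot to a per-bin score profile, can be illustrated with the standard Tukey fence rule. The sketch below is not the SPCNV implementation: the function name, the 1.5 × IQR whisker, and the choice of quartile method are all assumptions, and the scores are made-up values standing in for relative shortest path scores.

```python
from statistics import quantiles


def boxplot_outliers(scores, whisker=1.5):
    """Return indices of scores lying outside the boxplot whiskers.

    Assumed rule: a score is an outlier if it is below Q1 - whisker * IQR
    or above Q3 + whisker * IQR (Tukey fences).
    """
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [i for i, s in enumerate(scores) if s < low or s > high]


if __name__ == "__main__":
    # Hypothetical score profile; index 6 stands out from its neighbors.
    profile = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 5.0, 1.0]
    print(boxplot_outliers(profile))
```

In a CNV setting, the flagged indices would correspond to genome bins whose score deviates strongly from the bulk of the profile; deciding whether a flagged bin is a gain or a loss would need the underlying read depth.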

9.
Clin Chem ; 68(2): 313-321, 2022 02 01.
Article in English | MEDLINE | ID: mdl-34871369

ABSTRACT

BACKGROUND: To date, the usage of Galaxy, an open-source bioinformatics platform, has been reported primarily in research. We report 5 years' experience (2015 to 2020) with Galaxy in our hospital, as part of the "Assistance Publique-Hôpitaux de Paris" (AP-HP), to demonstrate its suitability for high-throughput sequencing (HTS) data analysis in a clinical laboratory setting. METHODS: Our Galaxy instance has been running since July 2015 and is used daily to study inherited diseases, cancer, and microbiology. For the molecular diagnosis of hereditary diseases, 6970 patients were analyzed with Galaxy (corresponding to a total of 7029 analyses). RESULTS: Using Galaxy, the time to process a batch of 23 samples (equivalent to a targeted DNA sequencing MiSeq run) from raw data to an annotated variant call file was generally less than 2 h for panels between 1 and 500 kb. Over 5 years, we restarted the server only twice, for hardware maintenance, and did not experience any significant problems, demonstrating the robustness of our Galaxy installation in conjunction with HTCondor as a job scheduler and a PostgreSQL database. The quality of our targeted exome sequencing method was externally evaluated annually by the European Molecular Genetics Quality Network (EMQN). Mean (SD) sensitivity was 99% (2%) for single nucleotide variants and 93% (9%) for small insertion-deletions. CONCLUSION: Our experience with Galaxy demonstrates it to be a suitable platform for HTS data analysis with vast potential to benefit patient care in a clinical laboratory setting.


Subject(s)
Computational Biology; Laboratories, Clinical; Computational Biology/methods; High-Throughput Nucleotide Sequencing/methods; Humans; Sequence Analysis, DNA; Software
10.
Data Brief ; 39: 107607, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34869809

ABSTRACT

Messastrum gracile SE-MC4 is a non-model microalga exhibiting superior oil-accumulating abilities. However, biomass production in M. gracile SE-MC4 is limited by low cell proliferation, especially after prolonged cultivation under oil-inducing culture conditions. The present data consist of next generation RNA sequencing data of M. gracile SE-MC4 at exponential and stationary growth stages. RNA from six samples was extracted and sequenced using a 100 bp paired-end strategy on the BGISEQ-500 platform, producing a total of 59.64 Gb of data with 314 million reads. Sequences were filtered and de novo assembled into 53,307 gene sequences. Sequencing data were deposited in the National Center for Biotechnology Information (NCBI) and can be accessed via BioProject ID PRJNA552165. This information can be used to enhance biomass production in M. gracile SE-MC4 and other microalgae, with the aim of improving biodiesel development.

11.
Curr Issues Mol Biol ; 43(3): 1937-1949, 2021 Nov 06.
Article in English | MEDLINE | ID: mdl-34889894

ABSTRACT

The worldwide emergence and spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) since 2019 has highlighted the importance of rapid and reliable diagnostic testing to prevent and control viral transmission. However, false negatives (FN) caused by polymorphisms or point mutations arising from virus evolution may compromise the accuracy of the diagnostic tests. Therefore, PCR-based SARS-CoV-2 diagnostics should be evaluated and evolve together with the rapidly increasing number of new variants appearing around the world. However, even with a large collection of samples, laboratories cannot test a set that represents the level of diversity that is continuously evolving worldwide. In the present study, we propose a methodology based on in silico and in vitro analysis. First, we used all information offered by available whole-genome sequencing data for SARS-CoV-2 to select two PCR assays targeting two different regions in the genome, and to monitor the possible impact of virus evolution on the specificity of the primers and probes of the PCR assays during and after their development. Besides this first essential in silico evaluation, a minimal set of tests was proposed to generate experimental evidence on the method performance, such as specificity, sensitivity and applicability. Accordingly, a duplex reverse-transcription droplet digital PCR (RT-ddPCR) method was evaluated in silico using 154,489 whole-genome sequences of SARS-CoV-2 strains that were representative of the strains circulating around the world. The RT-ddPCR platform was selected because it offers several advantages for detecting and quantifying SARS-CoV-2 RNA in clinical samples and wastewater. Next, the assays were successfully evaluated experimentally for their sensitivity and specificity. A preliminary evaluation of the applicability of the developed method was performed using both clinical and wastewater samples.


Subject(s)
COVID-19 Nucleic Acid Testing/methods; COVID-19/virology; Diagnostic Tests, Routine/methods; Evolution, Molecular; RNA, Viral/genetics; SARS-CoV-2/genetics; COVID-19/diagnosis; Humans; ROC Curve; SARS-CoV-2/isolation & purification
12.
Front Immunol ; 12: 688183, 2021.
Article in English | MEDLINE | ID: mdl-34659196

ABSTRACT

Background: High-precision human leukocyte antigen (HLA) genotyping is crucial for anti-cancer immunotherapy, but existing tools predicting HLA genotypes using next-generation sequencing (NGS) data are insufficiently accurate. Materials and Methods: We compared availability, accuracy, correction score, and complementary ratio of eight HLA genotyping tools (OptiType, HLA-HD, PHLAT, seq2HLA, arcasHLA, HLAscan, HLA*LA, and Kourami) using 1,005 cases from the 1000 Genomes Project data. We created a new HLA-genotyping algorithm combining tools based on the precision and the accuracy of tools' combinations. Then, we assessed the new algorithm's performance in 39 in-house samples with normal whole-exome sequencing (WES) data and polymerase chain reaction-sequencing-based typing (PCR-SBT) results. Results: Regardless of the type of tool, the calls presented concordantly by more than six tools showed high accuracy and precision. The accuracy of the group with at least six concordant calls was 100% (97/97) in HLA-A, 98.2% (112/114) in HLA-B, and 97.3% (142/146) in HLA-C. The precision of the group with at least six concordant calls was over 98% in HLA-ABC. We additionally calculated the accuracy of the combination tools considering the complementary ratio of each tool and the accuracy of each tool, and the accuracy was over 98% in all groups with six or more concordant calls. We created a new algorithm that matches the above results. It selects the HLA type if more than six out of eight tools present a matched type; otherwise, the HLA type is determined experimentally through PCR-SBT. When we applied the new algorithm to 39 in-house cases, there were more than six matching calls in all HLA-A, B, and C, and the accuracy of these concordant calls was 100%. Conclusions: HLA genotyping accuracy using NGS data could be increased by combining the current HLA genotyping tools. This new algorithm could also be useful for preliminary screening to decide whether to perform an additional PCR-based experimental method instead of using tools with NGS data.


Subject(s)
Algorithms; HLA Antigens/genetics; High-Throughput Nucleotide Sequencing; Histocompatibility Testing; Histocompatibility/genetics; Neoplasms/genetics; Clinical Decision-Making; Databases, Genetic; Genotype; HLA Antigens/immunology; Humans; Immunotherapy; Neoplasms/immunology; Neoplasms/therapy; Phenotype; Predictive Value of Tests; Reproducibility of Results; Software
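
The decision rule described in the abstract above (accept an HLA call when enough of the eight tools agree, otherwise fall back to PCR-SBT) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name is hypothetical, one call per tool is assumed, and the concordance threshold is left as a parameter because the abstract refers both to "at least six" and to "more than six" concordant calls.

```python
from collections import Counter


def consensus_call(calls, min_concordant=6):
    """Return the most frequent HLA type if enough tools agree, else None.

    calls: one predicted type per tool (None where a tool made no call).
    A return value of None stands for "determine experimentally by PCR-SBT".
    """
    counts = Counter(c for c in calls if c is not None)
    if not counts:
        return None
    allele, n = counts.most_common(1)[0]
    return allele if n >= min_concordant else None


if __name__ == "__main__":
    # Hypothetical calls from eight tools for one allele of HLA-A.
    calls = ["A*02:01"] * 6 + ["A*24:02", None]
    print(consensus_call(calls))                    # six tools agree
    print(consensus_call(calls, min_concordant=7))  # stricter reading of the rule
```

Keeping the threshold explicit makes it easy to compare the two readings of the rule on the same set of calls.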
13.
J Comput Biol ; 28(9): 880-891, 2021 09.
Article in English | MEDLINE | ID: mdl-34375132

ABSTRACT

In this article, we develop a new ℓ0-based sparse Poisson graphical model with applications to gene network inference from RNA-seq gene expression count data. Assuming a pair-wise Markov property, we propose to fit a separate broken adaptive ridge-regularized log-linear Poisson regression on each node to evaluate the conditional, instead of marginal, association between two genes in the presence of all other genes. The resulting sparse gene networks are generally more accurate than those generated by the ℓ1-regularized Poisson graphical model, as demonstrated by our empirical studies. A real data illustration is given on a kidney renal clear cell carcinoma microRNA-seq dataset from The Cancer Genome Atlas.


Subject(s)
Algorithms; Linear Models; Neoplasms/genetics; Sequence Analysis, RNA/methods; Carcinoma, Renal Cell/genetics; Computer Graphics; Gene Expression Regulation; Humans; Kidney Neoplasms/genetics; MicroRNAs; Poisson Distribution
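
The node-wise regression described in the abstract above can be written out as follows. This is a sketch of the general form of such a model, not a transcription of the paper's equations: the symbols ($X_j$ for the count of gene $j$, $\beta_{jk}$ for the conditional effect of gene $k$ on gene $j$, $\ell_j$ for the Poisson log-likelihood of node $j$, $\lambda$ for the tuning parameter) are chosen here for illustration, and the scaling of the likelihood term and the rule for turning nonzero coefficients into edges are assumptions.

```latex
% Log-linear Poisson regression of gene j on all other genes
\log \mathbb{E}\left[ X_j \mid X_{-j} \right] = \beta_{j0} + \sum_{k \neq j} \beta_{jk} X_k

% l0-penalized fit for node j
\hat{\beta}_j = \arg\min_{\beta_j} \left\{ -\ell_j(\beta_j)
  + \lambda \sum_{k \neq j} \mathbf{1}\left( \beta_{jk} \neq 0 \right) \right\}

% Broken adaptive ridge: iteratively reweighted ridge surrogate for the l0 penalty
\beta_j^{(m+1)} = \arg\min_{\beta_j} \left\{ -\ell_j(\beta_j)
  + \lambda \sum_{k \neq j} \frac{\beta_{jk}^{2}}{\bigl( \beta_{jk}^{(m)} \bigr)^{2}} \right\}
```

As the iterations proceed, coefficients that shrink toward zero receive ever larger ridge weights, which is what lets the ridge surrogate approximate an ℓ0 penalty; an edge between genes $j$ and $k$ would then be drawn when the fitted $\beta_{jk}$ or $\beta_{kj}$ is nonzero.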
14.
Front Genet ; 12: 642473, 2021.
Article in English | MEDLINE | ID: mdl-34163521

ABSTRACT

Copy number variation (CNV) is a genomic mutation that plays an important role in tumor evolution and tumorigenesis. Accurate detection of CNVs from next-generation sequencing (NGS) data is still a challenging task due to artifacts such as unevenly mapped reads and unbalanced amplitudes of gains and losses. This study proposes a new approach called HBOS-CNV to detect CNVs from NGS data. The central point of HBOS-CNV is that it uses a new statistic, the histogram-based outlier score (HBOS), to evaluate the fluctuation of genome bins and determine those with changed copy numbers. In comparison with existing statistics for the evaluation of CNVs, HBOS is a non-linearly transformed value of the observed read depth (RD) of each genome bin, which can potentially relieve the effects resulting from the above artifacts. In the calculation of HBOS values, a dynamic width histogram is used to depict the density of bins on the genome being analyzed, which can reduce the effects of noise partially contributed by mapping and sequencing errors. Evaluating genome bins with this new statistic gives less extreme CNVs a high probability of detection. We evaluated this method using a large number of simulation datasets and compared it with four existing methods (CNVnator, CNV-IFTV, CNV-LOF, and iCopyDav). The results demonstrated that our proposed method outperforms the others in terms of sensitivity, precision, and F1-measure. Furthermore, we applied the proposed method to a set of real sequencing samples from the 1000 Genomes Project and identified a number of CNVs of biological significance. Thus, the proposed method can be regarded as a routine approach in the field of genome mutation analysis for cancer samples.
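
To make the idea of a histogram-based outlier score on read depth concrete, the sketch below computes a simplified univariate score with equal-frequency (dynamic-width) bins: values that fall in sparse histogram regions receive high scores. It is not the HBOS-CNV implementation. The function name, the number of bins, the normalization by the densest bin, and the handling of zero-width bins are all assumptions, and ties across bin boundaries are not treated specially.

```python
import math


def hbos_scores(values, n_bins=10):
    """Simplified histogram-based outlier score for one feature.

    Values are sorted and split into bins holding roughly equal numbers of
    points, so bin width adapts to the data. Each value is scored as
    log(max_density / density_of_its_bin), which is 0 in the densest bin.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = max(1, math.ceil(len(values) / n_bins))
    densities = [0.0] * len(values)
    for start in range(0, len(order), per_bin):
        idx = order[start:start + per_bin]
        width = values[idx[-1]] - values[idx[0]]
        density = len(idx) / max(width, 1e-9)  # guard against zero-width bins
        for i in idx:
            densities[i] = density
    top = max(densities)
    return [math.log(top / d) for d in densities]


if __name__ == "__main__":
    # Hypothetical read depths for ten genome bins; the last two are unusual.
    read_depth = [10, 11, 12, 13, 14, 15, 16, 17, 50, 90]
    for rd, score in zip(read_depth, hbos_scores(read_depth, n_bins=5)):
        print(rd, round(score, 3))
```

A threshold on the resulting scores would then separate candidate copy number changes from the background; how that threshold is chosen is outside the scope of this sketch.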

15.
Genome Biol ; 22(1): 75, 2021 03 05.
Article in English | MEDLINE | ID: mdl-33673854

ABSTRACT

Controlling quality of next-generation sequencing (NGS) data files is a necessary but complex task. To address this problem, we statistically characterize common NGS quality features and develop a novel quality control procedure involving tree-based and deep learning classification algorithms. Predictive models, validated on internal and external functional genomics datasets, are to some extent generalizable to data from unseen species. The derived statistical guidelines and predictive models represent a valuable resource for users of NGS data to better understand quality issues and perform automatic quality control. Our guidelines and software are available at https://github.com/salbrec/seqQscorer.


Subject(s)
Computational Biology/methods; High-Throughput Nucleotide Sequencing; Machine Learning; Quality Control; Software; Algorithms; Computational Biology/standards; Databases, Genetic; Genomics/methods; Genomics/standards; High-Throughput Nucleotide Sequencing/methods; ROC Curve; Reproducibility of Results; Workflow
16.
Genet Epidemiol ; 45(1): 36-45, 2021 02.
Article in English | MEDLINE | ID: mdl-32864779

ABSTRACT

The breakthroughs in next generation sequencing have allowed us to access data consisting of both common and rare variants, and in particular to investigate the impact of rare genetic variation on complex diseases. Although rare genetic variants are thought to be important components in explaining genetic mechanisms of many diseases, discovering these variants remains challenging, and most studies are restricted to population-based designs. Further, despite the shift in the field of genome-wide association studies (GWAS) towards studying rare variants due to the "missing heritability" phenomenon, little is known about rare X-linked variants associated with complex diseases. For instance, there is evidence that X-linked genes are highly involved in brain development and cognition when compared with autosomal genes; however, like most GWAS for other complex traits, previous GWAS for mental diseases have provided poor resources for identifying rare variant associations on the X chromosome. In this paper, we address the two issues described above by proposing a method that can be used to test X-linked variants using sequencing data on families. Our method is much more general than existing methods, as it can be applied to detect both common and rare variants, and is applicable to autosomes as well. Our simulation study shows that the method is efficient and exhibits good operational characteristics. An application to the University of Miami Study on Genetics of Autism and Related Disorders also yielded encouraging results.


Subject(s)
Genes, X-Linked; Genome-Wide Association Study; Genetic Variation; High-Throughput Nucleotide Sequencing; Humans; Models, Genetic; Multifactorial Inheritance
17.
Brief Bioinform ; 22(1): 55-65, 2021 01 18.
Article in English | MEDLINE | ID: mdl-32249310

ABSTRACT

Precision medicine promises to revolutionize treatment, shifting therapeutic approaches from the classical one-size-fits-all to those more tailored to the patient's individual genomic profile, lifestyle and environmental exposures. Yet, to advance precision medicine's main objective (ensuring the optimum diagnosis, treatment and prognosis for each individual), investigators need access to large-scale clinical and genomic data repositories. Despite the vast proliferation of these datasets, locating and obtaining access to many remains a challenge. We sought to provide an overview of available patient-level datasets that contain both genotypic data, obtained by next-generation sequencing, and phenotypic data, and to create a dynamic, online catalog for consultation, contribution and revision by the research community. Datasets included in this review conform to six specific inclusion criteria: they (i) contain data from more than 500 human subjects; (ii) contain both genotypic and phenotypic data from the same subjects; (iii) include whole genome sequencing or whole exome sequencing data; (iv) include at least 100 recorded phenotypic variables per subject; (v) are accessible through a website or collaboration with investigators; and (vi) make access information available in English. Using these criteria, we identified 30 datasets, reviewed them and provided results in the release version of a catalog, which is publicly available through a dynamic Web application and on GitHub. Users can review as well as contribute new datasets for inclusion (Web: https://avillachlab.shinyapps.io/genophenocatalog/; GitHub: https://github.com/hms-dbmi/GenoPheno-CatalogShiny).


Subject(s)
Databases, Genetic; High-Throughput Nucleotide Sequencing/methods; Phenotype; Precision Medicine/methods; Genetic Predisposition to Disease; Humans; Whole Genome Sequencing/methods
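
The six inclusion criteria listed in the abstract above amount to a simple filter over dataset metadata. The sketch below encodes them as a predicate. It is purely illustrative: the `Dataset` record and its field names are hypothetical and do not reflect the schema of the published catalog, and the example entries are invented.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # Hypothetical metadata fields, one per criterion in the abstract.
    name: str
    n_subjects: int
    paired_geno_pheno: bool       # genotypic and phenotypic data on the same subjects
    has_wgs_or_wes: bool          # whole genome or whole exome sequencing
    n_phenotype_vars: int         # recorded phenotypic variables per subject
    accessible: bool              # via website or collaboration with investigators
    access_info_in_english: bool


def meets_inclusion_criteria(d: Dataset) -> bool:
    """Apply criteria (i) to (vi) as stated in the abstract."""
    return (
        d.n_subjects > 500               # (i) more than 500 human subjects
        and d.paired_geno_pheno          # (ii)
        and d.has_wgs_or_wes             # (iii)
        and d.n_phenotype_vars >= 100    # (iv) at least 100 variables
        and d.accessible                 # (v)
        and d.access_info_in_english     # (vi)
    )


if __name__ == "__main__":
    candidates = [
        Dataset("cohort_a", 12000, True, True, 350, True, True),
        Dataset("cohort_b", 400, True, True, 350, True, True),
    ]
    print([d.name for d in candidates if meets_inclusion_criteria(d)])
```

Note the different comparison operators: criterion (i) is a strict "more than 500", while criterion (iv) is "at least 100".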
18.
BMC Genomics ; 21(Suppl 6): 405, 2020 Dec 21.
Article in English | MEDLINE | ID: mdl-33349236

ABSTRACT

BACKGROUND: Analysis of heterogeneous populations such as viral quasispecies is one of the most challenging bioinformatics problems. Although machine learning models are becoming widely employed for analysis of sequence data from such populations, their straightforward application is impeded by multiple challenges associated with technological limitations and biases, the difficulty of selecting relevant features, and the need to compare genomic datasets of different sizes and structures. RESULTS: We propose a novel preprocessing approach to transform irregular genomic data into normalized image data. Such a representation allows the problems of classification and comparison of heterogeneous populations to be restated as image classification problems, which can be solved using a variety of available machine learning tools. We then apply the proposed approach to two important problems in molecular epidemiology: inference of viral infection stage and detection of viral transmission clusters using next-generation sequencing data. The infection staging method has been applied to HCV HVR1 samples collected from 108 recently and 257 chronically infected individuals. The SVM-based image classification approach achieved more than 95% accuracy for both recently and chronically HCV-infected individuals. Clustering has been performed on the data collected from 33 epidemiologically curated outbreaks, yielding more than 97% accuracy. CONCLUSIONS: The sequence image normalization method allows for a robust conversion of genomic data into numerical data and overcomes several issues associated with applying machine learning methods to viral populations. Image data also help in the visualization of genomic data. Experimental results demonstrate that the proposed method can be successfully applied to different problems in molecular epidemiology and surveillance of viral diseases. Simple binary classifiers and clustering techniques applied to the image data are equally or more accurate than other models.


Subject(s)
Genomics; Machine Learning; Algorithms; Cluster Analysis; Computational Biology; Humans; Quasispecies
19.
Pathol Res Pract ; 216(9): 153051, 2020 Sep.
Article in English | MEDLINE | ID: mdl-32825935

ABSTRACT

BACKGROUND: Neuroendocrine carcinomas (NECs) arise from neuroendocrine cells present throughout the body, and often present with metastases even with small and undetectable primary tumors. Additionally, neuroendocrine differentiation can be seen in carcinomas of non-neuroendocrine origin, further complicating the landscape of metastatic NECs. Organ-specific immunohistochemical markers such as TTF1, CDX2 and PAX8 are often lost in high grade tumors and may be non-contributory in localizing the primary site. Though NECs share a common cellular origin, they exhibit great variability in biologic behavior, prognosis and treatment based on the primary organ of origin. DESIGN: Twenty-one cases of metastatic NECs were retrieved from our archives and were classified based on the location of the primary tumor, derived from clinical and radiological findings. Next generation sequencing data were retrieved and analyzed for recurrent genetic abnormalities in these cases. Statistical analysis was performed using IBM SPSS 25 software. RESULTS: RB1 mutations were exclusive to NECs metastasizing from a lung primary and were detected in 5 of 12 (41.6%) cases (p = 0.04). CDKN gene family (CDKN1B and CDKN2A) mutations were limited to metastatic NECs of non-pulmonary origin and were detected in 4 of 9 (44.4%) cases (p = 0.02). CONCLUSION: The location of the primary tumor in metastatic NECs appears to have significant prognostic and therapeutic implications. However, due to morphological homogeneity, higher tumor grade, variable sensitivity of immunohistochemical markers, and small, often undetectable primary tumors, localizing the primary tumor in cases of metastatic NECs is a challenge. In this study, RB1 and CDKN gene family mutations are identified as possible markers for differentiating pulmonary and non-pulmonary origin in metastatic NECs.


Subject(s)
Carcinoma, Neuroendocrine/genetics; Cyclin-Dependent Kinases/genetics; Neuroendocrine Tumors/genetics; Retinoblastoma/genetics; Adult; Aged; Aged, 80 and over; Biomarkers, Tumor/analysis; Carcinoma, Neuroendocrine/pathology; Cell Differentiation/physiology; Co-Repressor Proteins; Female; Humans; Lung/pathology; Male; Middle Aged; Mutation/genetics; Neuroendocrine Tumors/pathology; Prognosis; Retinal Neoplasms/genetics; Retinal Neoplasms/pathology; Retinoblastoma/metabolism
20.
Front Genet ; 11: 434, 2020.
Article in English | MEDLINE | ID: mdl-32499814

ABSTRACT

Copy number variation (CNV) is a very important phenomenon in tumor genomes and plays a significant role in tumorigenesis. Accurate detection of CNVs has become a routine and necessary procedure for in-depth investigation of tumor cells and diagnosis of tumor patients. Next-generation sequencing (NGS) has provided a wealth of data for the detection of CNVs at base-pair resolution. However, this task is usually influenced by a number of factors, including GC-content bias, sequencing errors, and correlations among adjacent positions within CNVs. Although many existing methods have dealt with some of these artifacts by designing their own strategies, a comprehensive consideration of all the factors is still lacking. In this paper, we propose a new method, MFCNV, for accurate detection of CNVs from NGS data. Compared with existing methods, the proposed method has the following characteristics: (1) it fully considers the intrinsic correlations among adjacent positions in the genome to be analyzed, (2) it calculates read depth, GC-content bias, base quality, and correlation value for each genome bin and combines them as multiple features for the evaluation of genome bins, and (3) it addresses the joint effect among the factors by training a neural network algorithm for the prediction of CNVs. We test the performance of the MFCNV method using simulation and real sequencing data and make comparisons with several peer methods. The results demonstrate that our method is superior to other methods in terms of sensitivity, precision, and F1-score and can detect many CNVs that other methods have not discovered. MFCNV is expected to be a complementary tool in the analysis of mutations in tumor genomes and can be extended to the analysis of single-cell sequencing data.
