Results 1 - 20 of 181
1.
Cortex ; 180: 18-34, 2024 Sep 11.
Article in English | MEDLINE | ID: mdl-39305720

ABSTRACT

There are recognized neuroimaging regions of interest in typical Alzheimer's disease which have been used to track disease progression and aid prognostication. However, there is a need for validated baseline imaging markers to predict clinical decline in atypical Alzheimer's disease. We aimed to address this need by producing models from baseline imaging features using penalized regression and evaluating their predictive performance on various clinical measures. Baseline multimodal imaging data, in combination with clinical testing data at two time points from 46 atypical Alzheimer's disease patients with a diagnosis of logopenic progressive aphasia (N = 24) or posterior cortical atrophy (N = 22), were used to generate our models. An additional 15 patients (logopenic progressive aphasia = 7, posterior cortical atrophy = 8), whose data were not used in our original analysis, were used to test our models. Patients underwent MRI, FDG-PET and Tau-PET imaging and a full neurologic battery at two time points. The Schaefer functional atlas was used to extract network-based and regional gray matter volume or PET SUVR values from baseline imaging. Penalized regression (Elastic Net) was used to create models to predict scores on testing at Time 2 while controlling for baseline performance, education, age, and sex. In addition, we created comparison models using clinical or Meta Region of Interest (ROI) data. We found that the degree of baseline involvement on neuroimaging predicted future performance on cognitive testing, controlling for the above measures, across all three imaging modalities. In many cases, model predictability improved when network-based neuroimaging data were added to clinical data. Our network-based models also outperformed the comparison models comprising only clinical data or a Meta ROI score. Creating predictive models from baseline imaging studies that are agnostic to clinical diagnosis, as described here, could prove invaluable in both clinical and research settings, particularly in the development and implementation of future disease-modifying therapies.
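A minimal sketch of the core modeling step described above: an elastic net (via the glmnet R package) predicting a Time-2 score from baseline imaging features while leaving the control covariates unpenalized. All data, dimensions, and variable names here are simulated placeholders, not the authors' data or code.

```r
# Elastic net with unpenalized covariates (penalty.factor = 0 exempts
# baseline score, education, age, and sex from shrinkage).
library(glmnet)

set.seed(1)
n <- 46; p <- 100                        # hypothetical feature count
imaging <- matrix(rnorm(n * p), n, p)    # baseline network volume/SUVR features
covars  <- cbind(baseline_score = rnorm(n), education = rnorm(n),
                 age = rnorm(n), sex = rbinom(n, 1, 0.5))
X <- cbind(covars, imaging)
y <- rnorm(n)                            # Time-2 cognitive score

pf  <- c(rep(0, ncol(covars)), rep(1, p))  # shrink imaging features only
fit <- cv.glmnet(X, y, alpha = 0.5, penalty.factor = pf)
predict(fit, newx = X, s = "lambda.min")
```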

2.
Accid Anal Prev ; 207: 107759, 2024 Nov.
Article in English | MEDLINE | ID: mdl-39214036

ABSTRACT

Crashes are disproportionately concentrated in disadvantaged areas. Despite the evident disparities in transportation safety, there has been limited exploration of quantitative approaches to incorporating equity considerations into road safety management. This study proposes a novel concept of equity-aware safety performance functions (SPFs), enabling a distinct treatment of equity-related variables such as race and income. Equity-aware SPFs introduce a fairness distance and integrate it into the log-likelihood function of the negative binomial regression as a form of partial lasso regularization. A parameter λ is used to control the importance of the regularization term. Equity-aware SPFs are developed for pedestrian-involved crashes at the census tract level in Virginia, USA, and then employed to compute the potential for safety improvement (PSI), a prevalent metric used in hotspot identification. Results show that equity-aware SPFs can diminish the effects of equity-related variables, including the poverty ratio, Black ratio, Asian ratio, and the ratio of households without vehicles, on the expected crash frequencies, generating higher PSIs for disadvantaged areas. Wilcoxon signed-rank tests show significant differences in the rankings of PSIs when equity awareness is considered, especially for disadvantaged areas. This study adds to the literature a new quantitative approach to harmonize equity and effectiveness considerations, empowering more equitable decision-making in safety management, such as allocating resources for safety enhancement.


Subject(s)
Accidents, Traffic; Pedestrians; Safety; Humans; Accidents, Traffic/prevention & control; Accidents, Traffic/statistics & numerical data; Pedestrians/statistics & numerical data; Virginia; Likelihood Functions; Vulnerable Populations; Safety Management; Income
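The fairness-penalized likelihood can be sketched directly: a negative binomial log-likelihood with a lasso-type penalty applied only to the equity-related coefficients, with λ setting the penalty strength. The covariates and data below are hypothetical stand-ins, and the paper's fairness-distance term is reduced here to a plain partial-lasso penalty for illustration.

```r
# Partial-lasso negative binomial SPF: only equity-related coefficients
# are penalized; the usual exposure terms are left untouched.
set.seed(1)
n <- 500
X_exposure <- cbind(intercept = 1, log_aadt = rnorm(n))   # usual SPF terms
X_equity   <- cbind(poverty = rnorm(n), no_vehicle = rnorm(n))
y <- rnbinom(n, size = 2, mu = exp(0.5 + 0.3 * X_exposure[, 2]))

negloglik <- function(par, lambda) {
  b_exp <- par[1:2]; b_eq <- par[3:4]; size <- exp(par[5])
  mu <- exp(X_exposure %*% b_exp + X_equity %*% b_eq)
  -sum(dnbinom(y, size = size, mu = mu, log = TRUE)) +
    lambda * sum(abs(b_eq))        # penalty hits equity coefficients only
}
fit <- optim(rep(0, 5), negloglik, lambda = 5)  # Nelder-Mead: no gradient needed
fit$par
```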
3.
Stat Methods Med Res ; : 9622802241267355, 2024 Aug 19.
Article in English | MEDLINE | ID: mdl-39158499

ABSTRACT

In cancer research, basket trials aim to assess the efficacy of a drug using baskets, wherein patients are organized into subgroups according to their tumor type. In this context, an information-borrowing strategy may increase the probability of detecting drug efficacy in active baskets by shrinking together the estimates of the parameters characterizing drug efficacy in baskets with similar drug activity. Here, we propose fusion-penalized logistic regression models to borrow information in the setting of a phase 2 single-arm basket trial with a binary outcome. We describe our proposed strategy and assess its performance via a simulation study, evaluating the impact of heterogeneity in drug efficacy, the prevalence of each tumor type, and the implementation of interim analyses on the operating characteristics of our proposed design. We compared our approach with two existing designs that rely on the specification of prior information in a Bayesian framework to borrow information across similar baskets. Notably, our approach performed well when the effect of the drug varied greatly across baskets. Our approach offers several advantages, including limited implementation effort and fast computation, which is essential when planning a new trial, as such planning requires intensive simulation studies.
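A toy sketch of the fusion idea, under the simplifying assumption that each basket contributes a binomial response count and the penalty is a fused lasso on all pairwise differences of basket-level log-odds (the paper's exact formulation may differ):

```r
# Fusion-penalized borrowing: shrink pairwise differences between
# basket-specific log-odds so similar baskets share information.
set.seed(1)
responses <- c(8, 7, 2)          # hypothetical responders per basket
n_pat     <- c(20, 20, 20)

penalized_nll <- function(theta, lambda) {
  -sum(dbinom(responses, n_pat, plogis(theta), log = TRUE)) +
    lambda * sum(abs(outer(theta, theta, "-")[upper.tri(diag(3))]))
}
fit <- optim(rep(0, 3), penalized_nll, lambda = 2)
plogis(fit$par)                  # shrunken per-basket response rates
```

With lambda = 0 the estimates reduce to the raw basket response rates; a large lambda fuses all baskets toward a common rate.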

4.
Int J Biometeorol ; 2024 Aug 31.
Article in English | MEDLINE | ID: mdl-39215818

ABSTRACT

Crop yield prediction is of growing importance to all stakeholders in agriculture. Since the growth and development of crops are closely connected with many weather factors, it is essential to incorporate meteorological information into any yield prediction mechanism. Changes in the climate-yield relationship are more pronounced at a local level than across relatively large regions, so district- or sub-region-level modeling may be an appropriate approach. To obtain a location- and crop-specific model, different models with different functional forms have to be explored. This systematic review discusses research papers on the statistical and machine-learning models commonly used to predict crop yield from weather factors. Artificial Neural Networks (ANN) and Multiple Linear Regression were the most frequently applied models. The Support Vector Regression (SVR) model has a high success ratio, performing well in most cases. The optimization options in ANN and SVR models allow models to be tuned to the specific patterns of association between the weather conditions of a location and crop yield. An ANN can be trained with different activation functions, an optimized learning rate, and a tuned number of hidden-layer neurons. Similarly, the SVR model can be trained with different kernel functions and various combinations of hyperparameters. Penalized regression models, namely LASSO and Elastic Net, are better alternatives to simple linear regression. The nonlinear machine-learning models, SVR and ANN, performed better in most cases, indicating a complex nonlinear association between crop yield and weather factors.
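As a hedged illustration of two model families the review highlights, the sketch below fits a radial-kernel SVR (e1071) and an elastic net (glmnet) to simulated weather-yield data; the weather variables are invented for the example.

```r
# SVR with a radial kernel vs. an elastic net on toy weather-yield data.
library(e1071)
library(glmnet)

set.seed(1)
n <- 120
weather <- matrix(rnorm(n * 6), n, 6,
                  dimnames = list(NULL, c("rain", "tmax", "tmin",
                                          "humidity", "wind", "solar")))
yield <- 3 + sin(weather[, "rain"]) + 0.5 * weather[, "tmax"] + rnorm(n, 0, 0.3)

svr  <- svm(weather, yield, kernel = "radial", cost = 10, epsilon = 0.1)
enet <- cv.glmnet(weather, yield, alpha = 0.5)   # penalized linear baseline
```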

5.
Proc Natl Acad Sci U S A ; 121(33): e2403210121, 2024 Aug 13.
Article in English | MEDLINE | ID: mdl-39110727

ABSTRACT

Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, but existing methods face several limitations, including computational burden, predictive accuracy, and adaptability to a wide range of genetic architectures. To address these issues, we propose Aggregated L0Learn using Summary-level data (ALL-Sum), a fast and scalable ensemble learning method for computing PRS using summary statistics from genome-wide association studies (GWAS). ALL-Sum leverages an L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures. In extensive large-scale simulations across a wide range of polygenicity and GWAS sample sizes, ALL-Sum consistently outperformed popular alternative methods in prediction accuracy, runtime, and memory usage by 10%, 20-fold, and threefold, respectively, and demonstrated robustness to diverse genetic architectures. We validated the performance of ALL-Sum in real data analysis of 11 complex traits using GWAS summary statistics from nine data sources, including the Global Lipids Genetics Consortium, the Breast Cancer Association Consortium, and the FinnGen Biobank, with validation in the UK Biobank. Our results show that, on average, ALL-Sum obtained PRS with 25% higher accuracy, 15 times faster computation, and half the memory of current state-of-the-art methods, with robust performance across a wide range of traits and diseases. Furthermore, our method demonstrates stable prediction when using linkage disequilibrium computed from different data sources. ALL-Sum is available as a user-friendly R software package with publicly available reference data for streamlined analysis.


Subject(s)
Genome-Wide Association Study; Multifactorial Inheritance; Humans; Multifactorial Inheritance/genetics; Genome-Wide Association Study/methods; Machine Learning; Genetic Predisposition to Disease; Polymorphism, Single Nucleotide
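The L0L2 penalty family at the core of ALL-Sum can be tried out with the L0Learn R package. The sketch below runs it on simulated individual-level genotypes; it is not the ALL-Sum pipeline, which instead operates on GWAS summary statistics and LD reference data.

```r
# L0L2-penalized regression on toy genotype data with L0Learn.
library(L0Learn)

set.seed(1)
n <- 200; p <- 1000
G <- matrix(rbinom(n * p, 2, 0.3), n, p)       # toy allele counts (0/1/2)
beta <- c(rnorm(20, 0, 0.3), rep(0, p - 20))   # sparse true effects
y <- as.numeric(G %*% beta + rnorm(n))

# Cross-validated path over the L0 (sparsity) and L2 (ridge) penalties.
cvfit <- L0Learn.cvfit(G, y, penalty = "L0L2", nGamma = 5)
```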
6.
J Psychiatr Res ; 176: 442-451, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38981238

ABSTRACT

Previous efforts to build statistical models for predicting the risk of suicidal behavior with machine-learning analysis can yield high-accuracy models that reflect overfitting, and internal validation cannot completely address this problem. In this study, we created models for predicting the occurrence of suicide attempts among Koreans at high risk of suicide, and we verified these models in an independent cohort. We performed logistic and penalized regression for suicide attempts within 6 months among suicidal ideators and attempters in the Korean Cohort for the Model Predicting a Suicide and Suicide-related Behavior (K-COMPASS). We then validated the models in a test cohort. Our findings indicated that several factors significantly predicted suicide attempts in the models, including young age, suicidal ideation, previous suicide attempts, anxiety, alcohol abuse, stress, and impulsivity. The area under the curve and the positive predictive value were 0.941 and 0.484, respectively, after variable selection, and 0.751 and 0.084 in the test cohort. The corresponding values for the penalized regression model were 0.943 and 0.524 in the original training cohort and 0.794 and 0.115 in the test cohort. The prediction models, constructed through a prospective cohort study of a high-risk suicide group, showed satisfactory accuracy even in the test cohort. The accuracy with penalized regression was greater than that with the "classical" logistic model.


Subject(s)
Machine Learning; Suicidal Ideation; Suicide, Attempted; Humans; Suicide, Attempted/statistics & numerical data; Male; Female; Republic of Korea/epidemiology; Adult; Young Adult; Prospective Studies; Logistic Models; Middle Aged; Adolescent; Risk Factors
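A minimal sketch of the penalized-regression arm of such a study: lasso logistic regression tuned by cross-validated AUC on a training cohort, then evaluated on held-out subjects. Data and effect sizes are simulated.

```r
# Lasso logistic regression with a train/test split, tuned by CV AUC.
library(glmnet)

set.seed(1)
n <- 400; p <- 30
X <- matrix(rnorm(n * p), n, p)              # toy clinical predictors
y <- rbinom(n, 1, plogis(X[, 1] - 0.5 * X[, 2]))
train <- sample(n, 300)

fit  <- cv.glmnet(X[train, ], y[train], family = "binomial",
                  type.measure = "auc")
pred <- predict(fit, X[-train, ], s = "lambda.min", type = "response")
```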
7.
Comput Struct Biotechnol J ; 23: 2478-2486, 2024 Dec.
Article in English | MEDLINE | ID: mdl-38952424

ABSTRACT

Gene expression plays a pivotal role in various diseases, contributing significantly to their mechanisms. Most GWAS risk loci lie in non-coding regions, potentially affecting disease risk by altering gene expression in specific tissues. This expression is notably tissue-specific, and genetic variants substantially influence it. However, accurately detecting expression quantitative trait loci (eQTL) is challenging due to limited heritability in gene expression, extensive linkage disequilibrium (LD), and multiple causal variants. The single-variant association approach in eQTL analysis is limited by its inability to capture the combined effects of multiple variants and by a bias towards common variants, underscoring the need for a more robust method to accurately identify causal eQTL variants. To address this, we developed an algorithm, CausalEQTL, which integrates L0+L1 penalized regression with an ensemble approach to localize eQTL, thereby enhancing prediction performance. Our results demonstrate that CausalEQTL outperforms traditional models, including LASSO, Elastic Net, and Ridge, in terms of power and overall performance. Furthermore, analysis of heart tissue data from the GTEx project revealed that eQTL sites identified by our algorithm provide deeper insights into heart-related tissue eQTL detection. This advancement in eQTL mapping promises to improve our understanding of the genetic basis of tissue-specific gene expression and its implications in disease. The source code and identified causal eQTLs for CausalEQTL are available on GitHub: https://github.com/zhc-moushang/CausalEQTL.
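The two named ingredients, L0+L1 penalized regression and ensembling, can be sketched with the L0Learn package and a small bootstrap loop that ranks variants by how often they are selected. This illustrates the ingredients only; it is not the CausalEQTL implementation.

```r
# Bootstrap ensemble of L0L1-penalized fits; rank variants by
# selection frequency across resamples.
library(L0Learn)

set.seed(1)
n <- 150; p <- 300
G <- matrix(rbinom(n * p, 2, 0.3), n, p)   # toy genotypes near a gene
expr <- 0.8 * G[, 5] + rnorm(n)            # toy expression, one causal variant

sel <- integer(p)
for (b in 1:25) {
  idx  <- sample(n, replace = TRUE)
  fit  <- L0Learn.fit(G[idx, ], expr[idx], penalty = "L0L1", maxSuppSize = 10)
  lams <- fit$lambda[[1]]
  beta <- as.vector(coef(fit, lambda = lams[length(lams)],
                         gamma = fit$gamma[1]))[-1]
  sel  <- sel + (beta != 0)
}
head(order(sel, decreasing = TRUE))        # candidate causal variants
```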

8.
Genet Epidemiol ; 2024 Jul 09.
Article in English | MEDLINE | ID: mdl-38982682

ABSTRACT

The prediction of the susceptibility of an individual to a certain disease is an important and timely research area. An established technique is to estimate the risk of an individual with the help of an integrated risk model, that is, a polygenic risk score with added epidemiological covariates. However, integrated risk models do not capture any time dependence and provide only a point estimate of the relative risk with respect to a reference population. The aim of this work is twofold. First, we explore and advocate the idea of predicting the time-dependent hazard and survival (defined as disease-free time) of an individual for the onset of a disease. This provides a practitioner with a much more differentiated view of absolute survival as a function of time. Second, to compute the time-dependent risk of an individual, we use published methodology to fit a Cox proportional hazards model to data from a genetic SNP study of time to Alzheimer's disease (AD) onset, using the lasso to incorporate further epidemiological variables such as sex, APOE (apolipoprotein E, a genetic risk factor for AD) status, 10 leading principal components, and selected genomic loci. We apply the lasso for Cox proportional hazards to a data set of 6792 subjects (4102 AD cases and 2690 controls) and 87 covariates. We demonstrate that fitting a lasso model for Cox proportional hazards allows one to obtain more accurate survival curves than state-of-the-art (likelihood-based) methods. Moreover, the methodology yields personalized survival curves for a patient, giving a much more differentiated view of the expected progression of the disease than the view offered by integrated risk models. The runtime to compute personalized survival curves is under a minute for the entire AD data set, enabling the method to handle datasets with 60,000-100,000 subjects in less than 1 h.
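The lasso-Cox fit and per-patient survival curves can be sketched with glmnet's Cox family on simulated data (the dimensions loosely echo the 87-covariate setting; everything else is invented):

```r
# Lasso-penalized Cox model with a personalized survival curve.
library(glmnet)
library(survival)

set.seed(1)
n <- 300; p <- 87
X <- matrix(rnorm(n * p), n, p)
time   <- rexp(n, rate = exp(0.3 * X[, 1]))
status <- rbinom(n, 1, 0.8)                 # 1 = event observed
y <- Surv(time, status)

cvfit <- cv.glmnet(X, y, family = "cox")
# Personalized survival curve for subject 1 (glmnet's survfit method for
# Cox fits needs the original x and y to rebuild the baseline hazard).
sf <- survfit(cvfit, s = "lambda.min", x = X, y = y,
              newx = X[1, , drop = FALSE])
plot(sf)
```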

9.
Aging (Albany NY) ; 16(13): 10724-10748, 2024 Jul 09.
Article in English | MEDLINE | ID: mdl-38985449

ABSTRACT

Chronological age reveals the number of years an individual has lived since birth. By contrast, biological age varies between individuals of the same chronological age at a rate reflective of physiological decline. Differing rates of physiological decline are related to longevity and result from genetics, environment, behavior, and disease. The creation of methylation biological age predictors is a long-standing challenge in aging research due to the lack of individual pre-mortem longevity data. The consistent differences in longevity between domestic dog breeds enable the construction of biological age estimators which can, in turn, be contrasted with methylation measurements to elucidate mechanisms of biological aging. We draw on three flagship methylation studies using distinct measurement platforms and tissues to assess the feasibility of creating biological age methylation clocks in the dog. We expand epigenetic clock building strategies to accommodate phylogenetic relationships between individuals, thus controlling for the use of breed standard metrics. We observe that biological age methylation clocks are affected by population stratification and require heavy parameterization to achieve effective predictions. Finally, we observe that methylation-related markers reflecting biological age signals are rare and do not colocalize between datasets.


Subject(s)
Aging; DNA Methylation; Longevity; Animals; Dogs; Aging/genetics; Longevity/genetics; Epigenesis, Genetic
10.
Brief Bioinform ; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-38836403

ABSTRACT

In precision medicine, both predicting the disease susceptibility of an individual and forecasting disease-free survival are areas of key research. Besides the classical epidemiological predictor variables, data from multiple (omic) platforms are increasingly available. To integrate this wealth of information, we propose new methodology that combines cooperative learning, a recent approach to leverage the predictive power of several datasets, with polygenic hazard score models. Polygenic hazard score models provide a practitioner with a more differentiated view of predicted disease-free survival than the single point estimate computed, for instance, with a polygenic risk score. Our aim is to leverage the advantages of cooperative learning for the computation of polygenic hazard score models via Cox's proportional hazards model, thereby improving the prediction of disease-free survival. In our experimental study, we apply our methodology to forecast disease-free survival for Alzheimer's disease (AD) using three layers of data. One layer contains epidemiological variables such as sex, APOE (apolipoprotein E, a genetic risk factor for AD) status, and 10 leading principal components. Another layer contains selected genomic loci, and the last layer contains methylation data for selected CpG sites. We demonstrate that the survival curves computed via cooperative learning yield an AUC of around 0.7, above the state-of-the-art performance of its competitors. Importantly, the proposed methodology returns (1) a linear score that can be easily interpreted (in contrast to machine-learning approaches), and (2) a weighting of the predictive power of the involved data layers, allowing for an assessment of the importance of each omic (or other) platform. Like polygenic hazard score models, our methodology also allows one to compute individual survival curves for each patient.


Subject(s)
Alzheimer Disease; Precision Medicine; Humans; Precision Medicine/methods; Alzheimer Disease/genetics; Alzheimer Disease/mortality; Disease-Free Survival; Machine Learning; Proportional Hazards Models; Multifactorial Inheritance; Male; Female; Multiomics
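For the cooperative-learning ingredient, the quadratic "agreement" penalty admits a well-known augmented-data reformulation that reduces the problem to an ordinary lasso. The sketch below shows the two-view regression form with a continuous outcome on simulated layers; the Cox coupling used in the paper is not shown.

```r
# Two-view cooperative learning via the augmented-data trick:
# minimize ||y - X1 b1 - X2 b2||^2 + rho ||X1 b1 - X2 b2||^2 + lasso penalty.
library(glmnet)

set.seed(1)
n <- 200
X1 <- matrix(rnorm(n * 50), n, 50)   # e.g. genomic layer
X2 <- matrix(rnorm(n * 50), n, 50)   # e.g. methylation layer
y  <- X1[, 1] + X2[, 1] + rnorm(n)
rho <- 0.5                           # agreement (cooperation) weight

Xa <- rbind(cbind(X1, X2),
            sqrt(rho) * cbind(-X1, X2))   # agreement rows regress to 0
ya <- c(y, rep(0, n))
fit <- cv.glmnet(Xa, ya, standardize = FALSE)
```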
11.
J Biopharm Stat ; : 1-7, 2024 Apr 05.
Article in English | MEDLINE | ID: mdl-38578223

ABSTRACT

We describe an approach for combining and analyzing high-dimensional genomic and low-dimensional phenotypic data. The approach leverages a scheme of weights applied to the variables instead of the observations and hence permits incorporation of the information provided by the low-dimensional data source. It can also be incorporated into commonly used downstream techniques, such as random forest or penalized regression. Finally, simulated lupus studies involving genetic and clinical data are used to illustrate the overall idea and to show that the proposed enriched penalized method can select significant genetic variables while keeping several important clinical variables in the final model.
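One concrete way to weight variables rather than observations in a downstream learner, as suggested above, is the variable-sampling weights in the ranger random-forest package. The sketch below up-weights a handful of clinical variables relative to many SNPs; it illustrates the general idea only, not the authors' enrichment scheme.

```r
# Random forest where clinical variables are more likely than genomic
# ones to be drawn as candidate splitting variables.
library(ranger)

set.seed(1)
n <- 200
clin <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, paste0("clin", 1:3)))
geno <- matrix(rbinom(n * 500, 2, 0.3), n, 500,
               dimnames = list(NULL, paste0("snp", 1:500)))
y   <- rbinom(n, 1, plogis(clin[, 1] + 0.5 * geno[, 1]))
dat <- data.frame(y = factor(y), clin, geno)

w   <- c(rep(1, 3), rep(0.1, 500))   # selection weight per predictor
fit <- ranger(y ~ ., data = dat, split.select.weights = w)
```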

12.
Stat Med ; 43(6): 1119-1134, 2024 Mar 15.
Article in English | MEDLINE | ID: mdl-38189632

ABSTRACT

Tuning hyperparameters, such as the regularization parameter in Ridge or Lasso regression, is often aimed at improving the predictive performance of risk prediction models. In this study, various hyperparameter tuning procedures for clinical prediction models were systematically compared and evaluated in low-dimensional data. The focus was on out-of-sample predictive performance (discrimination, calibration, and overall prediction error) of risk prediction models developed using Ridge, Lasso, Elastic Net, or Random Forest. The influence of sample size, number of predictors and events fraction on performance of the hyperparameter tuning procedures was studied using extensive simulations. The results indicate important differences between tuning procedures in calibration performance, while generally showing similar discriminative performance. The one-standard-error rule for tuning applied to cross-validation (1SE CV) often resulted in severe miscalibration. Standard non-repeated and repeated cross-validation (both 5-fold and 10-fold) performed similarly well and outperformed the other tuning procedures. Bootstrap showed a slight tendency to more severe miscalibration than standard cross-validation-based tuning procedures. Differences between tuning procedures were larger for smaller sample sizes, lower events fractions and fewer predictors. These results imply that the choice of tuning procedure can have a profound influence on the predictive performance of prediction models. The results support the application of standard 5-fold or 10-fold cross-validation that minimizes out-of-sample prediction error. Despite an increased computational burden, we found no clear benefit of repeated over non-repeated cross-validation for hyperparameter tuning. We warn against the potentially detrimental effects on model calibration of the popular 1SE CV rule for tuning prediction models in low-dimensional settings.


Subject(s)
Research Design; Humans; Computer Simulation; Sample Size
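The two tuning rules compared above are both one-liners in glmnet, which makes the warning easy to act on: prefer the error-minimizing lambda over the 1SE rule when calibration matters. Simulated low-dimensional data for illustration:

```r
# lambda.min (minimizes CV deviance) vs. lambda.1se (one-standard-error
# rule; stronger shrinkage, flagged above as a miscalibration risk).
library(glmnet)

set.seed(1)
n <- 500; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1]))

cvfit <- cv.glmnet(X, y, family = "binomial", nfolds = 10)
coef(cvfit, s = "lambda.min")
coef(cvfit, s = "lambda.1se")
```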
13.
Biom J ; 66(1): e2200092, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37068189

ABSTRACT

Quantifying drug potency, which requires an accurate estimation of the dose-response relationship, is essential for drug development in biomedical research and the life sciences. However, the standard estimation procedure of the median-effect equation to describe the dose-response curve is vulnerable to extreme observations in common experimental data. To facilitate appropriate statistical inference, many powerful estimation tools have been developed in R, including various dose-response packages based on the nonlinear least squares method with different optimization strategies. Recently, beta regression-based methods have also been introduced for estimating the median-effect equation. In theory, they can overcome nonnormality, heteroscedasticity, and asymmetry, and they accommodate flexible robust frameworks and coefficient penalization. To identify reliable methods for estimating dose-response curves even with extreme observations, we conducted a comparative study reviewing 14 different tools in R, examining their robustness and efficiency via Monte Carlo simulation under a comprehensive list of scenarios. The simulation results demonstrate that penalized beta regression using the mgcv package outperforms the other methods in terms of stable, accurate estimation and reliable uncertainty quantification.


Subject(s)
Computer Simulation; Regression Analysis; Uncertainty; Monte Carlo Method
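Since the winning method is named, a minimal sketch is possible with mgcv's beta-regression family on simulated dose-response data (the simulation design below is invented; responses must lie strictly inside (0, 1)):

```r
# Penalized beta regression for a dose-response curve with mgcv.
library(mgcv)

set.seed(1)
dose  <- rep(10^seq(-2, 2, length.out = 9), each = 5)
ldose <- log10(dose)
fa    <- plogis(1.2 * (ldose - 0.3))          # true fraction-affected curve
y     <- rbeta(length(dose), 20 * fa, 20 * (1 - fa))
y     <- pmin(pmax(y, 1e-4), 1 - 1e-4)        # keep strictly inside (0, 1)

fit <- gam(y ~ s(ldose), family = betar(link = "logit"))
plot(fit, shade = TRUE)
```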
14.
Stat Sin ; 33(1): 27-53, 2023 Jan.
Article in English | MEDLINE | ID: mdl-37854586

ABSTRACT

In modern scientific research, data heterogeneity is commonly observed owing to the abundance of complex data. We propose a factor regression model for data with heterogeneous subpopulations. The proposed model can be represented as a decomposition of heterogeneous and homogeneous terms. The heterogeneous term is driven by latent factors in different subpopulations. The homogeneous term captures common variation in the covariates and shares common regression coefficients across subpopulations. Our proposed model attains a good balance between a global model and a group-specific model: the global model ignores the data heterogeneity, while the group-specific model fits each subgroup separately. We prove estimation and prediction consistency for our proposed estimators and show that they achieve better convergence rates than the group-specific and global models. We show that the extra cost of estimating latent factors is asymptotically negligible and the minimax rate is still attainable. We further demonstrate the robustness of our proposed method by studying its prediction error under a mis-specified group-specific model. Finally, we conduct simulation studies and analyze a data set from the Alzheimer's Disease Neuroimaging Initiative and an aggregated microarray data set to further demonstrate the competitiveness and interpretability of our proposed factor regression model.

15.
Brain Sci ; 13(8), 2023 Jul 28.
Article in English | MEDLINE | ID: mdl-37626488

ABSTRACT

Fear extinction is the basis of exposure therapies for posttraumatic stress disorder (PTSD), but half of patients do not improve. Predicting fear extinction in individuals with PTSD may inform personalized exposure therapy development. The participants were 125 trauma-exposed adults (96 female) with a range of PTSD symptoms. Electromyography, electrocardiogram, and skin conductance were recorded at baseline, during dark-enhanced startle, and during fear conditioning and extinction. Using a cross-validated, hold-out sample prediction approach, three penalized regressions and conventional ordinary least squares (OLS) regression were trained to predict fear-potentiated startle during extinction using 50 predictor variables (5 clinical, 24 self-reported, and 21 physiological). The predictors selected by the penalized regression algorithms were included in multivariable regression analyses, while univariate regressions assessed individual predictors. All the penalized regressions outperformed OLS in prediction accuracy and generalizability, as indexed by the lower mean squared error in the training and hold-out subsamples. During early extinction, the consistent predictors across all the modeling approaches included dark-enhanced startle, the depersonalization and derealization subscale of the dissociative experiences scale, and the PTSD hyperarousal symptom score. These findings offer novel insights into the modeling approaches and patient characteristics that may reliably predict fear extinction in PTSD. Penalized regression shows promise for identifying symptom-related variables to enhance predictive modeling accuracy in clinical research.
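The OLS-versus-penalized comparison on a hold-out sample reduces to a few lines; the sketch below scores ridge, elastic net, and lasso against OLS by hold-out mean squared error on simulated data with 50 predictors, echoing the setup above:

```r
# OLS vs. three penalized regressions, scored by hold-out MSE.
library(glmnet)

set.seed(1)
n <- 125; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)
train <- sample(n, 90)
mse <- function(yhat) mean((y[-train] - yhat)^2)

ols  <- lm.fit(cbind(1, X[train, ]), y[train])
fits <- lapply(c(ridge = 0, enet = 0.5, lasso = 1), function(a)
  cv.glmnet(X[train, ], y[train], alpha = a))
c(ols = mse(cbind(1, X[-train, ]) %*% ols$coefficients),
  sapply(fits, function(f) mse(predict(f, X[-train, ], s = "lambda.min"))))
```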

16.
Cell J ; 25(8): 536-545, 2023 Aug 01.
Article in English | MEDLINE | ID: mdl-37641415

ABSTRACT

OBJECTIVE: Metabolic syndrome (MetS) is a complex multifactorial disorder that considerably burdens healthcare systems. We aim to classify MetS using regularized machine-learning models in the presence of the risk variants of GCKR, BUD13 and APOA5, and environmental risk factors. MATERIALS AND METHODS: A cohort study was conducted on 2,346 cases and 2,203 controls from eligible Tehran Cardiometabolic Genetic Study (TCGS) participants whose data were collected from 1999 to 2017. We used different regularization approaches [least absolute shrinkage and selection operator (LASSO), ridge regression (RR), elastic net (ENET), adaptive LASSO (aLASSO), and adaptive ENET (aENET)] and a classical logistic regression (LR) model to classify MetS and to select influential variables that predict MetS. Demographics, clinical features, and common polymorphisms in the GCKR, BUD13 and APOA5 genes of eligible participants were assessed to classify TCGS participant status in MetS development. The models' performance was evaluated by 10-times repeated 10-fold cross-validation. Assessment measures of sensitivity, specificity, classification accuracy, and area under the receiver operating characteristic curve (AUC-ROC) and AUC-precision-recall (AUC-PR) curves were used to compare the models. RESULTS: During the follow-up period, 50.38% of participants developed MetS. The groups were not similar in terms of baseline characteristics and risk variants. MetS was significantly associated with age, gender, schooling years, body mass index (BMI), and alternate alleles in all the risk variants, as indicated by LR. A comparison of accuracy, AUC-ROC, and AUC-PR metrics indicated that the regularization models outperformed LR. The regularized machine-learning models provided comparable classification performance, while the aLASSO model was more parsimonious and selected fewer predictors. CONCLUSION: Regularized machine-learning models provided more accurate and parsimonious MetS classification models. These high-performing diagnostic models can lay the foundation for clinical decision support tools that use genetic and demographic variables to locate individuals at high risk for MetS.
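Among the regularization approaches listed, the adaptive LASSO is the least standard to set up; a common recipe (sketched below on simulated data) derives its weights from a preliminary ridge fit, so that strong predictors are penalized less:

```r
# Adaptive LASSO: ridge coefficients supply per-variable penalty weights.
library(glmnet)

set.seed(1)
n <- 400; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] + 0.5 * X[, 2]))

ridge <- cv.glmnet(X, y, family = "binomial", alpha = 0)
w <- 1 / abs(as.vector(coef(ridge, s = "lambda.min"))[-1])  # drop intercept
alasso <- cv.glmnet(X, y, family = "binomial", alpha = 1,
                    penalty.factor = w)
```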

17.
J Biopharm Stat ; : 1-25, 2023 Jul 17.
Article in English | MEDLINE | ID: mdl-37455635

ABSTRACT

We propose a new approach to selecting the regularization parameter in penalized regression, using a new version of the generalized information criterion (GIC). We prove the identifiability of the bridge regression model as a prerequisite of statistical modeling. We then propose the asymptotically efficient generalized information criterion (AGIC) and prove that it has asymptotic loss efficiency. We also verify that AGIC performs better than older versions of GIC. Furthermore, based on numerical studies, we propose MSE search paths to order the features selected by lasso regression; these search paths compensate for the lack of feature ordering in the lasso regression model. The performance of AGIC is compared with other types of GIC using MSE and model utility in a simulation study. We apply AGIC and the other criteria to breast cancer, prostate cancer, and Parkinson's disease datasets. The results confirm the superiority of AGIC in almost all situations.
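For orientation, the generic GIC family selects the regularization parameter by minimizing a penalized fit criterion; AGIC corresponds to a particular, asymptotically efficient choice of the penalty sequence $a_n$ (the exact form is given in the paper and not reproduced here):

$$\mathrm{GIC}_{a_n}(\lambda) = -2\,\ell\big(\hat{\beta}_\lambda\big) + a_n\,\mathrm{df}(\lambda),$$

where $\ell$ is the log-likelihood, $\mathrm{df}(\lambda)$ is the effective degrees of freedom of the fit, and the choices $a_n = 2$ and $a_n = \log n$ recover AIC and BIC, respectively.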

18.
Stat Sin ; 33(2): 633-662, 2023 Apr.
Article in English | MEDLINE | ID: mdl-37197479

ABSTRACT

Recent technological advances have made it possible to measure multiple types of many features in biomedical studies. However, some data types or features may not be measured for all study subjects because of cost or other constraints. We use a latent variable model to characterize the relationships across and within data types and to infer missing values from observed data. We develop a penalized-likelihood approach for variable selection and parameter estimation and devise an efficient expectation-maximization algorithm to implement our approach. We establish the asymptotic properties of the proposed estimators when the number of features increases at a polynomial rate of the sample size. Finally, we demonstrate the usefulness of the proposed methods using extensive simulation studies and provide an application to a motivating multi-platform genomics study.

19.
bioRxiv ; 2023 Mar 16.
Article in English | MEDLINE | ID: mdl-36993280

ABSTRACT

Bulk transcriptomics in tissue samples reflects the average expression levels across different cell types and is highly influenced by cellular fractions. As such, it is critical to estimate cellular fractions to both deconfound differential expression analyses and infer cell type-specific differential expression. Since experimentally counting cells is infeasible in most tissues and studies, in silico cellular deconvolution methods have been developed as an alternative. However, existing methods are designed for tissues consisting of clearly distinguishable cell types and have difficulties estimating highly correlated or rare cell types. To address this challenge, we propose Hierarchical Deconvolution (HiDecon) that uses single-cell RNA sequencing references and a hierarchical cell type tree, which models the similarities among cell types and cell differentiation relationships, to estimate cellular fractions in bulk data. By coordinating cell fractions across layers of the hierarchical tree, cellular fraction information is passed up and down the tree, which helps correct estimation biases by pooling information across related cell types. The flexible hierarchical tree structure also enables estimating rare cell fractions by splitting the tree to higher resolutions. Through simulations and real data applications with the ground truth of measured cellular fractions, we demonstrate that HiDecon significantly outperforms existing methods and accurately estimates cellular fractions.
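The baseline that hierarchical methods like HiDecon refine can be written in a few lines: regress the bulk profile on reference signatures under non-negativity and renormalize. The sketch below uses the nnls package on simulated signatures; HiDecon's tree-based coordination across cell-type resolutions is not shown.

```r
# Non-negative least squares deconvolution of one bulk sample.
library(nnls)

set.seed(1)
g <- 200; k <- 4
sig  <- matrix(rexp(g * k), g, k)        # reference signatures (genes x types)
frac <- c(0.5, 0.3, 0.15, 0.05)          # ground-truth fractions
bulk <- as.numeric(sig %*% frac + rnorm(g, 0, 0.05))

est <- nnls(sig, bulk)$x
est / sum(est)                           # estimated cellular fractions
```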

20.
BMC Bioinformatics ; 24(1): 82, 2023 Mar 06.
Article in English | MEDLINE | ID: mdl-36879227

ABSTRACT

BACKGROUND: One of the main challenges of microbiome analysis is its compositional nature, which, if ignored, can lead to spurious results. Addressing the compositional structure of microbiome data is particularly critical in longitudinal studies, where abundances measured at different times can correspond to different sub-compositions. RESULTS: We developed coda4microbiome, a new R package for analyzing microbiome data within the Compositional Data Analysis (CoDA) framework in both cross-sectional and longitudinal studies. The aim of coda4microbiome is prediction; more specifically, the method is designed to identify a model (microbial signature) containing the minimum number of features with the maximum predictive power. The algorithm relies on the analysis of log-ratios between pairs of components, and variable selection is addressed through penalized regression on the "all-pairs log-ratio model", the model containing all possible pairwise log-ratios. For longitudinal data, the algorithm infers dynamic microbial signatures by performing penalized regression over a summary of the log-ratio trajectories (the area under these trajectories). In both cross-sectional and longitudinal studies, the inferred microbial signature is expressed as the (weighted) balance between two groups of taxa: those that contribute positively to the microbial signature and those that contribute negatively. The package provides several graphical representations that facilitate the interpretation of the analysis and the identified microbial signatures. We illustrate the new method with data from a Crohn's disease study (cross-sectional) and from the developing microbiome of infants (longitudinal). CONCLUSIONS: coda4microbiome is a new algorithm for the identification of microbial signatures in both cross-sectional and longitudinal studies. The algorithm is implemented as an R package available on CRAN ( https://cran.r-project.org/web/packages/coda4microbiome/ ) and is accompanied by a vignette with a detailed description of the functions. The website of the project contains several tutorials: https://malucalle.github.io/coda4microbiome/.


Subject(s)
Algorithms; Microbiota; Infant; Humans; Cross-Sectional Studies; Data Analysis; Longitudinal Studies
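The "all-pairs log-ratio model" is easy to reproduce in miniature: form every pairwise log-ratio between taxa and let a penalized logistic regression pick the informative pairs. The toy sketch below uses glmnet; the real package adds zero handling, signature balances, and the longitudinal trajectory summaries.

```r
# Penalized regression on all pairwise log-ratios (cross-sectional case).
library(glmnet)

set.seed(1)
n <- 100; k <- 10
counts <- matrix(rpois(n * k, 50) + 1, n, k)   # +1 avoids log(0)
logab  <- log(counts)
pairs  <- combn(k, 2)                          # all 45 taxon pairs
Z <- apply(pairs, 2, function(p) logab[, p[1]] - logab[, p[2]])
y <- rbinom(n, 1, plogis(Z[, 1]))              # toy binary outcome

fit <- cv.glmnet(Z, y, family = "binomial")
```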