Results 1 - 20 of 83
1.
J Am Stat Assoc ; 119(545): 81-94, 2024.
Article in English | MEDLINE | ID: mdl-39185398

ABSTRACT

In the emerging field of materials informatics, a fundamental task is to identify physicochemically meaningful descriptors, or materials genes, which are engineered from primary features and a set of elementary algebraic operators through compositions. Standard practice directly analyzes the high-dimensional candidate predictor space in a linear model; statistical analyses are then substantially hampered by the daunting challenge posed by the astronomically large number of correlated predictors with limited sample size. We formulate this problem as variable selection with operator-induced structure (OIS) and propose a new method to achieve unconventional dimension reduction by utilizing the geometry embedded in OIS. Although the model remains linear, we iterate nonparametric variable selection for effective dimension reduction. This enables variable selection based on ab initio primary features, leading to a method that is orders of magnitude faster than existing methods, with improved accuracy. To select the nonparametric module, we discuss a desired performance criterion that is uniquely induced by variable selection with OIS; in particular, we propose to employ a Bayesian Additive Regression Trees (BART)-based variable selection method. Numerical studies show superiority of the proposed method, which continues to exhibit robust performance when the input dimension is out of reach of existing methods. Our analysis of single-atom catalysis identifies physical descriptors that explain the binding energy of metal-support pairs with high explanatory power, leading to interpretable insights to guide the prevention of a notorious problem called sintering and aid catalysis design.

2.
J Am Stat Assoc ; 119(545): 320-331, 2024.
Article in English | MEDLINE | ID: mdl-38716405

ABSTRACT

There is a growing interest in the estimation of the number of unseen features, mostly driven by biological applications. A recent work brought out a peculiar property of the popular completely random measures (CRMs) as prior models in Bayesian nonparametric (BNP) inference for the unseen-features problem: for fixed prior parameters, they all lead to a Poisson posterior distribution for the number of unseen features, which depends on the sampling information only through the sample size. CRMs are thus not a flexible prior model for the unseen-features problem and, while the Poisson posterior distribution may be appealing for analytical tractability and ease of interpretation, its independence from the sampling information makes the BNP approach a questionable oversimplification, with posterior inferences being completely determined by the estimation of unknown prior parameters. In this article, we introduce the stable-Beta scaled process (SB-SP) prior, and we show that it enriches the posterior distribution of the number of unseen features arising under CRM priors, while maintaining its analytical tractability and interpretability. That is, the SB-SP prior leads to a negative binomial posterior distribution, which depends on the sampling information through the sample size and the number of distinct features, with corresponding estimates being simple, linear in the sampling information and computationally efficient. We apply our BNP approach to synthetic data and to real cancer genomic data, showing that: (i) it outperforms the most popular parametric and nonparametric competitors in terms of estimation accuracy; (ii) it provides improved coverage for the estimation with respect to a BNP approach under CRM priors. Supplementary materials for this article are available online.

3.
J R Stat Soc Ser A Stat Soc ; 187(2): 496-512, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38617597

ABSTRACT

Dietary assessments provide snapshots of population-based dietary habits. Questions remain about how generalisable those snapshots are in national survey data, where certain subgroups are sampled disproportionately. We propose a Bayesian overfitted latent class model to derive dietary patterns, accounting for survey design and sampling variability. Compared to standard approaches, our model showed improved identifiability of the true population pattern and prevalence in simulation. We apply this model to identify the intake patterns of adults living at or below the 130% poverty income level. Five dietary patterns were identified and characterised, with reproducible code and data made available to encourage further research.

4.
Biometrics ; 80(2)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38640436

ABSTRACT

Several epidemiological studies have provided evidence that long-term exposure to fine particulate matter (PM2.5) increases the mortality rate. Furthermore, some population characteristics (e.g., age, race, and socioeconomic status) might play a crucial role in understanding vulnerability to air pollution. To inform policy, it is necessary to identify groups of the population that are more or less vulnerable to air pollution. In the causal inference literature, the group average treatment effect (GATE) is a distinctive facet of the conditional average treatment effect. This widely employed metric serves to characterize the heterogeneity of a treatment effect based on some population characteristics. In this paper, we introduce a novel Confounder-Dependent Bayesian Mixture Model (CDBMM) to characterize causal effect heterogeneity. More specifically, our method leverages the flexibility of the dependent Dirichlet process to model the distribution of the potential outcomes conditionally on the covariates and the treatment levels, thus enabling us to: (i) identify heterogeneous and mutually exclusive population groups defined by similar GATEs in a data-driven way, and (ii) estimate and characterize the causal effects within each of the identified groups. Through simulations, we demonstrate the effectiveness of our method in uncovering key insights about treatment effect heterogeneity. We apply our method to claims data from Medicare enrollees in Texas. We found six mutually exclusive groups where the causal effects of PM2.5 on mortality rate are heterogeneous.


Subject(s)
Air Pollutants, Air Pollution, United States/epidemiology, Air Pollutants/adverse effects, Air Pollutants/analysis, Bayes Theorem, Medicare, Air Pollution/adverse effects, Air Pollution/analysis, Particulate Matter/adverse effects, Particulate Matter/analysis, Environmental Exposure/adverse effects
5.
J Appl Stat ; 51(2): 388-405, 2024.
Article in English | MEDLINE | ID: mdl-38283054

ABSTRACT

Maternal depression and anxiety through pregnancy have lasting societal impacts. It is thus crucial to understand the trajectories of their progression from preconception to the postnatal period, and the risk factors associated with them. Within the Bayesian framework, we propose to jointly model seven outcomes, of which two are physiological and five are non-physiological indicators of maternal depression and anxiety over time. We model the former two by a Gaussian process and the latter five by an autoregressive model, while imposing a multidimensional Dirichlet process prior on the subject-specific random effects to account for subject heterogeneity and induce clustering. The model allows for the inclusion of covariates through a regression term. Our findings reveal four distinct clusters of trajectories of the seven health outcomes, characterising women's mental health progression from before to after pregnancy. Importantly, our results caution against the loose use of hair corticosteroids as a biomarker, or even a causal factor, for pregnancy mental health progression. Additionally, the regression analysis reveals a range of preconception determinants and risk factors for depressive and anxiety symptoms during pregnancy.

6.
Biometrics ; 79(4): 3907-3915, 2023 12.
Article in English | MEDLINE | ID: mdl-37349969

ABSTRACT

In longitudinal studies, it is not uncommon to make multiple attempts to collect a measurement after baseline. Recording whether these attempts are successful provides useful information for the purposes of assessing missing data assumptions. This is because measurements from subjects who provide the data after numerous failed attempts may differ from those who provide the measurement after fewer attempts. Previous models for these designs were parametric and/or did not allow sensitivity analysis. For the former, there are always concerns about model misspecification and for the latter, sensitivity analysis is essential when conducting inference in the presence of missing data. Here, we propose a new approach which minimizes issues with model misspecification by using Bayesian nonparametrics for the observed data distribution. We also introduce a novel approach for identification and sensitivity analysis. We re-analyze the repeated attempts data from a clinical trial involving patients with severe mental illness and conduct simulations to better understand the properties of our approach.


Subject(s)
Mental Disorders, Statistical Models, Humans, Bayes Theorem, Longitudinal Studies
7.
J Mach Learn Res ; 24(23)2023.
Article in English | MEDLINE | ID: mdl-37206375

ABSTRACT

Insights into complex, high-dimensional data can be obtained by discovering features of the data that match or do not match a model of interest. To formalize this task, we introduce the "data selection" problem: finding a lower-dimensional statistic, such as a subset of variables, that is well fit by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining "background" components of the data, and perform standard Bayesian model selection for the choice of statistic. However, fitting a nonparametric model to high-dimensional data tends to be highly inefficient, statistically and computationally. We propose a novel score for performing data selection, the "Stein volume criterion (SVC)", that does not require fitting a nonparametric model. The SVC takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in place of the Kullback-Leibler divergence. We prove that the SVC is consistent for data selection, and establish consistency and asymptotic normality of the corresponding generalized posterior on parameters. We apply the SVC to the analysis of single-cell RNA sequencing data sets using probabilistic principal components analysis and a spin glass model of gene regulation.

8.
bioRxiv ; 2023 Jun 13.
Article in English | MEDLINE | ID: mdl-37066320

ABSTRACT

Assessing dynamic processes at single molecule scales is key to capturing life at the level of its molecular actors. Widefield superresolution methods, such as STORM, PALM, and PAINT, provide nanoscale localization accuracy, even when distances between fluorescently labeled single molecules ("emitters") fall below light's diffraction limit. However, as these superresolution methods rely on rare photophysical events to distinguish emitters from both each other and background, they are largely limited to static samples. In contrast, here we leverage spatiotemporal correlations of dynamic widefield imaging data to extend superresolution to simultaneous multiple emitter tracking without relying on photodynamics, even as emitter distances from one another fall below the diffraction limit. We simultaneously determine emitter numbers and their tracks (localization and linking) with the same localization accuracy per frame as widefield superresolution does for immobilized emitters under similar imaging conditions (≈50 nm). We demonstrate our results on both in cellulo data and, for benchmarking purposes, synthetic data. To this end, we avoid the existing tracking paradigm relying on completely or partially separating the tasks of emitter number determination, localization of each emitter, and linking emitter positions across frames. Instead, we develop a fully joint posterior distribution over the quantities of interest, including emitter tracks and their total, otherwise unknown, number within the Bayesian nonparametric paradigm. Our posterior quantifies the full uncertainty over emitter numbers and their associated tracks propagated from origins including shot noise and camera artefacts, pixelation, stochastic background, and out-of-focus motion. Finally, it remains accurate in more crowded regimes where alternative tracking tools cannot be applied.

9.
Cogn Sci ; 47(4): e13262, 2023 04.
Article in English | MEDLINE | ID: mdl-37051879

ABSTRACT

Humans can learn complex functional relationships between variables from small amounts of data. In doing so, they draw on prior expectations about the form of these relationships. In three experiments, we show that people learn to adjust these expectations through experience, learning about the likely forms of the functions they will encounter. Previous work has used Gaussian processes, a statistical framework that extends Bayesian nonparametric approaches to regression, to model human function learning. We build on this work, modeling the process of learning to learn functions as a form of hierarchical Bayesian inference about the Gaussian process hyperparameters.


Subject(s)
Learning, Psychological Models, Humans, Bayes Theorem, Normal Distribution
10.
Biometrics ; 79(4): 3140-3152, 2023 12.
Article in English | MEDLINE | ID: mdl-36745745

ABSTRACT

We propose a doubly robust approach to characterizing treatment effect heterogeneity in observational studies. We develop a frequentist inferential procedure that utilizes posterior distributions for both the propensity score and outcome regression models to provide valid inference on the conditional average treatment effect even when high-dimensional or nonparametric models are used. We show that our approach leads to conservative inference in finite samples or under model misspecification and provides a consistent variance estimator when both models are correctly specified. In simulations, we illustrate the utility of these results in difficult settings such as high-dimensional covariate spaces or highly flexible models for the propensity score and outcome regression. Lastly, we analyze environmental exposure data from NHANES to identify how the effects of these exposures vary by subject-level characteristics.


Subject(s)
Statistical Models, Treatment Effect Heterogeneity, Computer Simulation, Nutrition Surveys, Propensity Score
11.
Biometrics ; 79(4): 3252-3265, 2023 12.
Article in English | MEDLINE | ID: mdl-36718599

ABSTRACT

Analysis of observational studies increasingly confronts the challenge of determining which of a possibly high-dimensional set of available covariates are required to satisfy the assumption of ignorable treatment assignment for estimation of causal effects. We propose a Bayesian nonparametric approach that simultaneously (1) prioritizes inclusion of adjustment variables in accordance with existing principles of confounder selection; (2) estimates causal effects in a manner that permits complex relationships among confounders, exposures, and outcomes; and (3) provides causal estimates that account for uncertainty in the nature of confounding. The proposal relies on specification of multiple Bayesian additive regression trees models, linked together with a common prior distribution that accrues posterior selection probability to covariates on the basis of association with both the exposure and the outcome of interest. A set of extensive simulation studies demonstrates that the proposed method performs well relative to similarly-motivated methodologies in a variety of scenarios. We deploy the method to investigate the causal effect of emissions from coal-fired power plants on ambient air pollution concentrations, where the prospect of confounding due to local and regional meteorological factors introduces uncertainty around the confounding role of a high-dimensional set of measured variables. Ultimately, we show that the proposed method produces more efficient and more consistent results across adjacent years than alternative methods, lending strength to the evidence of the causal relationship between SO2 emissions and ambient particulate pollution.


Subject(s)
Air Pollution, Bayes Theorem, Air Pollution/adverse effects, Causality, Computer Simulation, Uncertainty
12.
Biostatistics ; 25(1): 220-236, 2023 12 15.
Article in English | MEDLINE | ID: mdl-36610075

ABSTRACT

Trial-level surrogates are useful tools for improving the speed and cost effectiveness of trials, but surrogates that have not been properly evaluated can cause misleading results. The evaluation procedure is often contextual and depends on the type of trial setting. There have been many proposed methods for trial-level surrogate evaluation, but none, to our knowledge, for the specific setting of platform studies. As platform studies are becoming more popular, methods for surrogate evaluation using them are needed. These studies also offer a rich data resource for surrogate evaluation that would not normally be possible. However, they also pose a set of statistical issues including heterogeneity of the study population, treatments, implementation, and even potentially the quality of the surrogate. We propose the use of a hierarchical Bayesian semiparametric model for the evaluation of potential surrogates using nonparametric priors for the distribution of true effects based on Dirichlet process mixtures. The motivation for this approach is to flexibly model relationships between the treatment effect on the surrogate and the treatment effect on the outcome and also to identify potential clusters with differential surrogate value in a data-driven manner so that treatment effects on the surrogate can be used to reliably predict treatment effects on the clinical outcome. In simulations, we find that our proposed method is superior to a simple, but fairly standard, hierarchical Bayesian method. We demonstrate how our method can be used in a simulated illustrative example (based on the ProBio trial), in which we are able to identify clusters where the surrogate is, and is not, useful. We plan to apply our method to the ProBio trial once it is completed.


Subject(s)
Clinical Trials as Topic, Humans, Bayes Theorem, Treatment Outcome
13.
J Theor Biol ; 558: 111351, 2023 02 07.
Article in English | MEDLINE | ID: mdl-36379231

ABSTRACT

Whether an outbreak of infectious disease is likely to grow or dissipate is determined through the time-varying reproduction number, Rt. Real-time or retrospective identification of changes in Rt following the imposition or relaxation of interventions can thus contribute important evidence about disease transmission dynamics which can inform policymaking. Here, we present a method for estimating shifts in Rt within a renewal model framework. Our method, which we call EpiCluster, is a Bayesian nonparametric model based on the Pitman-Yor process. We assume that Rt is piecewise-constant, and the incidence data and priors determine when or whether Rt should change and how many times it should do so throughout the series. We also introduce a prior which induces sparsity over the number of changepoints. Being Bayesian, our approach yields a measure of uncertainty in Rt and its changepoints. EpiCluster is fast, straightforward to use, and we demonstrate that it provides automated detection of rapid changes in transmission, either in real-time or retrospectively, for synthetic data series where the Rt profile is known. We illustrate the practical utility of our method by fitting it to case data of outbreaks of COVID-19 in Australia and Hong Kong, where it finds changepoints coinciding with the imposition of non-pharmaceutical interventions. Bayesian nonparametric methods, such as ours, allow the volume and complexity of the data to dictate the number of parameters required to approximate the process and should find wide application in epidemiology. This manuscript was submitted as part of a theme issue on "Modelling COVID-19 and Preparedness for Future Pandemics".


Subject(s)
COVID-19, Humans, Bayes Theorem, Retrospective Studies, COVID-19/epidemiology, Pandemics, Disease Outbreaks
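
Editorial illustration for entry 13: the abstract describes a renewal model with a piecewise-constant Rt. The following is a minimal simulation sketch of that setting only, not the EpiCluster method and not the authors' code. It assumes Poisson-distributed incidence and an invented serial-interval weight vector; the function name `simulate_renewal` and all numeric values are hypothetical.

```python
import numpy as np

def simulate_renewal(r_t, serial_weights, initial_cases, seed=0):
    """Simulate daily incidence under a simple renewal model (illustrative).

    Assumed model: I_t ~ Poisson(R_t * sum_s w_s * I_{t-s}), where w is a
    normalised serial-interval distribution. These choices are assumptions
    made for illustration; the abstract does not specify them.
    """
    rng = np.random.default_rng(seed)
    r_t = np.asarray(r_t, dtype=float)
    w = np.asarray(serial_weights, dtype=float)
    w = w / w.sum()  # normalise the weights so they sum to 1
    incidence = np.zeros(len(r_t), dtype=int)
    incidence[0] = initial_cases
    for t in range(1, len(r_t)):
        lags = min(t, len(w))
        # Most recent cases first, so past[k] pairs with w[k] (lag k + 1).
        past = incidence[t - lags:t][::-1]
        pressure = float(np.dot(w[:lags], past))
        incidence[t] = rng.poisson(r_t[t] * pressure)
    return incidence

# Piecewise-constant Rt with one changepoint (hypothetical values).
r_profile = np.concatenate([np.full(30, 1.5), np.full(30, 0.7)])
cases = simulate_renewal(r_profile, serial_weights=[0.2, 0.5, 0.3], initial_cases=10)
```

A changepoint method of the kind the abstract describes would work in the opposite direction, inferring the number and location of the jumps in `r_profile` from `cases`.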
14.
Stat Med ; 42(1): 33-51, 2023 01 15.
Article in English | MEDLINE | ID: mdl-36336460

ABSTRACT

In observational studies, causal inference relies on several key identifying assumptions. One identifiability condition is the positivity assumption, which requires the probability of treatment be bounded away from 0 and 1. That is, for every covariate combination, it should be possible to observe both treated and control subjects: the covariate distributions should overlap between treatment arms. If the positivity assumption is violated, population-level causal inference necessarily involves some extrapolation. Ideally, a greater amount of uncertainty about the causal effect estimate should be reflected in such situations. With that goal in mind, we construct a Gaussian process model for estimating treatment effects in the presence of practical violations of positivity. Advantages of our method include minimal distributional assumptions, a cohesive model for estimating treatment effects, and more uncertainty associated with areas in the covariate space where there is less overlap. We assess the performance of our approach with respect to bias and efficiency using simulation studies. The method is then applied to a study of critically ill female patients to examine the effect of undergoing right heart catheterization.


Subject(s)
Statistical Models, Humans, Female, Probability, Computer Simulation, Bias
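
Editorial illustration for entry 14: the positivity assumption stated in the abstract is commonly written as below. The notation is an assumption introduced here and does not come from the abstract: A is a binary treatment indicator, X the covariate vector with density f_X, and ε a small positive constant.

```latex
\[
  0 < \epsilon \;\le\; P(A = 1 \mid X = x) \;\le\; 1 - \epsilon < 1
  \qquad \text{for all } x \text{ such that } f_X(x) > 0 .
\]
```

A practical violation, in the sense used in the abstract, is a region of the covariate space where the estimated treatment probability is very close to 0 or 1, so that one treatment arm is rarely or never observed there.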
15.
Stat Med ; 42(3): 246-263, 2023 02 10.
Article in English | MEDLINE | ID: mdl-36433639

ABSTRACT

This paper introduces a nonparametric regression approach for univariate and multivariate skewed responses using Bayesian additive regression trees (BART). Existing BART methods use ensembles of decision trees to model a mean function, and have become popular recently due to their high prediction accuracy and ease of use. The usual assumption of a univariate Gaussian error distribution, however, is restrictive in many biomedical applications. Motivated by an oral health study, we provide a useful extension of BART, the skewBART model, to address this problem. We then extend skewBART to allow for multivariate responses, with information shared across the decision trees associated with different responses within the same subject. The methodology accommodates within-subject association, and allows varying skewness parameters for the varying multivariate responses. We illustrate the benefits of our multivariate skewBART proposal over existing alternatives via simulation studies and application to the oral health dataset with bivariate highly skewed responses. Our methodology is implementable via the R package skewBART, available on GitHub.


Subject(s)
Statistical Models, Humans, Bayes Theorem, Computer Simulation
16.
Biometrics ; 79(3): 2171-2183, 2023 09.
Article in English | MEDLINE | ID: mdl-36065934

ABSTRACT

Wildlife monitoring for open populations can be performed using a number of different survey methods. Each survey method gives rise to a type of data and, in the last five decades, a large number of associated statistical models have been developed for analyzing these data. Although these models have been parameterized and fitted using different approaches, they have all been designed to either model the pattern with which individuals enter and/or exit the population, or to estimate the population size by accounting for the corresponding observation process, or both. However, existing approaches rely on a predefined model structure and complexity, either by assuming that parameters linked to the entry and exit pattern (EEP) are specific to sampling occasions, or by employing parametric curves to describe the EEP. Instead, we propose a novel Bayesian nonparametric framework for modeling EEPs based on the Polya tree (PT) prior for densities. Our Bayesian nonparametric approach avoids overfitting when inferring EEPs, while simultaneously allowing more flexibility than is possible using parametric curves. Finally, we introduce the replicate PT prior for defining classes of models for these data allowing us to impose constraints on the EEPs, when required. We demonstrate our new approach using capture-recapture, count, and ring-recovery data for two different case studies.


Subject(s)
Wild Animals, Statistical Models, Humans, Animals, Bayes Theorem, Population Density
17.
Int J Biostat ; 2022 Dec 30.
Article in English | MEDLINE | ID: mdl-36584112

ABSTRACT

A major focus of causal inference is the estimation of heterogeneous average treatment effects (HTE) - average treatment effects within strata of another variable of interest such as levels of a biomarker, education, or age strata. Inference involves estimating a stratum-specific regression and integrating it over the distribution of confounders in that stratum - which itself must be estimated. Standard practice involves estimating these stratum-specific confounder distributions independently (e.g. via the empirical distribution or Rubin's Bayesian bootstrap), which becomes problematic for sparsely populated strata with few observed confounder vectors. In this paper, we develop a nonparametric hierarchical Bayesian bootstrap (HBB) prior over the stratum-specific confounder distributions for HTE estimation. The HBB partially pools the stratum-specific distributions, thereby allowing principled borrowing of confounder information across strata when sparsity is a concern. We show that posterior inference under the HBB can yield efficiency gains over standard marginalization approaches while avoiding strong parametric assumptions about the confounder distribution. We use our approach to estimate the adverse event risk of proton versus photon chemoradiotherapy across various cancer types.
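
Editorial illustration for entry 17: the abstract refers to Rubin's Bayesian bootstrap as the standard way to estimate a stratum-specific confounder distribution. The following minimal sketch shows that ordinary, non-hierarchical Bayesian bootstrap only; it is not the HBB proposed in the paper. Each posterior draw places Dirichlet(1, ..., 1) weights on the observed points and averages a quantity over them. The function name and the input values are hypothetical.

```python
import numpy as np

def bayesian_bootstrap_mean(values, n_draws=1000, seed=0):
    """Posterior draws of a mean under Rubin's Bayesian bootstrap.

    `values` stands in for stratum-specific regression predictions evaluated
    at each observed confounder vector (a hypothetical input).
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # One Dirichlet(1, ..., 1) weight vector per posterior draw.
    weights = rng.dirichlet(np.ones(len(values)), size=n_draws)
    # Each draw is a weighted average of the observed values.
    return weights @ values

# A sparse stratum with only four observations (invented numbers).
stratum_predictions = np.array([0.2, 0.5, 0.1, 0.4])
draws = bayesian_bootstrap_mean(stratum_predictions)
```

With so few observations, every draw is confined to weighted averages of the same four points, which is the sparsity problem the abstract raises as motivation for pooling information across strata.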

18.
Entropy (Basel) ; 24(12)2022 Nov 22.
Article in English | MEDLINE | ID: mdl-36554108

ABSTRACT

Hierarchical stochastic processes, such as the hierarchical Dirichlet process, hold an important position as a modelling tool in statistical machine learning, and are even used in deep neural networks. They allow, for instance, networks of probability vectors to be used in general statistical modelling, intrinsically supporting information sharing through the network. This paper presents a general theory of hierarchical stochastic processes and illustrates its use on the gamma process and the generalised gamma process. In general, most of the convenient properties of hierarchical Dirichlet processes extend to the broader family. The main construction for this corresponds to estimating the moments of an infinitely divisible distribution based on its cumulants. Various equivalences and relationships can then be applied to networks of hierarchical processes. The examples given demonstrate the duplication in non-parametric research and present plots of the Pitman-Yor distribution.

19.
Sensors (Basel) ; 22(23)2022 Dec 03.
Article in English | MEDLINE | ID: mdl-36502155

ABSTRACT

Wearable sensor data is relatively easily collected and provides direct measurements of movement that can be used to develop useful behavioral biomarkers. Sensitive and specific behavioral biomarkers for neurodegenerative diseases are critical to supporting early detection, drug development efforts, and targeted treatments. In this paper, we use autoregressive hidden Markov models and a time-frequency approach to create meaningful quantitative descriptions of behavioral characteristics of cerebellar ataxias from wearable inertial sensor data gathered during movement. We create a flexible and descriptive set of features derived from accelerometer and gyroscope data collected from wearable sensors worn while participants perform clinical assessment tasks, and use these data to estimate disease status and severity. A short period of data collection (<5 min) yields enough information to effectively separate patients with ataxia from healthy controls with very high accuracy, to separate ataxia from other neurodegenerative diseases such as Parkinson's disease, and to provide estimates of disease severity.


Subject(s)
Cerebellar Diseases, Parkinson Disease, Wearable Electronic Devices, Humans, Movement, Parkinson Disease/diagnosis, Ataxia
20.
Ann Appl Stat ; 16(4): 2626-2647, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36338823

ABSTRACT

Neuroradiologists and neurosurgeons increasingly opt to use functional magnetic resonance imaging (fMRI) to map functionally relevant brain regions for noninvasive presurgical planning and intraoperative neuronavigation. This application requires a high degree of spatial accuracy, but the fMRI signal-to-noise ratio (SNR) decreases as spatial resolution increases. In practice, fMRI scans can be collected at multiple spatial resolutions, and it is of interest to make more accurate inference on brain activity by combining data with different resolutions. To this end, we develop a new Bayesian model to leverage both better anatomical precision in high resolution fMRI and higher SNR in standard resolution fMRI. We assign a Gaussian process prior to the mean intensity function and develop an efficient, scalable posterior computation algorithm to integrate both sources of data. We draw posterior samples using an algorithm analogous to Riemann manifold Hamiltonian Monte Carlo in an expanded parameter space. We illustrate our method in analysis of presurgical fMRI data, and show in simulation that it infers the mean intensity more accurately than alternatives that use either the high or standard resolution fMRI data alone.
