ABSTRACT
KEY MESSAGE: Current genome-enabled prediction models assume normally distributed errors, which makes them sensitive to outliers. We propose a model with errors assumed to follow a Laplace distribution to deal better with outliers. Current genome-enabled prediction models use regressions that fit the expected value (mean) of a response variable with errors assumed to be normally distributed, and such models are often sensitive to outliers, either genetic or environmental. For this reason, we propose a robust Bayesian genome median regression (BGMR) model that fits regressions to the medians of a distribution, with errors assumed to follow a Laplace distribution to deal better with outliers. The BGMR model was evaluated under a Bayesian framework with Markov chain Monte Carlo sampling using a location-scale mixture representation of the Laplace distribution. The BGMR was implemented with two simulated and two real genomic data sets, and we compared its prediction performance with that of a conventional genomic best linear unbiased prediction (GBLUP) model and the Laplace maximum a posteriori (LMAP) method. The prediction accuracies of BGMR were higher than those of the GBLUP and LMAP methods when there were outliers. The BGMR model could be useful to breeders who need to predict and select genotypes based on data with unknown outliers.
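The mixture representation mentioned above can be illustrated with a minimal sketch. It assumes the symmetric (median) case, where a Laplace(0, b) error can be drawn as a normal variate whose variance is itself exponentially distributed with mean 2b². The scale `b`, the seed, and the toy data are hypothetical, and this is not the BGMR sampler itself.

```python
import math
import random
import statistics

def laplace_via_scale_mixture(b, rng):
    """Draw one Laplace(0, b) variate as a normal with a random variance.

    V ~ Exponential(mean = 2 * b**2), then e | V ~ Normal(0, V).
    """
    v = rng.expovariate(1.0 / (2.0 * b * b))  # expovariate takes the rate = 1 / mean
    return rng.gauss(0.0, math.sqrt(v))

rng = random.Random(1)
b = 1.5  # hypothetical Laplace scale parameter
n = 50_000
draws = [laplace_via_scale_mixture(b, rng) for _ in range(n)]

# Theory for Laplace(0, b): E|e| = b and Var(e) = 2 * b**2.
mean_abs = sum(abs(e) for e in draws) / n
variance = sum(e * e for e in draws) / n  # the mean is zero by construction

# The median (the maximum-likelihood location under Laplace errors) barely
# moves when a gross outlier is added, whereas the mean shifts visibly.
clean = [4.8, 5.1, 5.0, 4.9, 5.2]
contaminated = clean + [50.0]
median_contaminated = statistics.median(contaminated)
mean_contaminated = sum(contaminated) / len(contaminated)
```

In the toy data the single outlier moves the mean from 5.0 to 12.5 while the median only moves from 5.0 to 5.05, which is the kind of robustness a median regression aims for.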
Subject(s)
Breeding; Genome, Plant; Models, Theoretical; Plants/genetics; Bayes Theorem; Computer Simulation; Markov Chains; Monte Carlo Method; Regression Analysis
ABSTRACT
Deep learning (DL) is a promising method for genomic-enabled prediction. However, the implementation of DL is difficult because many hyperparameters (number of hidden layers, number of neurons, learning rate, number of epochs, batch size, etc.) need to be tuned. For this reason, deep kernel methods, which only require defining the number of layers, may be an attractive alternative. Deep kernel methods emulate DL models with a large number of neurons, but are defined by relatively easily computed covariance matrices. In this research, we compared the genome-based prediction of DL to a deep kernel (arc-cosine kernel, AK), to the commonly used non-additive Gaussian kernel (GK), as well as to the conventional additive genomic best linear unbiased predictor (GBLUP/GB). We used two real wheat data sets for benchmarking these methods. On average, AK and GK outperformed DL and GB. The gain in terms of prediction performance of AK and GK over DL and GB was not large, but AK and GK have the advantage that only one parameter, the number of layers (AK) or the bandwidth parameter (GK), has to be tuned in each method. Furthermore, although AK and GK had similar performance, deep kernel AK is easier to implement than GK, since the parameter "number of layers" is more easily determined than the bandwidth parameter of GK. Comparing AK and DL for the data set of year 2015-2016, the difference in performance of the two methods was larger, with AK predicting much better than DL. On these data, the optimization of the hyperparameters for DL was difficult and the parameters finally used may have been suboptimal. Our results suggest that AK is a good alternative to DL with the advantage that practically no tuning process is required.
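As a rough illustration of why AK needs only a layer count while GK needs a bandwidth, the two kernels can be sketched as follows. The sketch assumes the degree-1 arc-cosine recursion of Cho and Saul, starting from the plain dot product of two nonzero marker vectors. The function names, the unscaled dot product, and the toy vectors are assumptions, not details taken from the study.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def arc_cosine_kernel(x, y, n_layers=1):
    """Degree-1 arc-cosine kernel, composed over n_layers hidden layers.

    Assumes x and y are nonzero vectors of equal length.
    """
    kxx, kyy, kxy = dot(x, x), dot(y, y), dot(x, y)
    norm = math.sqrt(kxx * kyy)
    for _ in range(n_layers):
        # Angle implied by the kernel of the previous layer (clamped for safety).
        theta = math.acos(max(-1.0, min(1.0, kxy / norm)))
        kxy = norm / math.pi * (math.sin(theta) + (math.pi - theta) * math.cos(theta))
        # For degree 1 the diagonal entries kxx and kyy stay unchanged,
        # so norm is the same in every layer.
    return kxy

def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel exp(-bandwidth * squared Euclidean distance)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-bandwidth * d2)

# Hypothetical toy marker vectors.
x = [1.0, 0.0]
y = [0.0, 1.0]
k_one_layer = arc_cosine_kernel(x, y, n_layers=1)
k_two_layers = arc_cosine_kernel(x, y, n_layers=2)
```

The only tuning choice in `arc_cosine_kernel` is the integer `n_layers`, whereas `gaussian_kernel` requires a continuous `bandwidth`, which is typically chosen by a grid search or estimated from the data.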
ABSTRACT
Biology is characterized by complex interactions between phenotypes, such as recursive and simultaneous relationships between substrates and enzymes in biochemical systems. Structural equation models (SEMs) can be used to study such relationships in multivariate analyses, e.g., with multiple traits in a quantitative genetics context. Nonetheless, the number of different recursive causal structures that can be used for fitting a SEM to multivariate data can be huge, even when only a few traits are considered. In recent applications of SEMs in mixed-model quantitative genetics settings, causal structures were preselected on the basis of prior biological knowledge alone. Therefore, the wide range of possible causal structures has not been properly explored. Alternatively, causal structure spaces can be explored with algorithms that use data-driven evidence to search for structures that are compatible with the joint distribution of the variables under study. However, the search cannot be performed directly on the joint distribution of the phenotypes as it is possibly confounded by genetic covariance among traits. In this article we propose to search for recursive causal structures among phenotypes using the inductive causation (IC) algorithm after adjusting the data for genetic effects. A standard multiple-trait model is fitted using Bayesian methods to obtain a posterior covariance matrix of phenotypes conditional on unobservable additive genetic effects, which is then used as input for the IC algorithm. As an illustrative example, the proposed methodology was applied to simulated data related to multiple traits measured on a set of inbred lines.
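The first step of the IC algorithm removes the edge between two traits when they are conditionally independent given some set of other traits. A minimal sketch of that check, using a partial correlation computed from a covariance matrix, is shown below. The three-trait covariance matrix is a hypothetical stand-in for the genetic-effect-adjusted (residual) covariance described above, built from an assumed causal chain y1 → y2 → y3, and the function names are illustrative.

```python
import math

def correlation(cov, i, j):
    return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])

def partial_correlation(cov, i, j, k):
    """Correlation between traits i and j after conditioning on trait k."""
    r_ij = correlation(cov, i, j)
    r_ik = correlation(cov, i, k)
    r_jk = correlation(cov, j, k)
    return (r_ij - r_ik * r_jk) / math.sqrt((1.0 - r_ik**2) * (1.0 - r_jk**2))

# Hypothetical residual covariance implied by the chain
#   y2 = 0.8 * y1 + e2,  y3 = 0.5 * y2 + e3,  with unit variances for y1, e2, e3.
cov = [
    [1.00, 0.80, 0.40],
    [0.80, 1.64, 0.82],
    [0.40, 0.82, 1.41],
]

marginal_13 = correlation(cov, 0, 2)                # y1 and y3 are marginally correlated
partial_13_given_2 = partial_correlation(cov, 0, 2, 1)  # but independent given y2
```

Because the partial correlation between y1 and y3 given y2 vanishes, an IC-style search would drop the y1-y3 edge and keep y1-y2 and y2-y3. In practice the decision would rest on the posterior distribution of such partial correlations rather than on a single point value.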
Subject(s)
Algorithms; Factor Analysis, Statistical; Bayes Theorem; Humans; Multivariate Analysis; Phenotype
ABSTRACT
Dark spots in the fleece area are often associated with dark fibres in wool, which limits its competitiveness with other textile fibres. Field data from a sheep experiment in Uruguay revealed an excess number of zeros for dark spots. We compared the performance of four Poisson and zero-inflated Poisson (ZIP) models under four simulation scenarios. All models performed reasonably well under the same scenario for which the data were simulated. The deviance information criterion favoured a Poisson model with a residual, while the ZIP model with a residual gave estimates closer to their true values under all simulation scenarios. Both Poisson and ZIP models with an error term at the regression level performed better than their counterparts without such an error. Field data from Corriedale sheep were analysed with Poisson and ZIP models with residuals. Parameter estimates were similar for both models. Although the posterior distribution of the sire variance was skewed due to a small number of rams in the dataset, the median of this variance suggested a scope for genetic selection. The main environmental factor was the age of the sheep at shearing. In summary, age-related processes seem to drive the number of dark spots in this breed of sheep.
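For readers unfamiliar with the ZIP distribution, the following sketch shows how it adds probability mass at zero relative to a plain Poisson. It covers only the basic ZIP probability mass function with a fixed rate. The regression structure, sire effects, and residual term of the models compared above are not included, and the parameter values are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def zip_pmf(k, lam, pi_zero):
    """Zero-inflated Poisson: a structural zero with probability pi_zero,
    otherwise a Poisson(lam) count (which can also be zero)."""
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * math.exp(-lam)
    return (1.0 - pi_zero) * poisson_pmf(k, lam)

def zip_log_likelihood(counts, lam, pi_zero):
    return sum(math.log(zip_pmf(k, lam, pi_zero)) for k in counts)

lam, pi_zero = 2.0, 0.3  # hypothetical rate and zero-inflation probability
p_zero_poisson = poisson_pmf(0, lam)
p_zero_zip = zip_pmf(0, lam, pi_zero)
zip_mean = sum(k * zip_pmf(k, lam, pi_zero) for k in range(60))  # equals (1 - pi_zero) * lam
```

With these values the ZIP model puts roughly 0.39 of its mass at zero, against roughly 0.14 for the Poisson model with the same rate, which is the pattern an excess of zero counts would call for.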
Subject(s)
Aging/genetics; Pigments, Biological/genetics; Sheep, Domestic/growth & development; Sheep, Domestic/genetics; Wool/growth & development; Animals; Bayes Theorem; Computer Simulation; Likelihood Functions; Models, Statistical; Poisson Distribution; Regression Analysis; Uruguay
ABSTRACT
Molar content of guanine plus cytosine (G + C) and optimal growth temperature (OGT) are the main factors characterizing the frequency distribution of amino acids in prokaryotes. Previous work, using multivariate exploratory methods, has emphasized ascertainment of biological factors underlying variability between genomes, but the strength of each identified factor on amino acid content has not been quantified. We combine the flexibility of the phylogenetic mixed model (PMM) with the power of Bayesian inference via Markov chain Monte Carlo (MCMC) methods, to obtain a novel evolutionary picture of amino acid usage in prokaryotic genomes. We implement a Bayesian PMM which incorporates the feature that evolutionary history makes observed data interdependent. As in previous studies with PMM, we present a variance partition; however, attention is also given to the posterior distribution of "systematic effects" that may shed light on the relative importance of and relationships between evolutionary forces acting at the genomic level. In particular, we analyzed influences of G + C, OGT, and respiratory metabolism. Estimates of G + C effects were significant for amino acids coded by G + C or molar content of adenine plus thymine (A + T) in first and second bases. OGT had an important effect on 12 amino acids, probably reflecting complex patterns of protein modifications to cope with varying environments. The effect of respiratory metabolism was less clear, probably due to the already reported association of G + C with aerobic metabolism. A "heritability" parameter was always high and significant, reinforcing the importance of accommodating phylogenetic relationships in these analyses. "Heritable" component correlations displayed a pattern that tended to cluster "pure" G + C (A + T) in first and second codon positions, suggesting an inherited departure from linear regression on G + C.
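The variance partition and the "heritability" parameter can be read against the usual single-trait form of the phylogenetic mixed model, sketched below. The notation is an assumption rather than the authors' own: y holds the amino acid frequencies across genomes, X carries the systematic effects (G + C, OGT, respiratory metabolism), a is the phylogenetically inherited component with covariance proportional to a matrix A derived from the tree, and e is the residual. The component correlations reported above would require a multivariate extension that is not shown.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{X}\boldsymbol{\beta} + \mathbf{a} + \mathbf{e}, \\
\mathbf{a} &\sim N\!\left(\mathbf{0},\, \mathbf{A}\sigma^2_a\right), \qquad
\mathbf{e} \sim N\!\left(\mathbf{0},\, \mathbf{I}\sigma^2_e\right), \\
h^2 &= \frac{\sigma^2_a}{\sigma^2_a + \sigma^2_e}.
\end{aligned}
```

Under this reading, a value of h² close to 1 means that most of the variation left after fitting the systematic effects follows the phylogeny.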
Subject(s)
Amino Acids/genetics; Base Composition/genetics; Genome/genetics; Models, Genetic; Phylogeny; Prokaryotic Cells; Bayes Theorem; Codon/genetics; Markov Chains; Monte Carlo Method
ABSTRACT
The advent of molecular markers has created opportunities for a better understanding of quantitative inheritance and for developing novel strategies for genetic improvement of agricultural species, using information on quantitative trait loci (QTL). A QTL analysis relies on accurate genetic marker maps. At present, most statistical methods used for map construction ignore the fact that molecular data may be read with error. Often, however, there is ambiguity about some marker genotypes. A Bayesian MCMC approach for inferences about a genetic marker map when random miscoding of genotypes occurs is presented, and simulated and real data sets are analyzed. The results suggest that unless there is strong reason to believe that genotypes are ascertained without error, the proposed approach provides more reliable inference on the genetic map.
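A small calculation shows why ignoring miscoding can distort a map. The sketch assumes the simplest setting: a backcross-type population with two genotype classes per marker, and each genotype independently miscoded (flipped to the other class) with the same probability. This is an illustrative model, not the Bayesian MCMC approach of the paper, and the function names and numbers are hypothetical.

```python
def observed_recombination(r, error_rate):
    """Apparent recombination fraction between two markers when each
    genotype is flipped independently with probability error_rate."""
    # A pair changes recombinant status when exactly one of its two genotypes flips.
    q = 2.0 * error_rate * (1.0 - error_rate)
    return r * (1.0 - q) + (1.0 - r) * q

def corrected_recombination(r_obs, error_rate):
    """Invert observed_recombination for a known error rate."""
    q = 2.0 * error_rate * (1.0 - error_rate)
    return (r_obs - q) / (1.0 - 2.0 * q)

true_r = 0.10      # hypothetical true recombination fraction
error_rate = 0.02  # hypothetical per-genotype miscoding probability
r_obs = observed_recombination(true_r, error_rate)
r_back = corrected_recombination(r_obs, error_rate)
```

Here a 2% miscoding rate raises the apparent recombination fraction from 0.10 to about 0.131, so even modest error rates inflate estimated distances between closely linked markers when errors are ignored.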