RESUMEN
Dimension reduction tools preserving similarity and graph structure such as t-SNE and UMAP can capture complex biological patterns in high-dimensional data. However, these tools typically are not designed to separate effects of interest from unwanted effects due to confounders. We introduce the partial embedding (PARE) framework, which enables removal of confounders from any distance-based dimension reduction method. We then develop partial t-SNE and partial UMAP and apply these methods to genomic and neuroimaging data. Our results show that the PARE framework can remove batch effects in single-cell sequencing data as well as separate clinical and technical variability in neuroimaging measures. We demonstrate that the PARE framework extends dimension reduction methods to highlight biological patterns of interest while effectively removing confounding effects.
RESUMEN
BACKGROUND: A common problem in neurophysiological signal processing is the extraction of meaningful information from high dimension, low sample size data (HDLSS). We present RoLDSIS (regression on low-dimension spanned input space), a regression technique based on dimensionality reduction that constrains the solution to the subspace spanned by the available observations. This avoids regularization parameters in the regression procedure, as needed in shrinkage regression methods. RESULTS: We applied RoLDSIS to the EEG data collected in a phonemic identification experiment. In the experiment, morphed syllables in the continuum /da/-/ta/ were presented as acoustic stimuli to the participants and the event-related potentials (ERP) were recorded and then represented as a set of features in the time-frequency domain via the discrete wavelet transform. Each set of stimuli was chosen from a preliminary identification task executed by the participant. Physical and psychophysical attributes were associated to each stimulus. RoLDSIS was then used to infer the neurophysiological axes, in the feature space, associated with each attribute. We show that these axes can be reliably estimated and that their separation is correlated with the individual strength of phonemic categorization. The results provided by RoLDSIS are interpretable in the time-frequency domain and may be used to infer the neurophysiological correlates of phonemic categorization. A comparison with commonly used regularized regression techniques was carried out by cross-validation. CONCLUSION: The prediction errors obtained by RoLDSIS are comparable to those obtained with Ridge Regression and smaller than those obtained with LASSO and SPLS. However, RoLDSIS achieves this without the need for cross-validation, a procedure that requires the extraction of a large amount of observations from the data and, consequently, a decreased signal-to-noise ratio when averaging trials. We show that, even though RoLDSIS is a simple technique, it is suitable for the processing and interpretation of neurophysiological signals.
Asunto(s)
Encéfalo/fisiología , Electroencefalografía/métodos , Modelos Teóricos , Procesamiento de Señales Asistido por Computador , Potenciales Evocados/fisiología , Humanos , Tamaño de la MuestraRESUMEN
ABSTRACT According to the features of texts, a text classification model is proposed. Base on this model, an optimized objective function is designed by utilizing the occurrence frequency of each feature in each category. According to the relation matrix oftext resource and features, an improved genetic algorithm is adopted for solution with integral matrix crossover, transposition and recombination of entire population. At last the sample date of manufacturing text information from professional resources database system is taken as an example to illustrate the proposed model and solution for feature dimension reduction and text classification. The crossover and mutation probabilities of algorithm are compared vertically and horizontally to determine a group of better parameters. The experiment results show that the proposed method is fast and effective.