Ense-i6mA: Identification of DNA N6-methyl-adenine Sites Using XGB-RFE Feature Selection and Ensemble Machine Learning.
IEEE/ACM Trans Comput Biol Bioinform; PP, 2024 Jul 01.
Article | En | MEDLINE | ID: mdl-38949938
ABSTRACT
DNA N6-methyladenine (6mA) is an important epigenetic modification that plays a vital role in various cellular processes. Accurate identification of 6mA sites is fundamental to elucidating the biological functions and mechanisms of this modification. However, experimental methods for detecting 6mA sites are costly and time-consuming. In this study, we propose a novel computational method, called Ense-i6mA, to predict 6mA sites. First, five encoding schemes, i.e., one-hot encoding, gcContent, Z-Curve, K-mer nucleotide frequency, and K-mer nucleotide frequency with gap, are employed to extract DNA sequence features. Second, eXtreme Gradient Boosting coupled with recursive feature elimination (XGB-RFE) is applied, to our knowledge for the first time in 6mA site prediction, to remove noisy features, which helps avoid over-fitting and reduces computing time and complexity. Then, the best subset of features is fed into base classifiers composed of Extra Trees, eXtreme Gradient Boosting, Light Gradient Boosting Machine, and Support Vector Machine. Finally, to minimize generalization error, the prediction probabilities of the base classifiers are averaged to infer the final 6mA site predictions. We conduct experiments on two species, i.e., Arabidopsis thaliana and Drosophila melanogaster, to compare the performance of Ense-i6mA against recent 6mA site prediction methods. The experimental results demonstrate that Ense-i6mA achieves area under the receiver operating characteristic curve values of 0.967 and 0.968, accuracies of 91.4% and 92.0%, and Matthews correlation coefficient values of 0.829 and 0.842 on the two benchmark datasets, respectively, and outperforms several existing state-of-the-art methods.
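The feature encodings and the probability-averaging step named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of k = 2, the gap of 1, the length-normalised end-point form of the Z-Curve, and the 0.5 decision threshold are all assumptions made for illustration, and the XGB-RFE selection step and the four base classifiers are omitted.

```python
from itertools import product

ALPHABET = "ACGT"


def one_hot(seq):
    """Flattened one-hot encoding: four binary values per nucleotide."""
    return [1.0 if base == ref else 0.0 for base in seq for ref in ALPHABET]


def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def z_curve(seq):
    """Z-curve end-point coordinates, normalised by length (assumed variant)."""
    a, c, g, t = (seq.count(base) for base in ALPHABET)
    n = len(seq)
    # x: purine vs pyrimidine, y: amino vs keto, z: weak vs strong H-bond
    return [(a + g - c - t) / n, (a + c - g - t) / n, (a + t - c - g) / n]


def kmer_frequency(seq, k=2):
    """Relative frequency of every possible k-mer, in lexicographic order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [windows.count(kmer) / len(windows) for kmer in kmers]


def gapped_pair_frequency(seq, gap=1):
    """Relative frequency of nucleotide pairs separated by `gap` positions."""
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]
    observed = [seq[i] + seq[i + gap + 1] for i in range(len(seq) - gap - 1)]
    return [observed.count(pair) / len(observed) for pair in pairs]


def encode(seq):
    """Concatenate the five encodings into one feature vector."""
    return (one_hot(seq) + [gc_content(seq)] + z_curve(seq)
            + kmer_frequency(seq) + gapped_pair_frequency(seq))


def average_probabilities(per_model_probs):
    """Average the positive-class probabilities of several classifiers.

    `per_model_probs` holds one list per classifier, each with one
    probability per sample.
    """
    n_models = len(per_model_probs)
    return [sum(probs) / n_models for probs in zip(*per_model_probs)]


def predict_labels(per_model_probs, threshold=0.5):
    """Label a site as 6mA when the averaged probability reaches the threshold."""
    return [int(p >= threshold) for p in average_probabilities(per_model_probs)]
```

In a fuller pipeline, the vectors returned by `encode` would pass through a feature selection step before training, and `per_model_probs` would hold the outputs of the trained base classifiers on the same samples.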
Collection: 01-internacional
Database: MEDLINE
Language: En
Journal: ACM Trans Comput Biol Bioinform
Journal subject: BIOLOGY / MEDICAL INFORMATICS
Year: 2024
Document type: Article
Country of publication: United States