CDBProm: the Comprehensive Directory of Bacterial Promoters.
NAR Genom Bioinform
; 6(1): lqae018, 2024 Mar.
Article
en En
| MEDLINE
| ID: mdl-38385146
ABSTRACT
The decreasing cost of whole genome sequencing has produced high volumes of genomic information that require annotation. The experimental identification of promoter sequences, pivotal for regulating gene expression, is a laborious and cost-prohibitive task. To expedite this, we introduce the Comprehensive Directory of Bacterial Promoters (CDBProm), a directory of in-silico predicted bacterial promoter sequences. We first identified that an Extreme Gradient Boosting (XGBoost) algorithm would distinguish promoters from random downstream regions with an accuracy of 87%. To capture distinctive promoter signals, we generated a second XGBoost classifier trained on the instances misclassified in our first classifier. The predictor of CDBProm is then fed with over 55 million upstream regions from more than 6000 bacterial genomes. Upon finding potential promoter sequences in upstream regions, each promoter is mapped to the genomic data of the organism, linking the predicted promoter with its coding DNA sequence, and identifying the function of the gene regulated by the promoter. The collection of bacterial promoters available in CDBProm enables the quantitative analysis of a plethora of bacterial promoters. Our collection with over 24 million promoters is publicly available at https//aw.iimas.unam.mx/cdbprom/.
Texto completo:
1
Colección:
01-internacional
Base de datos:
MEDLINE
Idioma:
En
Revista:
NAR Genom Bioinform
Año:
2024
Tipo del documento:
Article
País de afiliación:
Canadá
Pais de publicación:
Reino Unido