KEGG orthology prediction of bacterial proteins using natural language processing.

Chen, Jing; Wu, Haoyu; Wang, Ning

Affiliation

Chen J; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China.
Wu H; Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computing Intelligence, Jiangnan University, Wuxi, China.
Wang N; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China.

BMC Bioinformatics ; 25(1): 146, 2024 Apr 11.

Article in English | MEDLINE | ID: mdl-38600441

ABSTRACT

BACKGROUND:

The advent of high-throughput technologies has led to an exponential increase in uncharacterized bacterial protein sequences, surpassing the capacity of manual curation. A large number of bacterial protein sequences remain unannotated by Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology, which makes automated annotation tools necessary. These tools are now indispensable in biological research, bridging the gap between the vast number of unannotated sequences and meaningful biological insights.

RESULTS:

In this work, we propose a novel pipeline for KEGG orthology annotation of bacterial protein sequences that uses natural language processing and deep learning. To assess its effectiveness, we evaluated the pipeline on the genomes of two randomly selected species from the KEGG database. It achieved competitive precision, recall, and F1 score of 0.948, 0.947, and 0.947, respectively.
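
As an illustration of how precision, recall, and F1 score can be computed for a multi-class labeling task such as KEGG orthology (KO) assignment, the following is a minimal sketch. The abstract does not state which averaging scheme was used, so macro-averaging here is an assumption, and the function name `macro_prf` and the toy KO labels are hypothetical, not taken from the paper.

```python
from collections import Counter


def macro_prf(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all labels.

    Assumption: each protein has exactly one true label and one
    predicted label; the paper's actual evaluation protocol may differ.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct assignment for label t
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[t] += 1   # t was missed

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        prec = tp[label] / predicted if predicted else 0.0
        rec = tp[label] / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with made-up assignments for four proteins.
y_true = ["K00001", "K00001", "K00002", "K00003"]
y_pred = ["K00001", "K00002", "K00002", "K00003"]
precision, recall, f1 = macro_prf(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under macro-averaging every KO label counts equally regardless of how many proteins carry it, which is one common choice when rare orthologs matter; a micro- or weighted average would weight frequent labels more heavily.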

CONCLUSIONS:

Our experimental results suggest that the pipeline performs comparably to traditional methods and excels at identifying distant relatives with low sequence identity. This indicates its potential to significantly improve the accuracy and comprehensiveness of KEGG orthology annotation, thereby advancing our understanding of functional relationships within biological systems.

Subject(s)

Bacterial Proteins; Natural Language Processing; Genome; Molecular Sequence Annotation; Amino Acid Sequence

Keywords

Deep learning; KEGG orthology; Protein function prediction; Protein language model

Full text: 1; Collection: 01-international; Database: MEDLINE; Main subject: Bacterial Proteins / Natural Language Processing; Language: English; Journal: BMC Bioinformatics; Journal subject: Medical Informatics; Year: 2024; Document type: Article; Country of affiliation: China; Country of publication: United Kingdom