Heterogeneous biomedical entity representation learning for gene-disease association prediction.

Meng, Zhaohan; Liu, Siwei; Liang, Shangsong; Jani, Bhautesh; Meng, Zaiqiao

Affiliation

Meng Z; School of Computing Science, University of Glasgow, 18 Lilybank Gardens, Glasgow G12 8RZ, UK.
Liu S; School of Natural and Computing Science, University of Aberdeen King's College, Aberdeen, AB24 3FX, UK.
Liang S; Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence, Building 1B, Masdar City, Abu Dhabi 000000, UAE.
Jani B; School of Computing Science, University of Glasgow, 18 Lilybank Gardens, Glasgow G12 8RZ, UK.
Meng Z; School of Computing Science, University of Glasgow, 18 Lilybank Gardens, Glasgow G12 8RZ, UK.

Brief Bioinform; 25(5), 2024 Jul 25.

Article in English | MEDLINE | ID: mdl-39154194

ABSTRACT

Understanding the genetic basis of disease is a fundamental aspect of medical research, as genes are the classic units of heredity and play a crucial role in biological function. Identifying associations between genes and diseases is critical for diagnosis, prevention, prognosis, and drug development. Genes that encode proteins with similar sequences are often implicated in related diseases, as proteins causing identical or similar diseases tend to show limited variation in their sequences. Identifying gene-disease associations (GDAs) experimentally is time-consuming and expensive, given the large number of potential candidate genes. Although traditional machine learning algorithms and graph neural networks have been proposed for GDA prediction, these approaches struggle to capture the deep semantic information within genes and diseases and remain dependent on training data. To alleviate these issues, we propose a novel GDA prediction model named FusionGDA, which utilizes a pre-training phase with a fusion module to enrich the gene and disease semantic representations encoded by pre-trained language models. The fusion module generates multi-modal representations that carry rich semantic information about two heterogeneous biomedical entities: protein sequences and disease descriptions. Subsequently, a pooling aggregation strategy is adopted to compress the dimensions of the multi-modal representation. In addition, FusionGDA employs a pre-training phase leveraging a contrastive learning loss to extract potential gene and disease features by training on a large public GDA dataset. To rigorously evaluate the effectiveness of the FusionGDA model, we conduct comprehensive experiments on five datasets and compare our proposed model with five competitive baseline models on the DisGeNet-Eval dataset. Notably, our case study further demonstrates the ability of FusionGDA to discover hidden associations effectively.
The complete code and datasets of our experiments are available at https://github.com/ZhaohanM/FusionGDA.
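
The abstract names two generic building blocks, pooling aggregation and a contrastive learning loss, without giving their exact form. The sketch below is one common way to realise them and is not taken from the FusionGDA repository: it assumes mean pooling over token embeddings and an in-batch InfoNCE-style loss, and the function names (`mean_pool`, `info_nce`) and the `temperature` value are hypothetical choices made for illustration only.

```python
import numpy as np

def mean_pool(token_embeddings, mask):
    """Average token vectors over non-padding positions.

    token_embeddings: (batch, seq_len, dim); mask: (batch, seq_len) of 0/1.
    Assumption: mean pooling stands in for the paper's pooling aggregation.
    """
    mask = mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def info_nce(gene_vecs, disease_vecs, temperature=0.1):
    """In-batch contrastive loss (assumed form, not the authors' exact loss).

    Row i of each matrix is treated as a known gene-disease pair; every
    other disease in the batch serves as a negative for gene i.
    """
    g = l2_normalize(gene_vecs)
    d = l2_normalize(disease_vecs)
    logits = g @ d.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Under these assumptions, the loss falls as each pooled gene vector moves closer to its paired disease vector than to the other diseases in the batch, which is the general mechanism by which contrastive pre-training on known associations shapes the two embedding spaces.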

Subject(s)

Machine Learning; Humans; Computational Biology/methods; Genetic Predisposition to Disease; Semantics; Algorithms; Genetic Association Studies; Neural Networks, Computer

Keywords

contrastive learning; disease; fusion module; gene; pre-trained language model; pre-training

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Machine Learning | Limit: Humans | Language: En | Journal: Brief Bioinform | Journal subject: BIOLOGY / MEDICAL INFORMATICS | Year: 2024 | Document type: Article | Country of publication: United Kingdom
