Current genomic deep learning models display decreased performance in cell type specific accessible regions.

Kathail, Pooja; Shuai, Richard W; Chung, Ryan; Ye, Chun Jimmie; Loeb, Gabriel B; Ioannidis, Nilah M

Affiliation

Kathail P; Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA.
Shuai RW; Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA.
Chung R; Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA.
Ye CJ; Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA.
Loeb GB; Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA.
Ioannidis NM; Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA.

bioRxiv; 2024 Jul 10.

Article in En | MEDLINE | ID: mdl-39026761

ABSTRACT

Background:

A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis-regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type specific CREs contain a large proportion of complex disease heritability.

Results:

We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks), and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models - Enformer and Sei - varies across the genome and is reduced in cell type specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type specific regulatory syntax - through single-task learning or high capacity multi-task models - can improve performance in cell type specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants.
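As an illustration of what evaluating a model "in regions with varying degrees of cell type specificity" can look like in practice, the sketch below groups regions by the number of cell types in which they are accessible and reports a Pearson correlation between observed and predicted accessibility within each group. This is a minimal sketch and not the authors' pipeline: the function names, the use of an accessible-cell-type count as the specificity measure, and the toy arrays are all assumptions made for illustration.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation of two arrays after flattening; NaN if either is constant."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else float("nan")


def stratified_pearson(observed, predicted, is_accessible):
    """Correlate observed and predicted signal within groups of regions.

    observed, predicted: arrays of shape (n_regions, n_cell_types).
    is_accessible: boolean array of the same shape (hypothetical peak calls).
    Regions are grouped by how many cell types they are accessible in
    (an assumed proxy for cell type specificity: lower count = more specific).
    Returns a dict mapping that count to the Pearson correlation in the group.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n_accessible = np.asarray(is_accessible, dtype=bool).sum(axis=1)
    results = {}
    for k in np.unique(n_accessible):
        mask = n_accessible == k
        results[int(k)] = pearson(observed[mask], predicted[mask])
    return results


# Toy data (made up): 4 regions x 2 cell types.
toy_observed = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [1.0, 1.5]])
toy_predicted = np.array([[0.8, 0.1], [0.3, 1.5], [2.9, 3.2], [1.1, 1.4]])
toy_peaks = toy_observed > 0  # assumed peak-calling rule for the toy example
print(stratified_pearson(toy_observed, toy_predicted, toy_peaks))
```

Comparing the per-group values, rather than a single genome-wide correlation, is what exposes a drop in accuracy for regions accessible in only one or a few cell types.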

Conclusions:

Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type specific accessible regions. We also identify strategies to maximize performance in cell type specific accessible regions.

Keywords

Chromatin Accessibility; Deep Learning; Variant Effect Prediction

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: BioRxiv Publication year: 2024 Document type: Article Affiliation country: United States Publication country: United States
