Mapping medical image-text to a joint space via masked modeling.
Chen, Zhihong; Du, Yuhao; Hu, Jinpeng; Liu, Yang; Li, Guanbin; Wan, Xiang; Chang, Tsung-Hui.
Affiliation
  • Chen Z; The Chinese University of Hong Kong, Shenzhen, 518172, China; Shenzhen Research Institute of Big Data, Shenzhen, 518172, China.
  • Du Y; The Chinese University of Hong Kong, Shenzhen, 518172, China; Shenzhen Research Institute of Big Data, Shenzhen, 518172, China.
  • Hu J; The Chinese University of Hong Kong, Shenzhen, 518172, China; Shenzhen Research Institute of Big Data, Shenzhen, 518172, China.
  • Liu Y; The Chinese University of Hong Kong, Shenzhen, 518172, China; Shenzhen Research Institute of Big Data, Shenzhen, 518172, China.
  • Li G; Sun Yat-sen University, Guangzhou, 510275, China. Electronic address: liguanbin@mail.sysu.edu.cn.
  • Wan X; The Chinese University of Hong Kong, Shenzhen, 518172, China; Shenzhen Research Institute of Big Data, Shenzhen, 518172, China. Electronic address: wanxiang@sribd.cn.
  • Chang TH; The Chinese University of Hong Kong, Shenzhen, 518172, China; Shenzhen Research Institute of Big Data, Shenzhen, 518172, China.
Med Image Anal; 91: 103018, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37976867
ABSTRACT
Recently, masked autoencoders have demonstrated the feasibility of extracting effective image and text features (e.g., BERT for natural language processing (NLP) and MAE in computer vision (CV)). This study investigates the potential of applying these techniques to vision-and-language representation learning in the medical domain. To this end, we introduce a self-supervised learning paradigm, multi-modal masked autoencoders (M3AE), which learns to map medical images and texts to a joint space by reconstructing pixels and tokens from randomly masked images and texts. Specifically, we design this approach from three aspects: first, taking into account the differing information densities of vision and language, we employ distinct masking ratios for input images and text, with a notably higher ratio for images; second, we use visual and textual features from different layers for reconstruction, to account for the different levels of abstraction in vision and language; third, we develop separate decoder designs for vision and language. We establish a medical vision-and-language benchmark to conduct an extensive evaluation. Our experimental results demonstrate the effectiveness of the proposed method, which achieves state-of-the-art results on all downstream tasks. Further analyses validate the contributions of the individual components and discuss the limitations of the approach. The source code is available at https://github.com/zhjohnchan/M3AE.
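To make the three design choices in the abstract concrete, below is a minimal, illustrative PyTorch sketch of a multi-modal masked autoencoder. All module names, dimensions, and the specific masking ratios are assumptions for illustration and are not taken from the authors' released code; the sketch is also simplified in that it decodes only the visible tokens, whereas a full implementation would reinsert learned mask tokens at the masked positions and compute a reconstruction loss there.

```python
# Illustrative sketch (assumed, not the authors' implementation): mask image
# patches and text tokens at different ratios, encode both modalities jointly,
# and decode each modality with its own head.
import torch
import torch.nn as nn


def random_mask(x: torch.Tensor, ratio: float):
    """Randomly drop a `ratio` fraction of tokens; return kept tokens."""
    batch, length, dim = x.shape
    num_keep = int(length * (1 - ratio))
    noise = torch.rand(batch, length, device=x.device)
    ids_keep = noise.argsort(dim=1)[:, :num_keep]  # random subset per sample
    return torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))


class ToyM3AE(nn.Module):
    def __init__(self, dim=128, image_mask_ratio=0.75, text_mask_ratio=0.15):
        super().__init__()
        # Design choice 1: a much higher masking ratio for images than text,
        # reflecting the lower information density of pixels (ratios assumed).
        self.image_mask_ratio = image_mask_ratio
        self.text_mask_ratio = text_mask_ratio
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=4,
        )
        # Design choice 3: separate decoders per modality, e.g. a small
        # transformer for pixel reconstruction and a linear head for tokens.
        self.image_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.text_decoder = nn.Linear(dim, dim)

    def forward(self, image_patches, text_tokens):
        img_kept = random_mask(image_patches, self.image_mask_ratio)
        txt_kept = random_mask(text_tokens, self.text_mask_ratio)
        # Joint encoding of the visible tokens from both modalities maps
        # them into a shared representation space.
        joint = self.encoder(torch.cat([img_kept, txt_kept], dim=1))
        n_img = img_kept.size(1)
        img_recon = self.image_decoder(joint[:, :n_img])
        txt_recon = self.text_decoder(joint[:, n_img:])
        return img_recon, txt_recon


# Usage: pre-embedded image patches and text tokens of matching width.
model = ToyM3AE()
images = torch.randn(2, 196, 128)  # e.g. 14x14 patch embeddings
texts = torch.randn(2, 32, 128)    # e.g. 32 token embeddings
img_out, txt_out = model(images, texts)
print(img_out.shape, txt_out.shape)
```

Design choice 2 (drawing reconstruction targets from different encoder layers for each modality) is omitted here for brevity; it would amount to decoding from an intermediate layer's output for one modality and the final layer's output for the other.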

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Benchmarking / Language Limits: Humans Language: English Journal: Med Image Anal Journal subject: Diagnostic Imaging Year: 2024 Document type: Article Country of affiliation: China Country of publication: Netherlands