Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning

Chong Ma, Hanqi Jiang, Wenting Chen, Yiwei Li, Zihao Wu, Xiaowei Yu, Zhengliang Liu, Lei Guo, Dajiang Zhu, Tuo Zhang, Dinggang Shen , Tianming Liu , Xiang Li C. Ma, L. Guo, and T. Zhang are with the School of Automation, Northwestern Polytechnical University, Xi’an, 710072, China. (e-mail: [email protected], {lguo, tuozhang}@nwpu.edu.cn).W. Chen is with the Department of Electronic Engineering, City University of Hong Kong, Hong Kong SAR, China (e-mail: [email protected]).D. Zhu and X. Yu are with the Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington 76019, USA, (e-mail: [email protected], [email protected]).D. Shen is with the School of Biomedical Engineering, ShanghaiTech University, Shanghai 201210, China, and Department of Research and Development, Shanghai United Imaging Intelligence Co., Ltd., Shanghai 200030, China, and also with Shanghai Clinical Research and Trial Center, Shanghai, 201210, China. (e-mail: [email protected]).Z. Wu, H. Jiang, Y. Li, Z. Liu, and T. Liu are with the school of computing, University of Georgia, Athens, GA 30602, USA. (e-mail: {zw63397, hj67104, yl80817, zl18864, tliu}@uga.edu).X. Li is with the Department of Radiology, Massachusetts General Hospital, Boston 02114, USA, (e-mail: [email protected]).
Abstract

In the medical multi-modal frameworks, the alignment of cross-modality features presents a significant challenge. However, existing works have learned features that are implicitly aligned from the data, without considering the explicit relationships in the medical context. This data-reliance may lead to low generalization of the learned alignment relationships. In this work, we propose the Eye-gaze Guided Multi-modal Alignment (EGMA) framework to harness eye-gaze data for better alignment of medical visual and textual features. We explore the natural auxiliary role of radiologists’ eye-gaze data in aligning medical images and text, and introduce a novel approach by using eye-gaze data, collected synchronously by radiologists during diagnostic evaluations. We conduct downstream tasks of image classification and image-text retrieval on four medical datasets, where EGMA achieved state-of-the-art performance and stronger generalization across different datasets. Additionally, we explore the impact of varying amounts of eye-gaze data on model performance, highlighting the feasibility and utility of integrating this auxiliary data into multi-modal alignment framework.

Index Terms:
Medical Multi-modal Alignment, Eye-gaze, Radiology.

I Introduction

With the development of multi-modal learning, pre-trained models can now utilize large amounts of paired multi-modal data, such as image-text pairs, audio-text pairs, etc., to optimize the multi-modal feature extraction and alignment capabilities. With the emergence of the CLIP�[1] model, contrastive learning has become the prominent framework of multi-modal learning. The advantage of this framework lies in its simplicity of structure and it does not require sample-level annotations. However, the main drawback is its heavy reliance on the scale of training data. Subsequent works have optimized this framework by leveraging potential auxiliary information between image and text data. For instance, GLIP�[2] and RegionCLIP�[3] utilized pre-predicted annotation information to perform fine-grained region-level pre-training. They introduced detection networks firstly to predict image regions relevant to the text prompt, and then trained the model to align these image regions with their corresponding text descriptions. However, these models heavily rely on the performance of the ROI detector and have high computational complexity. FILIP�[4] proposed a refined multi-modal alignment operation after the encoder, relying solely on image patches and text tokens. Although this further explores the local feature relationships between multi-modal data, it still requires sufficient data support. When training on small-scale datasets, especially in the medical field, accurately learning alignment features between modalities becomes more challenging�[5, 6].

To address the scarcity of medical data, studies�[7, 8] have introduced self-supervised training into the CLIP framework to further enhance encoder performance. Additionally, weak labels between images and texts have been incorporated during pre-training to aid multi-modal alignment�[9]. Some studies�[10, 11] utilized fine-grained alignment between chest image patches and text tokens for pre-training�[4]. However, unlike natural images and text, the relationship between medical images and diagnostic text is often more complex and challenging to learn. Moreover, with insufficient data, models are prone to learning shortcut features unrelated to disease diagnosis, resulting in poor generalization ability�[12, 13, 14]. Therefore, it is crucial to learn useful alignment information from relatively limited medical multi-model datasets.

Refer to caption
Figure 1: The guiding role of radiologists’ eye-gaze data. The text provided by radiologists during diagnosis aligns naturally with the attention regions.

In this study, we fully explore the auxiliary role of eye-gaze data from radiologists in multi-model alignment. Eye-gaze data can intuitively reflect the image regions radiologists focus on, providing insights into their cognitive behavior during diagnosis [15]. Therefore, compared to refined annotations like bounding boxes and masks, eye-gaze data can also provide useful auxiliary information for the model [16, 17, 14]. Moreover, collecting eye-gaze data from radiologists during the diagnostic process is more time-efficient than annotating bounding boxes and masks [18, 14]. For the multi-modal medical dataset, EYE GAZE [19] and REFLACX [20] collected eye-gaze data from radiologists while diagnosing chest X-rays. Additionally, these datasets recorded synchronized voice data, where radiologists verbalized their diagnoses while observing the images. As shown in Fig. 1, we found that the radiologists’ attention regions on the image naturally align with the diagnostic text over time. Therefore, we believe this type of eye-gaze data can provide expert prior knowledge for training the alignment between medical visual and textual features. Thus, considering the utilization of eye-gaze data to assist in multi-modal model training, we propose the Eye-gaze Guided Multi-modal Alignment framework (EGMA). Our model first segments the transcribed text into individual sentences and obtains radiologists’ attention heatmaps. Subsequently, we obtain encoded features of image patches and sentences through image and text encoders, generating instance-level similarity matrix. Then, we compute the loss between this matrix and the attention heatmaps, integrating refined feature representations for subsequent contrastive loss. To further leverage the assisting role of eye-gaze data in aligning images and texts, we combine the eye-gaze heatmaps with the similarity matrix derived from model, serving as weights to calculate cross-modality mapping loss. Experimental results on zero-shot classification and retrieval tasks reveal that our framework surpasses other leading methods in performance across diverse datasets and under multiple dataset size scenarios. Specifically, the EGMA framework yielded a remarkable 3.9% improvement in image-to-text matching tasks and an impressive 19.75% increase in text-to-image matching tasks. These results underscore the cutting-edge and efficacious nature of our approach, highlighting its substantial advancements over existing methodologies. We also explore the auxiliary effect of using eye-gaze data of different scales on the model, finding that even a small portion of eye-gaze data can enhance the model’s multi-modal processing capability. Moreover, the fine-tuned classification results of EGMA achieved the best performance across multiple datasets. The code of this work is available on 111https://github.com/Momarky/EGMA.

In summary, the main contributions of this work are as follows:

  • We propose EGMA, a novel framework for medical multi-modal alignment, marking the first attempt to integrate eye-gaze data into vision-language pre-training.

  • EGMA outperforms existing state-of-the-art medical multi-modal pre-training methods, and realizes notable enhancements in image classification and image-text retrieval tasks.

  • EGMA demonstrates that even a small amount of eye-gaze data can effectively assist in multi-modal pre-training and improve the feature representation ability of the model.

II Related Works

II-A Medical Vision-language Pre-training (Med-VLP)

In the pursuit of Artificial General Intelligence (AGI), Vision-language Pre-training (VLP) has emerged as a pivotal area in AI research.

The advent of the transformer architecture [21] has not only initiated a new chapter in the integration of vision and language but has also significantly accelerated the progress in the multi-modal domain. During this phase, VLP frameworks predominantly focused on the development of fusion encoders. These frameworks employed cross-attention mechanisms to amalgamate visual and textual features [22, 23], commonly adopting a dual-stream architecture. The introduction of CLIP [1] marked a significant breakthrough in the VLP field, leading to the genesis of a plethora of CLIP-based VLP frameworks. These frameworks integrate contrastive loss as a fundamental component [4, 2], thereby enriching the scope and effectiveness of VLP methodologies.

In the medical field, rapid advancements have also been made in multi-modal pre-training. ConVIRT [24] serves as an equivalent to CLIP [1] in the medical domain. MedCLIP [9] ingeniously addressed the challenge of insufficient paired image-text data in healthcare by integrating knowledge extraction techniques to decouple image-text pairs. Similarly, BioViL [25] demonstrates enhanced performance through the training of specialized biomedical text BERT encoders in contrastive learning tasks. In terms of multi-level alignment, GLoRIA�[10] proposed multi-modal global-local representation learning of instance-level and token-level. MGCA [11] introduced alignment at three levels: pathological region, instance, and disease. Furthermore, study [26] incorporated knowledge bases to infuse expert knowledge from the medical domain into the system.

II-B Eye-tracking Technology in Radiology

In the realm of medical imaging diagnostics, the visual analysis performed by professional radiologists plays a decisive role. A key technique in this domain is eye-tracking, which has demonstrated its value in radiological research over the past several decades [27]. Early investigations have found that experienced radiologists are able to quickly identify hidden lesions through comprehensive observation, a process that relies on their broader field of view and extensive professional knowledge [28, 15]. For instance, Ellen et al. [29] have revealed how seasoned radiologists systematically examine standard chest X-rays, in stark contrast to novice doctors.

In the field of medical deep learning, the integration of radiologists’ eye-gaze data has been a significant advancement. Khosravan et al. [16] successfully merged this data with Convolutional Neural Networks (CNNs) for enhanced lesion detection accuracy. Exploring further, Mall et al. [30] delved into the visual search patterns in mammography, establishing a crucial link between human visual attention and CNN performance in detecting mammogram lesions. Karargyris et al. [31] contributed by developing a comprehensive dataset that includes both eye-gaze data and disease diagnosis, facilitating multi-task processing in this domain. In a similar vein, Wang et al. [17] innovated by introducing an attention consistency module, which harnessed radiologists’ visual attention to improve the accuracy of CNNs in diagnosing osteoarthritis from knee X-ray images. Building on these advancements, Ma et al. [14] recently explored the integration of eye-gaze data with advanced Vision Transformer (ViT) models, pushing the boundaries of medical image processing even further.

In the exploration of multi-modal tasks, Men et al.�[32] innovatively crafted a multi-modal guidance system. This system adeptly replicates the combined dynamics of eye tracking and probe manipulation as performed by sonographers in obstetric ultrasound examinations. By effectively mirroring the expertise of medical professionals, the system significantly elevates the accuracy and efficiency of ultrasound scanning. Nonetheless, the integration of these eye-gaze data with image-text alignment strategies for enhancing the effectiveness of medical vision-language models represents an ongoing area of research yet to be fully resolved.

III Method

Refer to caption
Figure 2: The framework of EGMA. After images and text are processed by the encoder in Part A𝐴Aitalic_A, patch feature and sentence feature representations are obtained, resulting in a fine-grained similarity matrix for instances. Subsequently, the two types of eye-gaze-based auxiliary information obtained in Part B𝐵Bitalic_B are used for fine-grained and cross-mapping alignment in Part C𝐶Citalic_C and Part D𝐷Ditalic_D, respectively.

As shown in Fig. 2, the framework of our proposed method consists of four main components. Firstly, we extract features from image and text in part A𝐴Aitalic_A to obtain a refined instance-level similarity matrix. Secondly, in part B𝐵Bitalic_B, we integrate textual transcripts derived from radiologists’ audio, images, and eye-gaze data, to visualize and map radiologists’ attention onto specific regions of images during diagnosis. This process establishes alignment between texts and images, facilitating model training. The detailed gaze data processing methods are described in Sec. III-A. Given that eye-gaze data tightly links textual and localized visual information, after obtaining auxiliary information from part B𝐵Bitalic_B, we introduce eye-gaze guided refined alignment training strategies, as depicted in Parts C𝐶Citalic_C and D𝐷Ditalic_D of Fig. 2. Specifically, we introduce the optimization algorithm for eye-gaze guided fine-grained text-image similarity matrix in Part C𝐶Citalic_C in Sec. III-B. Finally, in Sec. III-C, we present the algorithm for eye-gaze guided cross-modality mapping.

III-A Multi-modal Data Processing

With the development of data collection technologies such as eye-tracking and speech recognition, it has become possible to collect and process multi-modal interaction data of radiologists during the diagnostic process. In this work, we utilize MIMIC-EYE�[33] datasets as our training set, consisting of 3689 images extracted from the MIMIC datasets�[34, 35, 36, 37]. Each sample is accompanied by corresponding eye-tracking data and transcripts text. These eye-tracking data are provided by the publicly available EYE GAZE�[19] and REFLACX�[20] datasets on PhysioNet�[38]. Since each modality is synchronized, the audio data is aligned with the eye-gaze data in time. By segmenting the audio based on the time before and after the pronunciation of each word, we can align the transcripts with the audio, thereby aligning sentence-level text with eye-gaze data. Subsequently, we generate attention heatmap based on eye-gaze data and images to represent the image regions the radiologist focuses on. Through the aforementioned data processing steps, we achieve precise alignment between sentence-level text and image regions. Detailed processing method of eye-gaze and audio transcripts can be found at Supplementary Materials.

III-B Eye-gaze Guided Fine-grained Alignment

The core idea of contrastive learning is to bring the features of related samples closer while pushing away the features of unrelated samples. During the training progress of CLIP [1] model, assuming a batch size of b𝑏bitalic_b and input data {xkI,xkT}superscriptsubscript𝑥𝑘𝐼superscriptsubscript𝑥𝑘𝑇\left\{x_{k}^{I},x_{k}^{T}\right\}{ italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT } (k=1,,b𝑘1𝑏k=1,\cdots,bitalic_k = 1 , ⋯ , italic_b) representing image-text pairs, global features zkI=EI(xkI)1×dsuperscriptsubscript𝑧𝑘𝐼subscript𝐸𝐼superscriptsubscript𝑥𝑘𝐼superscript1𝑑z_{k}^{I}=E_{I}(x_{k}^{I})\in\mathbb{R}^{1\times d}italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT = italic_E start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT 1 × italic_d end_POSTSUPERSCRIPT and zkT=ET(xkT)1×dsuperscriptsubscript𝑧𝑘𝑇subscript𝐸𝑇superscriptsubscript𝑥𝑘𝑇superscript1𝑑z_{k}^{T}=E_{T}(x_{k}^{T})\in\mathbb{R}^{1\times d}italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = italic_E start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT 1 × italic_d end_POSTSUPERSCRIPT are obtained through image encoder EIsubscript𝐸𝐼E_{I}italic_E start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT and text encoder ETsubscript𝐸𝑇E_{T}italic_E start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT. Subsequently, the cosine similarity sk,lI2Tsuperscriptsubscript𝑠𝑘𝑙𝐼2𝑇s_{k,l}^{I2T}italic_s start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I 2 italic_T end_POSTSUPERSCRIPT and sk,lT2Isuperscriptsubscript𝑠𝑘𝑙𝑇2𝐼s_{k,l}^{T2I}italic_s start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T 2 italic_I end_POSTSUPERSCRIPT between the two modalities is computed, with the following formula:

sk,lI2T=COS(zkI,zlT),sk,lT2I=COS(zkT,zlI)1lbformulae-sequencesuperscriptsubscript𝑠𝑘𝑙𝐼2𝑇𝐶𝑂𝑆superscriptsubscript𝑧𝑘𝐼superscriptsubscript𝑧𝑙𝑇superscriptsubscript𝑠𝑘𝑙𝑇2𝐼𝐶𝑂𝑆superscriptsubscript𝑧𝑘𝑇superscriptsubscript𝑧𝑙𝐼1𝑙𝑏s_{k,l}^{I2T}=COS(z_{k}^{I},z_{l}^{T}),\ s_{k,l}^{T2I}=COS(z_{k}^{T},z_{l}^{I}% )\quad 1\leqslant l\leqslant b\\ italic_s start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I 2 italic_T end_POSTSUPERSCRIPT = italic_C italic_O italic_S ( italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT , italic_z start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) , italic_s start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T 2 italic_I end_POSTSUPERSCRIPT = italic_C italic_O italic_S ( italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , italic_z start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT ) 1 ⩽ italic_l ⩽ italic_b (1)

where sk,lI2Tsuperscriptsubscript𝑠𝑘𝑙𝐼2𝑇s_{k,l}^{I2T}italic_s start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I 2 italic_T end_POSTSUPERSCRIPT is the image-to-text similarity, sk,lT2Isuperscriptsubscript𝑠𝑘𝑙𝑇2𝐼s_{k,l}^{T2I}italic_s start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T 2 italic_I end_POSTSUPERSCRIPT is the text-to-image similarity, and l𝑙litalic_l is the index number of the another modality. Then, the image-to-text contrastive loss LkI2Tsuperscriptsubscript𝐿𝑘𝐼2𝑇L_{k}^{I2T}italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I 2 italic_T end_POSTSUPERSCRIPT for xkIsuperscriptsubscript𝑥𝑘𝐼x_{k}^{I}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT and text-to-image contrastive loss LkT2Isuperscriptsubscript𝐿𝑘𝑇2𝐼L_{k}^{T2I}italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T 2 italic_I end_POSTSUPERSCRIPT for xkTsuperscriptsubscript𝑥𝑘𝑇x_{k}^{T}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT can be formulated as:

LkI2T(xkI,{xlT}l=1b)=1blogexp(sk,kI2T/τ)l(exp(sk,lI2T/τ))superscriptsubscript𝐿𝑘𝐼2𝑇superscriptsubscript𝑥𝑘𝐼superscriptsubscriptsuperscriptsubscript𝑥𝑙𝑇𝑙1𝑏1𝑏𝑙𝑜𝑔𝑒𝑥𝑝superscriptsubscript𝑠𝑘𝑘𝐼2𝑇𝜏subscript𝑙𝑒𝑥𝑝superscriptsubscript𝑠𝑘𝑙𝐼2𝑇𝜏\displaystyle L_{k}^{I2T}(x_{k}^{I},\left\{x_{l}^{T}\right\}_{l=1}^{b})=-\frac% {1}{b}log\frac{exp(s_{k,k}^{I2T}/\tau)}{\textstyle\sum_{l}(exp(s_{k,l}^{I2T}/% \tau))}italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I 2 italic_T end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT , { italic_x start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ) = - divide start_ARG 1 end_ARG start_ARG italic_b end_ARG italic_l italic_o italic_g divide start_ARG italic_e italic_x italic_p ( italic_s start_POSTSUBSCRIPT italic_k , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I 2 italic_T end_POSTSUPERSCRIPT / italic_τ ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ( italic_e italic_x italic_p ( italic_s start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I 2 italic_T end_POSTSUPERSCRIPT / italic_τ ) ) end_ARG (2)
LkT2I(xkT,{xlI}l=1b)=1blogexp(sk,kT2I/τ)l(exp(sk,lT2I/τ))superscriptsubscript𝐿𝑘𝑇2𝐼superscriptsubscript𝑥𝑘𝑇superscriptsubscriptsuperscriptsubscript𝑥𝑙𝐼𝑙1𝑏1𝑏𝑙𝑜𝑔𝑒𝑥𝑝superscriptsubscript𝑠𝑘𝑘𝑇2𝐼𝜏subscript𝑙𝑒𝑥𝑝superscriptsubscript𝑠𝑘𝑙𝑇2𝐼𝜏\displaystyle L_{k}^{T2I}(x_{k}^{T},\left\{x_{l}^{I}\right\}_{l=1}^{b})=-\frac% {1}{b}log\frac{exp(s_{k,k}^{T2I}/\tau)}{\textstyle\sum_{l}(exp(s_{k,l}^{T2I}/% \tau))}italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T 2 italic_I end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , { italic_x start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ) = - divide start_ARG 1 end_ARG start_ARG italic_b end_ARG italic_l italic_o italic_g divide start_ARG italic_e italic_x italic_p ( italic_s start_POSTSUBSCRIPT italic_k , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T 2 italic_I end_POSTSUPERSCRIPT / italic_τ ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ( italic_e italic_x italic_p ( italic_s start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T 2 italic_I end_POSTSUPERSCRIPT / italic_τ ) ) end_ARG

where τ𝜏\tauitalic_τ is a learned temperature. It is worth noting that in the calculation of the loss mentioned above, both the image and text utilize global-level features, while the auxiliary information generated from eye-gaze data emphasizes the local-level features between modalities. Therefore, based on [4], we replace instance feature zkIsuperscriptsubscript𝑧𝑘𝐼z_{k}^{I}italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT and zkTsuperscriptsubscript𝑧𝑘𝑇z_{k}^{T}italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT with Pknn×dsuperscriptsubscript𝑃𝑘𝑛superscript𝑛𝑑P_{k}^{n}\in\mathbb{R}^{n\times d}italic_P start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_d end_POSTSUPERSCRIPT and Skmm×dsuperscriptsubscript𝑆𝑘𝑚superscript𝑚𝑑S_{k}^{m}\in\mathbb{R}^{m\times d}italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_d end_POSTSUPERSCRIPT, where Pkisuperscriptsubscript𝑃𝑘𝑖P_{k}^{i}italic_P start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT(1in1𝑖𝑛1\leqslant i\leqslant n1 ⩽ italic_i ⩽ italic_n) is the i𝑖iitalic_i-th patch feature of xkIsuperscriptsubscript𝑥𝑘𝐼x_{k}^{I}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT and Skjsuperscriptsubscript𝑆𝑘𝑗S_{k}^{j}italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT(1jm1𝑗𝑚1\leqslant j\leqslant m1 ⩽ italic_j ⩽ italic_m) is the j𝑗jitalic_j-th sentence feature of xkTsuperscriptsubscript𝑥𝑘𝑇x_{k}^{T}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT, and n,m𝑛𝑚n,mitalic_n , italic_m are the image patch number and the sentence number of report. Then we calculate the similarities of sentence-to-patch xkS2Pm×nsuperscriptsubscript𝑥𝑘𝑆2𝑃superscript𝑚𝑛x_{k}^{S2P}\in\mathbb{R}^{m\times n}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S 2 italic_P end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT and patch-to-sentence xkP2Sn×msuperscriptsubscript𝑥𝑘𝑃2𝑆superscript𝑛𝑚x_{k}^{P2S}\in\mathbb{R}^{n\times m}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P 2 italic_S end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_m end_POSTSUPERSCRIPT in one instance:

xkS2P=COS(Skj,Pki),xkP2S=COS(Pki,Skj)formulae-sequencesuperscriptsubscript𝑥𝑘𝑆2𝑃𝐶𝑂𝑆superscriptsubscript𝑆𝑘𝑗superscriptsubscript𝑃𝑘𝑖superscriptsubscript𝑥𝑘𝑃2𝑆𝐶𝑂𝑆superscriptsubscript𝑃𝑘𝑖superscriptsubscript𝑆𝑘𝑗x_{k}^{S2P}=COS(S_{k}^{j},P_{k}^{i}),\ x_{k}^{P2S}=COS(P_{k}^{i},S_{k}^{j})italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S 2 italic_P end_POSTSUPERSCRIPT = italic_C italic_O italic_S ( italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , italic_P start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) , italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P 2 italic_S end_POSTSUPERSCRIPT = italic_C italic_O italic_S ( italic_P start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) (3)

For each heatmap corresponding to a sentence, we initially divide it into n𝑛nitalic_n patches similar to the image. Subsequently, we concatenate the heatmaps of m𝑚mitalic_m sentences to obtain the Gaze-guided Similarity matrix GSk𝐺subscript𝑆𝑘GS_{k}italic_G italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT for input {xkI,xkT}superscriptsubscript𝑥𝑘𝐼superscriptsubscript𝑥𝑘𝑇\left\{x_{k}^{I},x_{k}^{T}\right\}{ italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT } (as illustrated in Fig. 2.B). In this matrix, non-zero elements indicate the semantic correlation between the corresponding sentences and image patches. Thus, we binarize GSk𝐺subscript𝑆𝑘GS_{k}italic_G italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, setting non-zero regions to 1, resulting in the Gaze-guided Label matrix GLk𝐺subscript𝐿𝑘GL_{k}italic_G italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. After this step, we compute the multi-label cross-entropy (MLCE) loss for xkS2Psuperscriptsubscript𝑥𝑘𝑆2𝑃x_{k}^{S2P}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S 2 italic_P end_POSTSUPERSCRIPT and xkP2Ssuperscriptsubscript𝑥𝑘𝑃2𝑆x_{k}^{P2S}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P 2 italic_S end_POSTSUPERSCRIPT, completing the optimization for fine-grained alignment between positive sample pairs {xkI,xkT}superscriptsubscript𝑥𝑘𝐼superscriptsubscript𝑥𝑘𝑇\left\{x_{k}^{I},x_{k}^{T}\right\}{ italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT }, as follows:

fLkS2P=mlce(xkS2P,GLk)𝑓superscriptsubscript𝐿𝑘𝑆2𝑃𝑚𝑙𝑐𝑒superscriptsubscript𝑥𝑘𝑆2𝑃𝐺subscript𝐿𝑘\displaystyle fL_{k}^{S2P}=mlce(x_{k}^{S2P},GL_{k})italic_f italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S 2 italic_P end_POSTSUPERSCRIPT = italic_m italic_l italic_c italic_e ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S 2 italic_P end_POSTSUPERSCRIPT , italic_G italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) (4)
fLkP2S=mlce(xkP2S,(GLk)T)𝑓superscriptsubscript𝐿𝑘𝑃2𝑆𝑚𝑙𝑐𝑒superscriptsubscript𝑥𝑘𝑃2𝑆superscript𝐺subscript𝐿𝑘T\displaystyle fL_{k}^{P2S}=mlce(x_{k}^{P2S},(GL_{k})^{\rm T})italic_f italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P 2 italic_S end_POSTSUPERSCRIPT = italic_m italic_l italic_c italic_e ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P 2 italic_S end_POSTSUPERSCRIPT , ( italic_G italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT )

where mlce𝑚𝑙𝑐𝑒mlceitalic_m italic_l italic_c italic_e is the multi-label cross-entropy loss. Subsequently, we calculate the fine-grained features z^kI=Meani(Maxj(xkP2S))superscriptsubscript^𝑧𝑘𝐼𝑀𝑒𝑎subscript𝑛𝑖𝑀𝑎subscript𝑥𝑗superscriptsubscript𝑥𝑘𝑃2𝑆\hat{z}_{k}^{I}=Mean_{i}(Max_{j}(x_{k}^{P2S}))over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT = italic_M italic_e italic_a italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_M italic_a italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P 2 italic_S end_POSTSUPERSCRIPT ) ) and z^kT=Meanj(Maxi(xkS2P))superscriptsubscript^𝑧𝑘𝑇𝑀𝑒𝑎subscript𝑛𝑗𝑀𝑎subscript𝑥𝑖superscriptsubscript𝑥𝑘𝑆2𝑃\hat{z}_{k}^{T}=Mean_{j}(Max_{i}(x_{k}^{S2P}))over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = italic_M italic_e italic_a italic_n start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_M italic_a italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S 2 italic_P end_POSTSUPERSCRIPT ) ). Then, we replace the zkI,zkTsuperscriptsubscript𝑧𝑘𝐼superscriptsubscript𝑧𝑘𝑇z_{k}^{I},z_{k}^{T}italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT , italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT with the updated z^kI,z^kTsuperscriptsubscript^𝑧𝑘𝐼superscriptsubscript^𝑧𝑘𝑇\hat{z}_{k}^{I},\hat{z}_{k}^{T}over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT , over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT in Eq. 1. Finally, the fine-grained image-to-text loss L^kI2Tsuperscriptsubscript^𝐿𝑘𝐼2𝑇\hat{L}_{k}^{I2T}over^ start_ARG italic_L end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I 2 italic_T end_POSTSUPERSCRIPT and text-to-image loss L^kT2Isuperscriptsubscript^𝐿𝑘𝑇2𝐼\hat{L}_{k}^{T2I}over^ start_ARG italic_L end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T 2 italic_I end_POSTSUPERSCRIPT are computed based on Eq. 2. The formula for our Eye-gaze Guided Fine-grained (EGF) alignment loss is as follows:

LEGF=12bk=1b(fLkS2P+fLkP2S)+12k=1b(L^kT2I+L^kI2T)subscript𝐿𝐸𝐺𝐹12𝑏superscriptsubscript𝑘1𝑏𝑓superscriptsubscript𝐿𝑘𝑆2𝑃𝑓superscriptsubscript𝐿𝑘𝑃2𝑆12superscriptsubscript𝑘1𝑏superscriptsubscript^𝐿𝑘𝑇2𝐼superscriptsubscript^𝐿𝑘𝐼2𝑇L_{EGF}=\frac{1}{2b}\sum_{k=1}^{b}(fL_{k}^{S2P}+fL_{k}^{P2S})+\frac{1}{2}\sum_% {k=1}^{b}(\hat{L}_{k}^{T2I}+\hat{L}_{k}^{I2T})italic_L start_POSTSUBSCRIPT italic_E italic_G italic_F end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG 2 italic_b end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ( italic_f italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S 2 italic_P end_POSTSUPERSCRIPT + italic_f italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P 2 italic_S end_POSTSUPERSCRIPT ) + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ( over^ start_ARG italic_L end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T 2 italic_I end_POSTSUPERSCRIPT + over^ start_ARG italic_L end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I 2 italic_T end_POSTSUPERSCRIPT ) (5)

III-C Eye-gaze Guided Cross-modality Mapping

In the previous section, we replaced the global instance logits in the traditional batch clip loss with fine-grained instance logits that consider local features and optimized the alignment between these local features using gaze information. The text in our work is recorded by radiologists while observing images, implying a close semantic relationship between the focus region and the corresponding text. To further optimize the alignment between modalities, we continue to incorporate eye-gaze data assistance into the cross-modality mapping process. In this work, we first utilize matrices GSk𝐺subscript𝑆𝑘GS_{k}italic_G italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, xkP2Ssuperscriptsubscript𝑥𝑘𝑃2𝑆x_{k}^{P2S}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P 2 italic_S end_POSTSUPERSCRIPT and xkS2Psuperscriptsubscript𝑥𝑘𝑆2𝑃x_{k}^{S2P}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S 2 italic_P end_POSTSUPERSCRIPT to generate the image-to-text and text-to-image alignment weight matrix WI2Tn×msuperscript𝑊𝐼2𝑇superscript𝑛𝑚W^{I2T}\in\mathbb{R}^{n\times m}italic_W start_POSTSUPERSCRIPT italic_I 2 italic_T end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_m end_POSTSUPERSCRIPT and WT2Im×nsuperscript𝑊𝑇2𝐼superscript𝑚𝑛W^{T2I}\in\mathbb{R}^{m\times n}italic_W start_POSTSUPERSCRIPT italic_T 2 italic_I end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT. The calculation formula is as follows:

WI2T=norm(ω(xkP2S)+GSk)superscript𝑊𝐼2𝑇𝑛𝑜𝑟𝑚𝜔superscriptsubscript𝑥𝑘𝑃2𝑆𝐺subscript𝑆𝑘\displaystyle W^{I2T}=norm(\omega(x_{k}^{P2S})+GS_{k})italic_W start_POSTSUPERSCRIPT italic_I 2 italic_T end_POSTSUPERSCRIPT = italic_n italic_o italic_r italic_m ( italic_ω ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P 2 italic_S end_POSTSUPERSCRIPT ) + italic_G italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) (6)
WT2I=norm(ω(xkS2P)+(GSk)T)superscript𝑊𝑇2𝐼𝑛𝑜𝑟𝑚𝜔superscriptsubscript𝑥𝑘𝑆2𝑃superscript𝐺subscript𝑆𝑘T\displaystyle W^{T2I}=norm(\omega(x_{k}^{S2P})+(GS_{k})^{\rm T})italic_W start_POSTSUPERSCRIPT italic_T 2 italic_I end_POSTSUPERSCRIPT = italic_n italic_o italic_r italic_m ( italic_ω ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S 2 italic_P end_POSTSUPERSCRIPT ) + ( italic_G italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT )

where norm𝑛𝑜𝑟𝑚normitalic_n italic_o italic_r italic_m is normalization and ω𝜔\omegaitalic_ω consists of sparse and binarize operations. After obtaining the weight matrix, we perform the mapping from text features Skmsuperscriptsubscript𝑆𝑘𝑚S_{k}^{m}italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT to image features Cross_Pknn×d𝐶𝑟𝑜𝑠𝑠_superscriptsubscript𝑃𝑘𝑛superscript𝑛𝑑Cross\_P_{k}^{n}\in\mathbb{R}^{n\times d}italic_C italic_r italic_o italic_s italic_s _ italic_P start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_d end_POSTSUPERSCRIPT and from image features Pknsuperscriptsubscript𝑃𝑘𝑛P_{k}^{n}italic_P start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT to text features Cross_Skmm×d𝐶𝑟𝑜𝑠𝑠_superscriptsubscript𝑆𝑘𝑚superscript𝑚𝑑Cross\_S_{k}^{m}\in\mathbb{R}^{m\times d}italic_C italic_r italic_o italic_s italic_s _ italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_d end_POSTSUPERSCRIPT according to the following formula:

Cross_Pki=j=1mSkjWijI2T,Cross_Skj=i=1nPkiWjiT2Iformulae-sequence𝐶𝑟𝑜𝑠𝑠_superscriptsubscript𝑃𝑘𝑖superscriptsubscript𝑗1𝑚superscriptsubscript𝑆𝑘𝑗subscriptsuperscript𝑊𝐼2𝑇𝑖𝑗𝐶𝑟𝑜𝑠𝑠_superscriptsubscript𝑆𝑘𝑗superscriptsubscript𝑖1𝑛superscriptsubscript𝑃𝑘𝑖subscriptsuperscript𝑊𝑇2𝐼𝑗𝑖Cross\_P_{k}^{i}={\sum_{j=1}^{m}S_{k}^{j}\cdot W^{I2T}_{ij}},\ Cross\_S_{k}^{j% }={\sum_{i=1}^{n}P_{k}^{i}\cdot W^{T2I}_{ji}}italic_C italic_r italic_o italic_s italic_s _ italic_P start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ⋅ italic_W start_POSTSUPERSCRIPT italic_I 2 italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT , italic_C italic_r italic_o italic_s italic_s _ italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_P start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_W start_POSTSUPERSCRIPT italic_T 2 italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT (7)

where i[1,n]𝑖1𝑛i\in[1,n]italic_i ∈ [ 1 , italic_n ] is the i𝑖iitalic_i-th patch feature of Pknsuperscriptsubscript𝑃𝑘𝑛P_{k}^{n}italic_P start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and j[1,m]𝑗1𝑚j\in[1,m]italic_j ∈ [ 1 , italic_m ] is the j𝑗jitalic_j-th sentence feature of Skmsuperscriptsubscript𝑆𝑘𝑚S_{k}^{m}italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT. Subsequently, we use the mapped features along with the target features as inputs to compute the alignment contrastive loss defined in Eq. 2, obtaining the image mapping loss mLkI𝑚superscriptsubscript𝐿𝑘𝐼mL_{k}^{I}italic_m italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT and the text mapping loss mLkT𝑚superscriptsubscript𝐿𝑘𝑇mL_{k}^{T}italic_m italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT. The formula for our Eye-gaze Guided cross-model Mapping (EGM) loss is as follows:

LEGM=12k=1b(mLkI+mLkT)subscript𝐿𝐸𝐺𝑀12superscriptsubscript𝑘1𝑏𝑚superscriptsubscript𝐿𝑘𝐼𝑚superscriptsubscript𝐿𝑘𝑇L_{EGM}=\frac{1}{2}\sum_{k=1}^{b}(mL_{k}^{I}+mL_{k}^{T})italic_L start_POSTSUBSCRIPT italic_E italic_G italic_M end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ( italic_m italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT + italic_m italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) (8)

Finally, the total loss of our model within a batch is L=LEGF+LEGM𝐿subscript𝐿𝐸𝐺𝐹subscript𝐿𝐸𝐺𝑀L=L_{EGF}+L_{EGM}italic_L = italic_L start_POSTSUBSCRIPT italic_E italic_G italic_F end_POSTSUBSCRIPT + italic_L start_POSTSUBSCRIPT italic_E italic_G italic_M end_POSTSUBSCRIPT. In our training process, considering the proportion of eye-gaze data, batches may contain both types of data. When encountering samples without eye-gaze data, the EGF module does not compute the loss from Eq. 4, and the weight matrix in the Eq. 6 of EGM module also excludes the GSk𝐺subscript𝑆𝑘GS_{k}italic_G italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT.

IV Experiments

In this study, we first conduct supervised and zero-shot classification as well as zero-shot retrieval experiments in Sec. IV-A to validate the model’s generalization performance and its representation capability of multi-modal features. Then, in Sec. IV-B, we perform ablation studies on various modules of EGMA. Additionally, to further investigate the auxiliary effect of eye-gaze data, we compare the performance when guided by different amounts of eye-gaze data. Finally, in Sec. IV-C, we visualize the model’s feature representations and the learned image-text relationships, further demonstrating the model’s performance and interpretability.

TABLE I: Comparison results of supervised classification task with other SOTA models on CheXpert, RSNA, and SIIM-ACR datasets. Area under ROC curve (AUROC) is reported with different portions of training data: 1%, 10%, 100%. Red and blue denote the best and second-best results.
Method CheXpert[39] RSNA [40] SIIM-ACR [41]
1% 10% 100% 1% 10% 100% 1% 10% 100%
ConVIRT [24] 85.90 86.80 87.30 77.40 80.10 88.60 - - -
BioViL [25] 81.95 85.37 88.62 81.76 85.68 88.64 80.26 82.79 90.51
MedKLIP [26] - - - 87.31 87.99 89.31 85.27 90.71 91.88
MGCA [11] 85.80 87.66 89.30 85.22 87.54 89.24 86.12 89.66 92.16
GLoRIA [10] 86.60 87.80 88.10 86.10 88.00 88.60 - - -
PRIOR [43] 86.16 87.08 89.08 86.72 88.07 89.19 88.35 89.72 92.49
MedCLIP [9] 85.74 87.49 88.02 87.61 88.19 89.10 88.84 91.13 92.18
EGMA(Ours) 87.71 88.92 89.50 88.41 89.40 90.10 90.78 92.17 93.29

IV-A Comparison with State-of-the-Arts

Image Classification We conduct supervised classification experiments on the CheXpert [39], RSNA [40], and SIIM-ACR [41] datasets. CheXpert [39] is a large-scale public dataset for chest radiograph interpretation, it comprises 224,316 chest radiographic images. Following [11], we utilize the official training split as our training set, and the official validation set of 202 images with expert-label as our test set. RSNA [40] is a comprehensive dataset for Pneumonia diagnosing. It contains 29,700 chest X-ray images categorized into normal𝑛𝑜𝑟𝑚𝑎𝑙normalitalic_n italic_o italic_r italic_m italic_a italic_l and pneumonia𝑝𝑛𝑒𝑢𝑚𝑜𝑛𝑖𝑎pneumoniaitalic_p italic_n italic_e italic_u italic_m italic_o italic_n italic_i italic_a positive category. We follow [11] to divide the data into 70% for training, 15% for validation, and 15% for testing. SIIM-ACR [41] is a chest dataset used for pneumothorax diagnosing. It consists of 2379 images with pneumothorax and 8300 images without pneumothorax. In this work, we utilize a subset defined in [42] as our test set, with the remaining data used for training and validation. More details of dataset can be found in the supplementary materials.

In the supervised classification experiments, we adopt the linear classification settings [10], where the pre-trained image encoder is frozen, and only a randomly initialized linear classification head is trained. We adopt area under ROC curve (AUROC) metric to evaluate all model’s performance. And for better validate the model’s efficiency, we test its performance using 1%, 10%, and 100% of the training set. As shown in Tab. I, our model achieved the best results compared to other models. Additionally, with only 1% of the training set, our model outperformed the second-best model by 1.11%, 0.8%, and 1.94% on the CheXpert, RSNA, and SIIM-ACR datasets, respectively. Moreover, as the amount of training data increased, the model’s performance improved significantly. This demonstrates that, with the assistance of radiologists’ eye-gaze data, our model possesses strong multi-modal feature representation capabilities.

TABLE II: Comparison results of zero-shot classification tasks with other SOTA models on CheXpert 5x200, RSNA, and SIIM-ACR datasets. The Accuracy (Acc.) and F1-score (F1) metrics are reported. Red and blue denote the best and second-best results.
Method CheXpert 5x200 [10] RSNA [40] SIIM-ACR [41]
Acc.\uparrow F1\uparrow Acc.\uparrow F1\uparrow Acc.\uparrow F1\uparrow
CLIP [1] 20.10 9.12 25.03 22.07 49.39 47.98
GLoRIA [10] 53.30 48.99 29.15 28.54 22.57 22.57
PRIOR [43] 34.90 30.56 76.77 51.80 50.00 33.33
MGCA [11] 43.60 41.37 60.83 57.77 30.03 25.45
MedCLIP [9] 57.50 55.97 43.09 31.01 58.40 57.85
EGMA(Ours) 61.30 60.38 76.97 43.49 63.62 61.46

We further conduct zero-shot classification tasks on the CheXpert5x200 [10], RSNA [40], and SIIM-ACR [41] datasets. CheXpert5x200 includes five common chest diseases, Atelectasis𝐴𝑡𝑒𝑙𝑒𝑐𝑡𝑎𝑠𝑖𝑠Atelectasisitalic_A italic_t italic_e italic_l italic_e italic_c italic_t italic_a italic_s italic_i italic_s, Cardiomegaly𝐶𝑎𝑟𝑑𝑖𝑜𝑚𝑒𝑔𝑎𝑙𝑦Cardiomegalyitalic_C italic_a italic_r italic_d italic_i italic_o italic_m italic_e italic_g italic_a italic_l italic_y, Consolidation𝐶𝑜𝑛𝑠𝑜𝑙𝑖𝑑𝑎𝑡𝑖𝑜𝑛Consolidationitalic_C italic_o italic_n italic_s italic_o italic_l italic_i italic_d italic_a italic_t italic_i italic_o italic_n, Edema𝐸𝑑𝑒𝑚𝑎Edemaitalic_E italic_d italic_e italic_m italic_a, and PleuralEffusion𝑃𝑙𝑒𝑢𝑟𝑎𝑙𝐸𝑓𝑓𝑢𝑠𝑖𝑜𝑛Pleural\ Effusionitalic_P italic_l italic_e italic_u italic_r italic_a italic_l italic_E italic_f italic_f italic_u italic_s italic_i italic_o italic_n, each with 200 chest X-rays. It is important to note that the CheXpert training set does not include any data from CheXpert5x200, so there is no data leakage issue. The test sets for RSNA and SIIM-ACR are the same as those used in the supervised classification task. All text prompts are provided by a professional radiologist [10]. During testing, we calculated the similarity between image features and text prompt features for all diseases, with the highest similarity indicating the predicted category. As shown in Tab. II, CLIP [1] performs poorly on medical images due to its training data primarily consisting of natural images. The models in rows two to five use encoders pre-trained on medical datasets, and thus, their performance is better than that of CLIP. Interestingly, the GLoRIA [10] and MGCA [11] perform worse than the CLIP model in diagnosing pneumonia on the SIIM-ACR dataset. This indicates that these models are significantly influenced by the data distribution, resulting in poor generalization performance. Conversely, our EGMA achieves the best results in all other metrics, except for the F1-score on the RSNA dataset. This demonstrates that our model, enhanced by eye-gaze data, has learned more generalizable feature relationships between medical images and text, significantly improving its generalization performance.

TABLE III: Comparison results of zero-shot retrieval task with other SOTA models on CheXpert 8x200 dataset. The Precision at Top-1, Top-5, and Top-10 are reported. Red and blue denote the best and second-best results.
Method Image-to-text Text-to-image
P@1\uparrow P@5\uparrow P@10\uparrow P@1\uparrow P@5\uparrow P@10\uparrow
CLIP [1] 12.75 12.48 10.03 5.00 12.50 12.50
MedCLIP [9] 14.50 15.98 15.86 12.50 12.50 15.00
MGCA [11] 35.00 27.80 23.33 45.00 47.50 44.00
GLoRIA [10] 38.75 31.62 24.51 52.50 49.00 50.25
ConVIRT [24] - - - 60.25 60.00 57.50
EGMA(Ours) 42.65 37.50 28.84 80.00 74.50 69.50

Image-text Retrieval To further validate the alignment capability of our model between visual and textual features, we compare the zero-shot retrieval performance of EGMA with other models on CheXpert 8x200 dataset [24]. Unlike CheXpert5x200 [10], CheXpert8x200 includes eight common chest diseases, NoFinding𝑁𝑜𝐹𝑖𝑛𝑑𝑖𝑛𝑔No\ Findingitalic_N italic_o italic_F italic_i italic_n italic_d italic_i italic_n italic_g, Cardiomegaly𝐶𝑎𝑟𝑑𝑖𝑜𝑚𝑒𝑔𝑎𝑙𝑦Cardiomegalyitalic_C italic_a italic_r italic_d italic_i italic_o italic_m italic_e italic_g italic_a italic_l italic_y, Edema𝐸𝑑𝑒𝑚𝑎Edemaitalic_E italic_d italic_e italic_m italic_a, Pneumonia𝑃𝑛𝑒𝑢𝑚𝑜𝑛𝑖𝑎Pneumoniaitalic_P italic_n italic_e italic_u italic_m italic_o italic_n italic_i italic_a, Atelectasis𝐴𝑡𝑒𝑙𝑒𝑐𝑡𝑎𝑠𝑖𝑠Atelectasisitalic_A italic_t italic_e italic_l italic_e italic_c italic_t italic_a italic_s italic_i italic_s, Pneumothorax𝑃𝑛𝑒𝑢𝑚𝑜𝑡𝑜𝑟𝑎𝑥Pneumothoraxitalic_P italic_n italic_e italic_u italic_m italic_o italic_t italic_h italic_o italic_r italic_a italic_x, PleuralEffusion𝑃𝑙𝑒𝑢𝑟𝑎𝑙𝐸𝑓𝑓𝑢𝑠𝑖𝑜𝑛Pleural\ Effusionitalic_P italic_l italic_e italic_u italic_r italic_a italic_l italic_E italic_f italic_f italic_u italic_s italic_i italic_o italic_n, and Fracture𝐹𝑟𝑎𝑐𝑡𝑢𝑟𝑒Fractureitalic_F italic_r italic_a italic_c italic_t italic_u italic_r italic_e, each with 200 chest X-rays and five corresponding text prompts. It is worth noting that the prompts for retrieval tasks are different from those for classification tasks in the previous section, but all are written by board-certified radiologists. In the image-to-text retrieval task, we first compute the similarity between the image and all candidate texts, and then rank the retrieved results. Similarly, in the text-to-image task, we compute the similarity between the textual prompts and all images, and rank the retrieval results. We report Precision at Top-1, Top-5, and Top-10, which reflect how many relevant examples are retrieved. As shown in Tab. III, our model achieves the best results in both retrieve tasks. Our model outperforms the second-best model in the image-to-text and text-to-image retrieval tasks by 3.9%, 5.88%, and 4.33%, and 19.75%, 14.50%, and 12% in terms of P@1, P@5, and P@10 metrics, respectively. This indicates that our model has fully learned the relationship between images and texts, achieving better alignment effects.

IV-B Ablation Study

To further validate the model’s performance, we conducted ablation experiments on the proposed EGF and EGM modules, while also assessing the impact of the proportion of eye-gaze data on the model results. As shown in the upper half of Tab. IV, the first row represents our Baseline model, where we utilize the initialized weights pre-trained on CheXpert [39] and MIMIC-CXR [35] datasets [9]. The second row “MLCE” indicates that within our EGF module, the EGF loss is not further computed beyond the Eq. 4, instead, only the multi-label cross-entropy (MLCE) loss between the eye-gaze guided similarity matrix and the model’s output similarity matrix is calculated. The third row “EGF” utilizes the Eye-gaze Guided Fine-grained loss described in Eq. 5. The fourth row “EGM” indicates that the model is trained solely through the Eye-gaze Guide cross-model Mapping method. Finally, the fifth row presents our proposed EGMA model, which integrates the aforementioned modules guided by eye-gaze data.

In Tab. IV, it can be observed that the method using only gaze-guided MLCE loss significantly improves performance compared to the baseline on CheXpert 5x200 dataset, with a slight improvement on RSNA but a severe decline on SIIM-ACR dataset. However, models using EGF or EGM show significant improvements on SIIM-ACR. This indicates that while MLCE improves performance on some datasets, it simultaneously reduces the model’s generalization ability. Thus, relying solely on simple loss for similarity matrix is insufficient. In this work, by combining eye-gaze guided image-text relationships with fine-grained feature alignment (EGF), although the model’s performance slightly decreases on CheXpert 5x200, its overall generalization improves. Similarly, to enhance the model’s multi-modal alignment ability, introducing eye-gaze guided cross-modal mapping results in improved performance and generalization, with EGM achieving optimal performance on RSNA dataset. Finally, when optimizing both fine-grained alignment and cross-modal alignment using eye-gaze, the model achieves dominant performance on all three datasets, demonstrating further enhancement in generalization.

Numerous studies [16, 17, 14, 13] have demonstrated that training models using eye-gaze data can achieve comparable performance to models trained with fine-grained manual annotations. Meanwhile, the cost of collecting fine-grained manual annotations is significantly higher than that of collecting eye-gaze data. Therefore, incorporating eye-gaze into pre-training tasks is a feasible approach to enhancing model performance. To further validate the efficiency of our model using eye-gaze data, we conduct ablation experiments on the proportion of it in the training set. Our training dataset, MIMIC-EYE, consists of a total of 3695 samples. We perform ablation experiments using 1%, 5%, 10%, and 50% of the eye-gaze data, resulting in 37, 185, 370, and 1848 samples with prior information from radiologists, respectively. We repeat each experiment three times to eliminate the bias caused by random sampling, and report the average results. As shown in the lower part of Tab. IV, the model’s performance on the CheXpert 5x200 dataset improved when trained with 1% of eye-gaze data. However, due to the limited data volume, the model’s performance on other datasets is inferior to the baseline. When increasing the eye-gaze data to 5%, the model shows significant improvements on all three datasets. With the continuous increase in eye-gaze data, the performance of the model also improves. Therefore, even with a small amount of eye-gaze data (185 samples), our framework can effectively guide the model’s multi-modal processing capability, ensuring performance enhancement. This further illustrates the applicability of our model and its low training cost characteristics.

TABLE IV: Comparison results of zero-shot classification ablation experiments on CheXpert 5x200, RSNA, and SIIM-ACR datasets. The Accuracy (Acc.) and F1-score (F1) metrics are reported. Each value in the lower part is the average of three runs. Red and blue denote the best and second-best results.
Method CheXpert5x200 [39] RSNA [40] SIIM-ACR [41]
Acc.\uparrow F1\uparrow Acc.\uparrow F1\uparrow Acc.\uparrow F1\uparrow
Baseline 57.50 55.97 43.09 31.01 58.40 57.85
MLCE 60.90 59.59 47.06 33.04 27.43 22.81
EGF 60.30 58.44 53.81 35.52 63.54 65.70
EGM 59.30 57.74 54.68 35.80 52.61 47.85
Unified(Ours) 61.30 60.38 76.97 43.49 63.62 61.46
1% Gaze 58.93plus-or-minus\pm0.06 56.62plus-or-minus\pm0.05 40.38plus-or-minus\pm0.01 29.75plus-or-minus\pm0.01 57.90plus-or-minus\pm0.21 57.37plus-or-minus\pm0.24
5% Gaze 58.93plus-or-minus\pm.006 56.69plus-or-minus\pm0.06 53.00plus-or-minus\pm0.05 35.13plus-or-minus\pm0.01 59.20plus-or-minus\pm0.11 58.51plus-or-minus\pm0.10
10% Gaze 59.30plus-or-minus\pm0.01 57.82plus-or-minus\pm0.01 53.78plus-or-minus\pm0.01 35.37plus-or-minus\pm.001 58.27plus-or-minus\pm0.11 57.71plus-or-minus\pm0.12
50% Gaze 59.55plus-or-minus\pm±0.07 58.84±plus-or-minus\pm±0.01 58.54±plus-or-minus\pm±0.07 37.01±plus-or-minus\pm±0.20 61.41±plus-or-minus\pm±0.31 58.88±plus-or-minus\pm±0.22

IV-C Visualization

Refer to caption
Figure 3: Results of cross-modality attention maps visualization. Related text content: (a) ”heart size borderline enlarged”; (b) ”increased bibasilar opacities are the combination of increased bilateral pleural effusions and bibasilar atelectasis”.
Refer to caption
Figure 4: t-SNE visualization on CheXpert 5x200 dataset by CLIP and our EGMA. The figures display points of different colors representing various ground truth disease types and their cluster assignments. The color-coded points illustrate the clustering results of each algorithm.

To better demonstrate the correspondence learned by the EGMA framework between text and radiographic images, we conducted a cross-modality attention maps visualization in Fig. 3. Guided by eye-gaze data, the EGMA framework clearly outperforms other state-of-the-art methods in the field in accurately localizing disease regions. In Fig. 4, we visualize the feature representations of CLIP [1] and our EGMA model on images of CheXpert 5x200 dataset using the t-SNE [44]. It can be observed that our model exhibits better clustering representation. The CLIP model, which was not trained on medical data, is unable to effectively differentiate these diseases. More results of t-SNE visualization can be referred to the supplementary materials, clustering performance of other SOTA methods [10, 11] also inferior to our EGMA.

V Discussion and Conclusion

In this work, we reveal the significant role of radiologists’ eye-gaze data in multi-modal alignment and propose an Eye-gaze Guided Multi-modal Alignment framework called EGMA. Our framework first processes eye-gaze data into token-level relation matrices, then utilizes these matrices to optimize fine-grained alignment between image patches and text tokens. Furthermore, the framework integrates cross-modal mapping, leveraging eye-gaze data to guide feature mapping between images and texts bidirectionally, thereby enhancing the model’s ability to handle multi-modal data. We evaluate EGMA’s zero-shot capabilities and fine-tuned performances on multiple datasets and observe significant improvement in classification and retrieval tasks. Additionally, we investigate the impact of eye-gaze data scale on performance, finding that even small amounts of eye-gaze data can enhance the model’s multi-modal alignment capabilities during pre-training. Overall, our EGMA framework explores the feasibility of incorporating eye-gaze data from radiologists to assist in multi-modal feature alignment during model training, laying the foundation for the application of eye-gaze data in the medical multi-modal domain.

V-A Limitations and Discussion

Our work only compared state-of-the-art methods in classification and retrieval tasks, without conducting downstream tasks such as lesion localization or segmentation. Additionally, our model heavily relies on multi-modal datasets like MIMIC-EYE [33], which can simultaneously collect eye-gaze data, medical images, and diagnostic text. The scenarios for collecting these data are also a significant consideration. For instance, in clinical ultrasound diagnosis [32], radiologists often use both hands to operate the equipment and verbally communicate their diagnostic information to an assistant. In this context, it is convenient to simultaneously record ultrasound images, eye-gaze data, and audio. In contrast, during chest X-ray diagnosis in MIMIC-EYE, radiologists typically record diagnostic information directly in text form rather than verbally. Fortunately, some recent efforts [18, 14] are focusing on how to naturally collect multi-modal data of radiologists during diagnosing. They have designed more flexible collection systems that better accommodate the routine work of radiologists, which is crucial for the widespread adoption of collecting multi-modal diagnostic data such as eye-gaze information.

V-B Potential Impacts

Although the eye-gaze data we used is publicly available and we have permission to use it, some studies [45, 46] have indicated that private information such as gender, age, and mental state of observers can be extracted from eye-gaze data. Therefore, privacy concerns have always been a focal point in using eye-gaze data. To address this, we recommend using de-identification methods to filter eye-gaze data or releasing the data in the form of heatmaps rather than the raw data.

V-C Future Work

In the future, we will continue to optimize these proposed collection systems [45, 46] and explore the guidance role of eye-gaze data between images and handwritten diagnostic reports to accelerate their application in real medical diagnostic scenarios. This will provide a research foundation to alleviate data annotation pressure and enhance model interpretability. Additionally, we will continue to analysis the eye-gaze features, such as temporal features, and further optimize their role in multi-modal feature alignment. We believe this work can serve as a valuable reference for the application of eye-gaze data in multi-modal frameworks and promote its development in the field of medical multi-modality.

References

  • [1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [2] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 965–10 975.
  • [3] Y. Zhong, J. Yang, P. Zhang, C. Li, N. Codella, L. H. Li, L. Zhou, X. Dai, L. Yuan, Y. Li et al., “Regionclip: Region-based language-image pretraining,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 793–16 803.
  • [4] L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu, “Filip: Fine-grained interactive language-image pre-training,” in International Conference on Learning Representations, 2021.
  • [5] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2818–2829.
  • [6] Z. Zhao, Y. Liu, H. Wu, Y. Li, S. Wang, L. Teng, D. Liu, X. Li, Z. Cui, Q. Wang et al., “Clip in medical imaging: A comprehensive survey,” arXiv preprint arXiv:2312.07353, 2023.
  • [7] W. Chen, X. Li, L. Shen, and Y. Yuan, “Fine-grained image-text alignment in medical imaging enables cyclic image-report generation,” arXiv preprint arXiv:2312.08078, 2023.
  • [8] K. Zhang, Y. Yang, J. Yu, H. Jiang, J. Fan, Q. Huang, and W. Han, “Multi-task paired masking with alignment modeling for medical vision-language pre-training,” IEEE Transactions on Multimedia, 2023.
  • [9] Z. Wang, Z. Wu, D. Agarwal, and J. Sun, “Medclip: Contrastive learning from unpaired medical images and text,” arXiv preprint arXiv:2210.10163, 2022.
  • [10] S.-C. Huang, L. Shen, M. P. Lungren, and S. Yeung, “Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3942–3951.
  • [11] F. Wang, Y. Zhou, S. Wang, V. Vardhanabhuti, and L. Yu, “Multi-granularity cross-modal alignment for generalized medical visual representation learning,” arXiv preprint arXiv:2210.06044, 2022.
  • [12] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. S. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, “Shortcut learning in deep neural networks,” Nature Machine Intelligence, vol. 2, no. 11, pp. 665–673, 2020.
  • [13] C. Ma, L. Zhao, Y. Chen, L. Guo, T. Zhang, X. Hu, D. Shen, X. Jiang, and T. Liu, “Rectify vit shortcut learning by visual saliency,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [14] C. Ma et al., “Eye-gaze-guided vision transformer for rectifying shortcut learning,” IEEE Transactions on Medical Imaging, vol. 42, no. 11, pp. 3384–3394, Nov 2023.
  • [15] T. Drew, K. Evans, M. L.-H. Võ, F. L. Jacobson, and J. M. Wolfe, “Informatics in radiology: What can you see in a single glance and how might this guide visual search in medical images?” RadioGraphics, vol. 33, no. 1, pp. 263–274, Jan 2013.
  • [16] N. Khosravan et al., “Gaze2segment: A pilot study for integrating eye-tracking technology into medical image segmentation,” in Medical Computer Vision and Bayesian and Graphical Models for Biomedical Imaging.   Cham, Switzerland: Springer, 2016, pp. 94–104.
  • [17] S. Wang, X. Ouyang, T. Liu, Q. Wang, and D. Shen, “Follow my eye: Using gaze to supervise computer-aided diagnosis,” IEEE Trans. Med. Imag., vol. 41, no. 7, pp. 1688–1698, Jul 2022.
  • [18] N. Khosravan, H. Celik, B. Turkbey, E. C. Jones, B. Wood, and U. Bagci, “A collaborative computer aided diagnosis (c-cad) system with eye-tracking, sparse attentional model, and deep learning,” Medical image analysis, vol. 51, pp. 101–115, 2019.
  • [19] A. Karargyris, S. Kashyap, I. Lourentzou, J. Wu, M. Tong, A. Sharma, S. Abedin, D. Beymer, V. Mukherjee, E. Krupinski et al., “Eye gaze data for chest x-rays,” PhysioNet https://doi. org/10.13026/QFDZ-ZR67, 2020.
  • [20] R. B. Lanfredi, M. Zhang, W. Auffermann, J. Chan, P.-A. Duong, V. Srikumar, T. Drew, J. Schroeder, and T. Tasdizen, “Reflacx: Reports and eye-tracking data for localization of abnormalities in chest x-rays,” 2021.
  • [21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [22] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” in Advances in Neural Information Processing Systems, 2019, pp. 13–23.
  • [23] H. Tan and M. Bansal, “Lxmert: Learning cross-modality encoder representations from transformers,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).   Association for Computational Linguistics, 2019.
  • [24] Y. Zhang, H. Jiang, Y. Miura, C. D. Manning, and C. P. Langlotz, “Contrastive learning of medical visual representations from paired images and text,” in Machine Learning for Healthcare Conference.   PMLR, 2022, pp. 2–25.
  • [25] B. Boecking, N. Usuyama, S. Bannur, D. C. Castro, A. Schwaighofer, S. Hyland, M. Wetscherek, T. Naumann, A. Nori, J. Alvarez-Valle et al., “Making the most of text semantics to improve biomedical vision–language processing,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI.   Springer, 2022, pp. 1–21.
  • [26] C. Wu, X. Zhang, Y. Zhang, Y. Wang, and W. Xie, “Medklip: Medical knowledge enhanced language-image pre-training,” medRxiv, pp. 2023–01, 2023.
  • [27] E. A. Krupinski, “Current perspectives in medical image perception,” Attention, Perception, & Psychophysics, vol. 72, no. 5, pp. 1205–1217, Jul 2010.
  • [28] H. L. Kundel, C. F. Nodine, E. F. Conant, and S. P. Weinstein, “Holistic component of image perception in mammogram interpretation: Gaze-tracking study,” Radiology, vol. 242, no. 2, pp. 396–402, Feb 2007.
  • [29] E. M. Kok and H. Jarodzka, “Before your very eyes: The value and limitations of eye tracking in medical education,” Med. Educ., vol. 51, no. 1, pp. 114–122, Jan 2017.
  • [30] S. Mall, E. A. Krupinski, and C. R. Mello-Thoms, “Missed cancer and visual search of mammograms: What feature based machine-learning can tell us that deep-convolution learning cannot,” in Proc. SPIE, Mar 2019, pp. 281–287.
  • [31] A. Karargyris et al., “Creation and validation of a chest x-ray dataset with eye-tracking and report dictation for ai development,” Sci. Data, vol. 8, no. 1, pp. 1–18, Mar 2021.
  • [32] Q. Men, C. Teng, L. Drukker et al., “Gaze-probe joint guidance with multi-task learning in obstetric ultrasound scanning,” Medical Image Analysis, vol. 90, p. 102981, 2023.
  • [33] C. Hsieh, C. Ouyang, J. C. Nascimento, J. Pereira, J. Jorge, and C. Moreira, “Mimic-eye: Integrating mimic datasets with reflacx and eye gaze for multimodal deep learning applications,” PhysioNet (version 1.0. 0), 2023.
  • [34] A. Johnson, T. Pollard, R. Mark, S. Berkowitz, and S. Horng, “Mimic-cxr database,” PhysioNet10, vol. 13026, p. C2JT1Q, 2019.
  • [35] A. E. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, and S. Horng, “Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs,” arXiv preprint arXiv:1901.07042, 2019.
  • [36] A. Johnson, L. Bulgarelli, T. Pollard, S. Horng, L. A. Celi, and R. Mark, “Mimic-iv,” PhysioNet. Available online at: https://physionet. org/content/mimiciv/1.0/(accessed August 23, 2021), 2020.
  • [37] A. Johnson, L. Bulgarelli, T. Pollard, L. A. Celi, R. Mark, and S. Horng, “Mimic-iv-ed (version 1.0),” PhysioNet, 2021.
  • [38] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals,” circulation, vol. 101, no. 23, pp. e215–e220, 2000.
  • [39] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya et al., “Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 590–597.
  • [40] G. Shih, C. C. Wu, S. S. Halabi, M. D. Kohli, L. M. Prevedello, T. S. Cook, A. Sharma, J. K. Amorosa, V. Arteaga, M. Galperin-Aizenberg et al., “Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia,” Radiology: Artificial Intelligence, vol. 1, no. 1, 2019.
  • [41] “SIIM-ACR pneumothorax segmentation,” 2020, [online] Available: https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation.
  • [42] K. Saab, S. M. Hooper, N. S. Sohoni, J. Parmar, B. P. Pogatchnik, S. Wu, J. Dunnmon, H. Zhang, D. L. Rubin, and C. Ré, “Observational supervision for medical image classification using gaze data.” in Medical Image Computing and Computer-Assisted Intervention, 2021.
  • [43] P. Cheng, L. Lin, J. Lyu, Y. Huang, W. Luo, and X. Tang, “Prior: Prototype representation joint learning from medical images and reports,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  • [44] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008.
  • [45] J. L. Kröger, O. H.-M. Lutz, and F. Müller, “What does your gaze reveal about you? on the privacy implications of eye tracking,” in IFIP International Summer School on Privacy and Identity Management.   Springer, 2020, pp. 226–241.
  • [46] C. Katsini, Y. Abdrabou, G. E. Raptis, M. Khamis, and F. Alt, “The role of eye gaze in security and privacy applications: Survey and future hci research directions,” in Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–21.
  • [47] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
  • [48] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, and M. McDermott, “Publicly available clinical bert embeddings,” arXiv preprint arXiv:1904.03323, 2019.
  • [49] D. M. Hansell, A. A. Bankier, H. MacMahon, T. C. McLoud, N. L. Muller, and J. Remy, “Fleischner society: glossary of terms for thoracic imaging,” Radiology, vol. 246, no. 3, pp. 697–722, 2008.

Appendix A Supplementary Materials

The supplementary document is organized as follows. In Sec. B, we provide more experimental settings, including training parameters, detailed parameters of image and text encoders. In Sec. C, we introduce the detailed information of the datasets used in this work. In Sec. D, we provide more details of multi-modal data processing of MIMIC-EYE [33] dataset. In Sec. E, we provide additional visualization results of feature representation. In Sec. F, we provide additional experiments of zero-shot classification task after continue pre-training using the backbones of other SOTA models in our EGMA framework.

Appendix B Experimental Details

B-A Image/Text Encoder

In this study, we use SwinTransformer [47] as the image encoder, BioClinicalBERT [48] as the text encoder. Specifically, we use a 4 stages SwinTransformer, including 2, 2, 6, and 2 SwinTransformer blocks. Other parameters are: patch size 4; window size 7. And we use a 6 layers BioClinicalBERT with 12 attention heads. In our EGMA framework, we add a linear projection layer after both the image encoder and text encoder to map the embeddings’ dimension to 512, and we use a learnable temperature τ𝜏\tauitalic_τ in contrastive loss calculation initialized on 0.07.

B-B Training Settings

Pre-training Settings In the pre-training process, we utilize the following image augmentations to the chest X-ray images: scale to images to 224×224224224224\times 224224 × 224; color jittering with brightness and contrast ratios from [0.8, 1.2]; randomly change the contrast(probability=0.5𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦0.5probability=0.5italic_p italic_r italic_o italic_b italic_a italic_b italic_i italic_l italic_i italic_t italic_y = 0.5). And we train our model with 50 epochs with an initial learning rate 1×1061superscript1061\times 10^{-6}1 × 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT and weight decay 1×1041superscript1041\times 10^{-4}1 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT and 10 epochs of warm-up.

Fine-tuning Settings In the supervised classification experiments, we adopt the linear classification settings [10], where the pre-trained image encoder is frozen, and only a randomly initialized linear classification head is trained. We choose the same image augmentations to the above pre-training settings. And we fine-tune our model with 30 epochs with an initial learning rate 5×1075superscript1075\times 10^{-7}5 × 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT and weight decay 1×1041superscript1041\times 10^{-4}1 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT and 6 epochs of warm-up. And all our training tasks are completed on four RTX 3090 GPUs.

Appendix C Dataset Descriptions

C-A MIMIC-EYE

The MIMIC-EYE [33] dataset includes a comprehensive range of patient information, including medical images and reports, clinical data, patient’s hospital journey, and eye-tracking data and audio of radiologists during diagnosis. The dataset comprises a total of 3689 images from the MIMIC-IV v1.0 dataset [36], each accompanied by transcripts text from audio and eye-tracking data of radiologists. In this work, we use this dataset as our training set.

C-B CheXpert

CheXpert [39] is a large-scale public dataset for chest radiograph interpretation, developed by a team from Stanford University. The dataset comprises 224,316 chest radiographic images involving 65,240 patients, annotated for the presence of 14 common chest radiographic findings [49]. These annotations are categorized into three types: positive, negative, or uncertain. In our study, we follow [10] and [24], using two subsets of this dataset, namely CheXpert 5x200 and CheXpert 8x200, for our zero-shot classification and zero-shot retrieval testing tasks. The CheXpert 5x200 dataset [10] comprises five common chest diseases, Atelectasis𝐴𝑡𝑒𝑙𝑒𝑐𝑡𝑎𝑠𝑖𝑠Atelectasisitalic_A italic_t italic_e italic_l italic_e italic_c italic_t italic_a italic_s italic_i italic_s, Cardiomegaly𝐶𝑎𝑟𝑑𝑖𝑜𝑚𝑒𝑔𝑎𝑙𝑦Cardiomegalyitalic_C italic_a italic_r italic_d italic_i italic_o italic_m italic_e italic_g italic_a italic_l italic_y, Consolidation𝐶𝑜𝑛𝑠𝑜𝑙𝑖𝑑𝑎𝑡𝑖𝑜𝑛Consolidationitalic_C italic_o italic_n italic_s italic_o italic_l italic_i italic_d italic_a italic_t italic_i italic_o italic_n, Edema𝐸𝑑𝑒𝑚𝑎Edemaitalic_E italic_d italic_e italic_m italic_a, and PleuralEffusion𝑃𝑙𝑒𝑢𝑟𝑎𝑙𝐸𝑓𝑓𝑢𝑠𝑖𝑜𝑛Pleural\ Effusionitalic_P italic_l italic_e italic_u italic_r italic_a italic_l italic_E italic_f italic_f italic_u italic_s italic_i italic_o italic_n, each with 200 chest X-rays. In [10], a radiologist provided possible sub-types, severities, and locations for these five diseases. As depicted in Tab. V, all combinations of these three types of information form the text queries for CheXpert 5x200 dataset. In the zero-shot classification task, image embeddings are compared with the embeddings of these text queries, and the class with the highest similarity is assigned as the predicted classification for the image. The CheXpert 8x200 dataset [24] comprises eight categories, NoFinding𝑁𝑜𝐹𝑖𝑛𝑑𝑖𝑛𝑔NoFindingitalic_N italic_o italic_F italic_i italic_n italic_d italic_i italic_n italic_g, Cardiomegaly𝐶𝑎𝑟𝑑𝑖𝑜𝑚𝑒𝑔𝑎𝑙𝑦Cardiomegalyitalic_C italic_a italic_r italic_d italic_i italic_o italic_m italic_e italic_g italic_a italic_l italic_y, Edema𝐸𝑑𝑒𝑚𝑎Edemaitalic_E italic_d italic_e italic_m italic_a, Pneumonia𝑃𝑛𝑒𝑢𝑚𝑜𝑛𝑖𝑎Pneumoniaitalic_P italic_n italic_e italic_u italic_m italic_o italic_n italic_i italic_a, Atelectasis𝐴𝑡𝑒𝑙𝑒𝑐𝑡𝑎𝑠𝑖𝑠Atelectasisitalic_A italic_t italic_e italic_l italic_e italic_c italic_t italic_a italic_s italic_i italic_s, Pneumothorax𝑃𝑛𝑒𝑢𝑚𝑜𝑡𝑜𝑟𝑎𝑥Pneumothoraxitalic_P italic_n italic_e italic_u italic_m italic_o italic_t italic_h italic_o italic_r italic_a italic_x, PleuralEffusion𝑃𝑙𝑒𝑢𝑟𝑎𝑙𝐸𝑓𝑓𝑢𝑠𝑖𝑜𝑛Pleural\ Effusionitalic_P italic_l italic_e italic_u italic_r italic_a italic_l italic_E italic_f italic_f italic_u italic_s italic_i italic_o italic_n, and Fracture𝐹𝑟𝑎𝑐𝑡𝑢𝑟𝑒Fractureitalic_F italic_r italic_a italic_c italic_t italic_u italic_r italic_e, each with 200 images. In [24], a radiologist expert was also invited to compose five expert queries for each category, used for image-text retrieval tasks. Specific queries are detailed in Tab. VI.

TABLE V: Examples of possible sub-types, severities, and locations provided by the radiologist in CheXpert 5x200 dataset.
Atelectasis Consolidation Pleural Effusion
severity mild increased small
minimal improved stable
apperance of
subtype subsegmental atelectasis bilateral consolidation bilateral pleural effusion
linear atelectasis reticular consolidation subpulmonic pleural effusion
trace atelectasis patchy consolidation bilateral pleural effusion
bibasilar atelectasis airspace consolidation
retrocardiac atelectasis partial consolidation
bandlike atelectasis
location at the mid lung zone at the lower lung zone left
at the upper lung zone at the upper lung zone right
at the right lung zone at the left lower lobe tiny
at the left lung zone at the right lower lobe
at the lung bases at the left upper lobe
TABLE VI: Examples of text queries for different categories in the CheXpert 8x200 dataset.
Categories Text Query
No Finding The lungs are clear.
No abnormalities are present.
The chest is normal.
No clinically significant radiographic abormalities.
No radiographically visible abnormalities in the chest.
Cardiomegaly The heart is mildly enlarged.
Cardiomegaly is present.
The heart shadow is enlarged.
The cardiac silhouette is enlarged.
Cardiac enlargement is seen.
Edema Mild interstitial pulmonary edema is present.
The presence of hazy opacity suggests interstitial pulmonary edema.
Moderate alveolar edema is present.
Mild diffuse opacity likely represents pulmonary edema.
Cardiogenic edema likely is present.
Pneumonia A consolidation at the base likely represents pneumonia.
Pneumonia is present.
The presence of air bronchograms suggest pneumonia.
A fluffy opacity suggests pneumonia.
A pulmonary opacity with ill defined borders likely represents pneumonia.
Atelectasis Platelike opacity likely represents atelectasis.
Geometric opacity likely represents atelectasis.
Atelectasis is present.
Basilar opacity and volume loss is likely due to atelectasis.
Patchy atelectasis is seen.
Pneumothorax An apical pneumothorax is present.
A basilar pneumothorax is seen.
A medial pneumothorax is present adjacent to the heart.
A lateral pleural line suggests pneumothorax.
Pleural air is present.
Pleural Effusion A pleural effusion is present.
Blunting of the costophrenic angles represents pleural effusions.
Trace pleural fluid is present.
The pleural space is partially filled with fluid.
Layering pleural effusions are present.
Fracture An angulated fracture is present.
An oblique radiolucent line suggests a fracture.
A cortical step off indicates the presence of a fracture.
A communuted displaced fracture is present.
A fracture is present.

C-C RSNA

The RSNA Pneumonia Detection Dataset [40], encompasses a comprehensive set of medical imaging data types, including X-rays, CT (Computed Tomography), and MRI (Magnetic Resonance Imaging) images. In this work, we utilized the stage 2 version of this dataset, comprising 29,700 chest X-ray images categorized into normal𝑛𝑜𝑟𝑚𝑎𝑙normalitalic_n italic_o italic_r italic_m italic_a italic_l and pneumonia𝑝𝑛𝑒𝑢𝑚𝑜𝑛𝑖𝑎pneumoniaitalic_p italic_n italic_e italic_u italic_m italic_o italic_n italic_i italic_a positive category. Following [10], we allocated 15% of this dataset for our zero-shot classification testing set. And we utilize the text queries from the “no finding” and “Pneumonia” categories in the CheXpert 8x200 dataset as the text queries for zero-shot classification in this data.

C-D SIIM-ACR

The SIIM-ACR [41] dataset is a chest dataset used for pneumothorax𝑝𝑛𝑒𝑢𝑚𝑜𝑡𝑜𝑟𝑎𝑥pneumothoraxitalic_p italic_n italic_e italic_u italic_m italic_o italic_t italic_h italic_o italic_r italic_a italic_x classification and segmentation. It consists of 2379 images with pneumothorax and 8300 images without pneumothorax. In this study, we utilized a subset of the dataset filtered by Saab et al. [42] as the test data to evaluate the zero-shot classification performance of the model for pneumothorax disease. And we utilize the text queries from the “no finding” and “Pneumothorax” categories in the CheXpert 8x200 dataset as the text queries for zero-shot classification in this data.

Appendix D Details of Multi-modal Data Processing

Refer to caption
Figure 5: The generation methods for heatmap at both word-level and sentence-level.

As illustrated in Fig. 5, the presentation of multi-modal data in the MIMIC-EYE [33] dataset includes radiologists’ audio, text transcript, eye-gaze data, and image. Since each modality is synchronized, the audio data is aligned with the eye-gaze data in time. By segmenting the audio based on the time before and after the pronunciation of each word, we can align the transcripts with the audio, thereby aligning word-level text with eye-gaze data. Subsequently, we generate attention heatmap based on eye-gaze data and images to represent the image regions the radiologist focuses on. Through the aforementioned data processing steps, we achieve precise alignment between word-level text and image regions. It is noteworthy that due to the rapid speech rate of radiologists, there may be no available eye-gaze data within the time interval corresponding to a single word. In Fig. 5, the word “with” in the transcript has no corresponding gaze data. Another common and unavoidable issue is the loss of eye-gaze data caused by blinking and intense head movement of radiologist, as seen in the last two words of the transcript. Due to these technical challenges, achieving perfect pairing between words and image regions is difficult. However, as shown in the right side of Fig. 5, adjusting the text to the sentence level largely mitigates the issue of missing word-level heatmap (Heatmap with red edge), and the semantic information of the entire sentence also encompasses the information of each word. Therefore, in this work, we process text features at the sentence level.

During the pre-training of EGMA, the size of the input heatmap is determined by the number of patches in the image. For example, after the image is processed by the image encoder, the size of image embedding is 196×768196768196\times 768196 × 768, where 196 represents the number of image patches. Therefore, we resize the heatmap directly to 14×14141414\times 1414 × 14 to match the image embedding and further process it into the Gaze-guided Similarity and Gaze-guided Label mentioned in the main manuscript.

Appendix E Additional Visualization Results

Refer to caption
Figure 6: Visualization of feature representation of CheXpert 5x200 dataset by CLIP, GLoRIA, MGCA, and our EGMA.
TABLE VII: Comparison results of zero-shot classification after continue pre-training using the backbones of other SOTA models in our EGMA framework. Red and blue denote the best and second-best results. The values in (parentheses) represents the improvement over the baseline metrics in Table 1 of main manuscript.
Method CheXpert 5x200 [39] RSNA [40] SIIM-ACR [41]
Acc.\uparrow F1\uparrow Acc.\uparrow F1\uparrow Acc.\uparrow F1\uparrow
CLIP [1] 20.30(0.2) 10.73(1.61) 34.04(9.01) 33.68(11.61) 50.19(0.8) 49.03(1.05)
GLoRIA [10] 54.40(1.1) 49.31(0.32) 49.11(19.96) 38.82(7.81) 31.07(8.5) 31.10(8.53)
MGCA [11] 50.20(6.6) 48.29(6.92) 57.08(-3.75) 40.40(-17.37) 32.65(2.62) 27.78(2.33)
EGMA(Ours) 61.30 60.38 76.97 43.49 63.62 61.46

In Fig. 6, we visualize the feature representations of CLIP [1], GLoRIA [10], MGCA [11], and our EGMA model on images of CheXpert 5x200 dataset using the t-SNE [44]. It can be observed that our model exhibits better clustering representation. The CLIP model, which was not trained on medical data, is unable to effectively differentiate these diseases. Additionally, while the representation capability of GLoRIA and MGCA has improved noticeably, their clustering performance still inferior to our EGMA.

Appendix F Additional Analysis Results

To further validate the generality of our framework, we utilize the encoders and pre-training weights provided by CLIP [1], GLoRIA [10], and MGCA [11] in our EGMA framework. Subsequently, we continue training on the MIMIC-EYE [33] dataset and present the zero-shot classification results on the CheXpert 5x200 [39], RSNA [40], and SIIM-ACR [41] datasets in Tab. VII.

We present accuracy Accuracy and F1 score metrics on three datasets. The values in parentheses indicate the improvement over the baseline metrics (as shown in Tab. II). It can be observed that, all models show improvement after training with our EGMA framework, except for the decrease in metrics for the trained MGCA model on the RSNA dataset. For MGCA, when tested with its provided pre-trained weights, it performs the best F1-score on the RSNA dataset (as shown in Tab. II), but after training with the EGMA framework, its performance improves on CheXpert 5x200 and SIIM but decreases on RSNA dataset. This may reflect that the features extracted by MGCA on RSNA dataset are not truly disease-related features but rather shortcut features, indicating that the high baseline metrics were based on easily distinguishable shortcut features. Furthermore, after training with EGMA, the performance of other models significantly improves on all three datasets.