
ComicBERT: A Transformer Model and Pre-training Strategy for Contextual Understanding in Comics

  • Conference paper
  • In: Document Analysis and Recognition – ICDAR 2024 Workshops (ICDAR 2024)

Abstract

Despite the growing interest in digital comic processing, foundation models tailored to this medium remain largely unexplored. Existing methods employ multimodal sequential models with cloze-style tasks, but they fall short of human-like understanding. Addressing this gap, we introduce a novel transformer-based architecture, Comicsformer, and a comprehensive framework, ComicBERT, designed to process and understand the complex interplay of visual and textual elements in comics. Our approach uses a self-supervised objective, Masked Comic Modeling, inspired by BERT’s [6] masked language modeling objective, to train the foundation model. To fine-tune and validate our models, we adopt existing cloze-style tasks and propose new ones, such as scene-cloze, that better capture the narrative and contextual intricacies unique to comics. Preliminary experiments indicate that these tasks enhance the model’s predictive accuracy and may provide new tools for comic creators, aiding in character dialogue generation and panel sequencing. Ultimately, ComicBERT aims to serve as a universal comic processor.


Notes

  1. https://github.com/gsoykan/comicbert
  2. https://numpy.org/doc/stable/reference/generated/numpy.lexsort.html
  3. https://docs.opencv.org/4.x/d9/d61/tutorial_py_morphological_ops.html

References

  1. Agrawal, H., Mishra, A., Gupta, M., et al.: Multimodal persona based generation of comic dialogs. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14150–14164 (2023)
  2. Augereau, O., Iwata, M., Kise, K.: An overview of comics research in computer science. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 3, pp. 54–59. IEEE (2017)
  3. Brienza, C.: Producing comics culture: a sociological approach to the study of comics. J. Graph. Novels Comics 1(2), 105–119 (2010)
  4. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations (2020). arXiv preprint arXiv:2002.05709
  5. Cohn, N.: The Visual Language of Comics: Introduction to the Structure and Cognition of Sequential Images. A&C Black (2013)
  6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2019)
  7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
  8. Hendrycks, D., Gimpel, K.: Gaussian error linear units (GELUs) (2016). arXiv preprint arXiv:1606.08415
  9. Herbst, P., Chazan, D., Chen, C.L., Chieu, V.M., Weiss, M.: Using comics-based representations of teaching, and technology, to bring practice to teacher education courses. ZDM 43(1), 91–103 (2011)
  10. Iyyer, M., et al.: The amazing mysteries of the gutter: drawing inferences between panels in comic book narratives (2017)
  11. Jaiswal, A., Babu, A.R., Zadeh, M.Z., Banerjee, D., Makedon, F.: A survey on contrastive self-supervised learning. Technologies 9(1), 2 (2020)
  12. Laubrock, J., Dunst, A.: Computational approaches to comics analysis. Top. Cogn. Sci. 12(1), 274–310 (2020)
  13. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization (2017). arXiv preprint arXiv:1711.05101
  14. Nguyen, N.V., Rigaud, C., Burie, J.C.: Comic MTL: optimized multi-task learning for comic book image analysis. Int. J. Doc. Anal. Recogn. (IJDAR) 22, 265–284 (2019)
  15. Nguyen, N.-V., Rigaud, C., Revel, A., Burie, J.-C.: Manga-MMTL: multimodal multitask transfer learning for manga character analysis. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12822, pp. 410–425. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86331-9_27
  16. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2019). https://arxiv.org/abs/1908.10084
  17. Sachdeva, R., Zisserman, A.: The manga whisperer: automatically generating transcriptions for comics. CoRR abs/2401.10224 (2024). https://doi.org/10.48550/ARXIV.2401.10224
  18. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019). arXiv preprint arXiv:1910.01108
  19. Soykan, G., Yuret, D., Sezgin, T.M.: A comprehensive gold standard and benchmark for comics text detection and recognition (2022)
  20. Soykan, G., Yuret, D., Sezgin, T.M.: Identity-aware semi-supervised learning for comic character re-identification (2023)
  21. Su, Y., et al.: TaCL: improving BERT pre-training with token-aware contrastive learning (2021). arXiv preprint arXiv:2111.04198
  22. Sunder, V., Fosler-Lussier, E., Thomas, S., Kuo, H.K.J., Kingsbury, B.: Tokenwise contrastive pretraining for finer speech-to-BERT alignment in end-to-end speech-to-intent systems (2022). arXiv preprint arXiv:2204.05188
  23. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114. PMLR (2019)
  24. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
  25. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45 (2020)
  26. Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 45(10), 12113–12132 (2023). https://doi.org/10.1109/TPAMI.2023.3275156


Acknowledgments

This project is supported by Koç University & İş Bank AI Center (KUIS AI). We would like to thank KUIS AI for their support.

Author information


Corresponding author

Correspondence to Gürkan Soykan.


Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Appendices

A Data Preparation Details

A.1 Handling Characters and Associating with Speech Bubbles

First, we paired bodies and faces. The pairing algorithm works as follows: we calculate the intersection rate of each face within a body. If the rate is higher than 20%, the face is added to the candidate list. From the candidates, we select the face with the highest intersection rate. However, in some cases, multiple faces can lie within a body bounding box; in such situations, the y-coordinate of the faces is used to choose the one higher up in the body bounding box. If there is no intersection, a face or a body alone can represent the entire character instance.

After extracting character instances, we need to associate them with the detected speech bubbles. The Multi-Task Learning (MTL) model provides relation scores between all speech bubbles and all faces and bodies. We filtered out relations with scores below 0.2 and then assigned relations to character instances based on the remaining scores.

Since we had already determined which panel each component belonged to, we leveraged this information. Therefore, we filtered the relations for speech bubbles, faces, and bodies based on whether they belong to the same panel. As a result, the possibility of a speech bubble being assigned to a character outside of its panel was eliminated.

Next, we checked the relation scores for the components that make up each character (i.e., face and body). We selected the highest score associated with the character instance’s chosen speech bubble.

When associating speech bubbles with characters, each character is first assigned its highest-scoring speech bubble. Any remaining speech bubbles may then be assigned to the same character as its second or third speech bubble.

At this point, we employed the same algorithm again. However, if a speech bubble has already been assigned to a character, we added a penalty of up to 0.35 in subsequent calls for that character’s possible next assignments. The purpose of this penalty is to prevent incorrect associations: when a panel contains two characters and two speech bubbles, it is more likely that each character owns one bubble than that a single character owns both.
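To make the pairing and greedy bubble assignment concrete, here is a minimal Python sketch under stated assumptions: boxes are (x1, y1, x2, y2) tuples, the relation scores arrive pre-filtered at 0.2, and applying a fixed penalty per already-owned bubble is one possible reading of the "up to 0.35" penalty; all names are illustrative, not the actual implementation.

```python
def intersection_rate(inner, outer):
    """Fraction of `inner`'s area that lies inside `outer`; boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0


def pair_face_to_body(body, faces, threshold=0.20):
    """Pick the face for a body box: highest intersection rate above 20%,
    ties broken by the face that sits higher up (smaller y1) in the body box."""
    candidates = [(f, intersection_rate(f, body)) for f in faces]
    candidates = [(f, r) for f, r in candidates if r > threshold]
    if not candidates:
        return None  # the body alone represents the character instance
    best_rate = max(r for _, r in candidates)
    top = [f for f, r in candidates if r == best_rate]
    return min(top, key=lambda f: f[1])  # prefer the higher face on ties


def assign_speech_bubbles(char_ids, bubble_ids, rel_score, penalty=0.35):
    """Greedily assign each bubble to a character using MTL relation scores,
    penalizing characters that already own a bubble (assumed penalty schedule)."""
    owned = {c: 0 for c in char_ids}
    assignment = {}
    for b in sorted(bubble_ids,
                    key=lambda b: -max(rel_score.get((c, b), 0.0) for c in char_ids)):
        best_c = max(char_ids,
                     key=lambda c: rel_score.get((c, b), 0.0) - penalty * owned[c])
        assignment[b] = best_c
        owned[best_c] += 1
    return assignment
```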

A.2 Z-Order Calculation

To simulate the Z-order, we performed the following steps. First, we obtained the union box of the bounding boxes and then partitioned this union box into an \(n \times n\) grid. The center points of the bounding boxes are then snapped to the grid points. Next, each bounding box is considered as a point with integer values for the x and y coordinates on the grid. Then these points are sorted first based on the column values and then based on the row values (Footnote 2) to calculate the Z-order. The value of \(n\) is 4, which applies to both panels and speech bubbles.
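A minimal sketch of this Z-order computation with numpy.lexsort (see Footnote 2), assuming boxes given as (x1, y1, x2, y2) rows; treating the row as the primary sort key (top-to-bottom, then left-to-right reading order) is our interpretation of the sorting.

```python
import numpy as np

def z_order(boxes, n=4):
    """Return indices of `boxes` in approximate reading order: snap box centers
    to an n x n grid over their union box, then sort the grid coordinates."""
    boxes = np.asarray(boxes, dtype=float)            # (N, 4) as (x1, y1, x2, y2)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    ux1, uy1 = boxes[:, 0].min(), boxes[:, 1].min()   # union bounding box
    ux2, uy2 = boxes[:, 2].max(), boxes[:, 3].max()
    col = np.clip(((cx - ux1) / max(ux2 - ux1, 1e-6) * n).astype(int), 0, n - 1)
    row = np.clip(((cy - uy1) / max(uy2 - uy1, 1e-6) * n).astype(int), 0, n - 1)
    # np.lexsort treats its last key as the primary key: sort by row, then column.
    return np.lexsort((col, row))
```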

A.3 Component Association with Panels

Every detected component is assigned to a panel using the following logic: their intersection rate with each panel is calculated, along with their distance based on the center coordinates. Then, the panel with the maximum intersection rate is chosen for assignment. In cases where a bounding box has the same intersection rate with multiple panels, the one with the minimum center distance is selected. If the intersection rate is below 0.2, then that bounding box is left unassigned and is not used for our experiments.
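A small sketch of this assignment, reusing the intersection_rate helper from the A.1 sketch above; the Euclidean center distance and the box format are assumptions.

```python
def assign_to_panel(box, panels, min_rate=0.2):
    """Assign a component box to the panel with the largest intersection rate;
    break exact ties by the smallest center distance; return None below 0.2."""
    def center(b):
        return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)

    def center_dist(a, b):
        (ax, ay), (bx, by) = center(a), center(b)
        return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

    scored = [(intersection_rate(box, p), -center_dist(box, p), i)
              for i, p in enumerate(panels)]
    rate, _, idx = max(scored)
    return idx if rate >= min_rate else None
```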

Fig. 3. A sample comic page demonstrating the finalized inference process of the MTL model for the data preparation step of Comicsformer. The panels are Z-ordered, and speech bubbles within a panel are displayed with their order indicated at the bottom left. A character’s components share the same color, and speech bubble associations are shown with lines originating from their centers.

A.4 OCR and Text Extraction

To extract text from comic books, the speech bubble segmentations from the Multi-Task Learning (MTL) model are utilized. Instead of directly using the text from Comics Text+, which is obtained from speech bubble detections, we chose to use segmentations. This choice improves text extraction because the image passed to the Optical Character Recognition (OCR) models ideally represents the pure content of the speech bubble, without background or overlapping speech bubbles. As a result, the extracted text is of even higher quality than Comics Text+, which already outperforms the original COMICS dataset’s texts.

The segmentation mask is extracted for every speech bubble instance if it surpasses the 0.6 score threshold. Subsequently, morphological transformations are applied to the binary mask using the cv2 library (Footnote 3) to smooth out the edges and fill small gaps. Some examples can be seen in Fig. 4.
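A possible implementation of this mask clean-up with OpenCV (see Footnote 3); the 0.6 score threshold comes from the text, while the kernel shape/size and the choice of closing followed by opening are assumptions.

```python
import cv2
import numpy as np

def clean_bubble_mask(mask_prob, score, score_thr=0.6, kernel_size=5):
    """Binarize a speech-bubble segmentation mask and smooth its edges with
    morphological closing (fills small gaps) and opening (removes small specks)."""
    if score < score_thr:
        return None  # instances below the 0.6 score threshold are skipped
    binary = (mask_prob > 0.5).astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    return cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)
```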

Fig. 4. Examples of speech bubble segmentations used as input to the Comics Text+ OCR models. Utilizing speech bubble segmentations instead of detections is particularly useful for irregularly shaped speech bubbles.

B Analysis Based on Word Frequencies

For a brief textual analysis, we share the word frequency distributions. The most frequent words in speech and narrative text can be observed in Figs. 6 and 5, respectively.

Fig. 5. Frequency distribution of the top 20 most common words, excluding stopwords and punctuation, in narrative boxes.

Fig. 6. Frequency distribution of the top 20 most common words, excluding stopwords and punctuation, in speech bubbles.

C Details of Scoring and Option Embeddings

We first explain the scoring process and then describe how option embeddings are obtained for each task. Given the method of context embedding extraction, the scoring process for an option is as follows, where \(\textbf{O}\) denotes the option embedding and \(\textbf{c}\) the context embedding:

$$ \text{option\_logit} = \textbf{O} \cdot \textbf{c} $$

Since we perform L2 normalization on both the option and the context vectors, the option logit equals the cosine of the angle between those two vectors:

$$ \text{option\_logit} = \cos(\theta) $$

Given that cross-entropy is used as the loss function and that every cloze-style task has three options, the final scores are calculated as follows:

$$ \text{scores} = \text{softmax}([\text{option\_logit}_0, \text{option\_logit}_1, \text{option\_logit}_2]) $$
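In code, the scoring reduces to dot products between L2-normalized embeddings followed by a softmax over the three options. A minimal PyTorch sketch (tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def score_options(context, options):
    """context: (d,) context embedding; options: (3, d) option embeddings.
    After L2 normalization the dot product equals cos(theta)."""
    c = F.normalize(context, dim=-1)
    o = F.normalize(options, dim=-1)
    option_logits = o @ c                  # (3,) cosine similarities
    return F.softmax(option_logits, dim=-1)

# During training, the cross-entropy loss is applied over these option logits.
```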

The process for extracting option embeddings for each task is as follows:

  • Text-Cloze: The final panel is encoded and fed to the Comicsformer. The output feature at the position of the speech bubble, which is the third position (the final panel is assumed to contain a single speech bubble text for text-cloze), is taken and processed with the output projector, which has the same architecture as the context projector, i.e., a linear layer followed by L2 normalization. Depending on the model type, encoded panel features are masked; for instance, for the panel-only (image-only) model, character input features are masked in the input.

  • Visual-Cloze: Similar to Text-Cloze, but with the textual (speech bubble, narrative) input features of the final panel masked. Since the task only requires excluding textual information from the final panel, the character modality, when involved, is not masked in the input. However, only the output feature of the panel token is used from the outputs.

  • Scene-Cloze: Scene-Cloze imposes no restrictions, so all modalities of the model type are used. As in context extraction, mean pooling is applied to obtain the final output features before the output projector.

  • Character Coherence: Similar to Scene-Cloze, all available modality information is used in the input of the Comicsformer. However, only the character and speech output features are used during the mean pooling operation. Option construction differs from the other tasks, whose options use information from other panels: the correct option uses the panel information as is, whereas for the wrong option, the input token features of the first and second speech bubbles are swapped.

  • Contextual Character to Speech Attribution: This task differs from the others as it does not use the Comicsformer architecture for options. Instead, context features are fused with the character projections coming from pre-trained models. Their similarity with speech projections is measured after both projections are processed through a linear layer and L2 normalization.

Proper masking is applied for each model type: based on the modalities available during context and option feature extraction, the remaining modalities are masked (the sketch after the list summarizes these masks). The model types are as follows:

  • Text-only: Uses only the text modality (speech and narrative).

  • Char-only: Uses only character modality.

  • Image-only (Panel-only): In the original task definition, this was called Image-only when the model uses only panel visual features. However, since we use character information based on the characters’ images, the name is changed to Panel-only.

  • Image-Text (Panel-Text): Uses panel visual information and text modality.

  • Panel-Char: Uses panel visual information and character visual features.

  • Char-Text: Uses character visual features with text features.

  • Panel-Char-Text: Uses all available information from the three modalities.
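As noted above, the masking scheme can be summarized as a simple lookup over the modalities each model type keeps unmasked; a hypothetical sketch, with illustrative modality names only.

```python
# Modalities left unmasked per model type; all other input tokens are masked
# during context and option feature extraction (names are illustrative).
MODALITY_MASKS = {
    "text-only":       {"speech", "narrative"},
    "char-only":       {"character"},
    "panel-only":      {"panel"},
    "panel-text":      {"panel", "speech", "narrative"},
    "panel-char":      {"panel", "character"},
    "char-text":       {"character", "speech", "narrative"},
    "panel-char-text": {"panel", "character", "speech", "narrative"},
}

def is_masked(model_type: str, modality: str) -> bool:
    """True if tokens of `modality` should be masked for `model_type`."""
    return modality not in MODALITY_MASKS[model_type]
```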

D ComicBERT Pretraining Results

Table 5 presents the pretraining results.

Table 5. Pretraining Results of the Masked Comic Modeling (MCM) Task for ComicBERT. Two distinct settings were employed during the experiments. The upper setting involved fewer sequences and shorter training duration, optimizing for time efficiency (fast setting). The fast setting was utilized for most experiments, except for the results in Table 4. For each setting, the batch size was 128.

E No-Context (NC) Models

The No-Context (NC) model is a model type proposed by Iyyer et al. [10] to measure the impact of context. The NC model has only been employed with the Panel-Text (Image-Text) model and can be regarded as an ablation study specific to context. For this model, the visual features of the final panel \(p_i\) are used in place of the context.

F Masked Comic Modeling Algorithm

The main learning algorithm for Masked Comic Modeling (MCM) is detailed in Algorithm 1.

Algorithm 1. Main learning algorithm of Masked Comic Modeling (MCM).

G Obtaining Sequential Context Embedding from Comicsformer

Figure 7 illustrates the process of context extraction from Comicsformer.

Fig. 7. Obtaining the context embedding with Comicsformer from comic sequences. The transformer encoder outputs at the core of Comicsformer are fused using mean pooling; outputs corresponding to padding tokens are not considered. The mean-pooled output is passed through a linear layer (in the experiments, the input-output dimensions are \(384 \times 256\)). Finally, L2 normalization is applied, which lets us directly obtain the cosine similarity since downstream tasks use the dot product.
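A minimal PyTorch sketch of this pooling-and-projection step; the 384-to-256 projection comes from the caption, while the tensor layout and module structure are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextProjector(nn.Module):
    """Mean-pool transformer encoder outputs (ignoring padding), project, L2-normalize."""
    def __init__(self, in_dim=384, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, hidden, pad_mask):
        # hidden: (B, T, in_dim); pad_mask: (B, T), True at padding positions
        keep = (~pad_mask).unsqueeze(-1).float()                   # (B, T, 1)
        pooled = (hidden * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return F.normalize(self.proj(pooled), dim=-1)              # (B, out_dim)
```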

H ComicBERT Pre-Training Details

During the ComicBERT pre-training phase, we employed two distinct settings, namely slow and fast, based on the duration of training. The fast setting was trained for 50 epochs on a smaller dataset and completed within approximately 24 hours on an NVIDIA V100 GPU. In contrast, the slow setting took nearly a week on the same GPU. A substantial difference is evident in the pre-training outcomes of the two settings (see Appendix D), but this contrast is not equally reflected in the results of the cloze-style tasks. The results presented in Table 4 were achieved with ComicBERT trained in the slow setting; nevertheless, its contribution over the fast setting remains marginal. Sharing the results of both experimental settings could guide future research involving similar experiments, saving time and serving as a reference.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Soykan, G., Yuret, D., Sezgin, T.M. (2024). ComicBERT: A Transformer Model and Pre-training Strategy for Contextual Understanding in Comics. In: Mouchère, H., Zhu, A. (eds) Document Analysis and Recognition – ICDAR 2024 Workshops. ICDAR 2024. Lecture Notes in Computer Science, vol 14935. Springer, Cham. https://doi.org/10.1007/978-3-031-70645-5_16


  • DOI: https://doi.org/10.1007/978-3-031-70645-5_16


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70644-8

  • Online ISBN: 978-3-031-70645-5

