A transformer fine-tuning strategy for text dialect identification

Published: 15 November 2022

Abstract

Online medical consultation can significantly improve the efficiency of primary health care. Many online medical question–answer services have recently been developed to connect patients with relevant medical consultants based on their questions. Because the language of a question varies with the patient's social background, identifying that background can improve the referral system: a patient can be matched with a medical consultant of similar social origin, which makes communication more efficient. This paper proposes a novel fine-tuning strategy for pre-trained transformers to identify the social origin of text authors. When fused with an existing adapter model, the proposed strategy achieves an overall accuracy of 53.96% on the Arabic dialect identification task of the Nuanced Arabic Dialect Identification (NADI) dataset. This is 0.54% higher than the previous best result on the same dataset, which demonstrates the utility of custom fine-tuning strategies for pre-trained transformer models.
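The abstract does not describe the adapter model or the fine-tuning strategy in detail. As a rough illustration only, the sketch below shows a generic bottleneck adapter of the kind commonly inserted into a frozen pre-trained transformer: a small down-projection, a non-linearity, an up-projection, and a residual connection. The class name `BottleneckAdapter`, the sizes, and the zero initialisation of the up-projection are assumptions made for this example, not the authors' implementation.

```python
import random


def relu(vector):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BottleneckAdapter:
    """Hypothetical bottleneck adapter: h + W_up * relu(W_down * h)."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        self.bottleneck_size = bottleneck_size
        # Down-projection: small random weights (assumed initialisation).
        self.w_down = [
            [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
            for _ in range(bottleneck_size)
        ]
        # Up-projection starts at zero, so the adapter is initially
        # an identity function and does not disturb the frozen model.
        self.w_up = [[0.0] * bottleneck_size for _ in range(hidden_size)]

    def forward(self, hidden_state):
        """Apply the adapter to one token's hidden state."""
        bottleneck = relu(matvec(self.w_down, hidden_state))
        update = matvec(self.w_up, bottleneck)
        # Residual connection: only the small update is learned.
        return [h + u for h, u in zip(hidden_state, update)]

    def num_parameters(self):
        """Trainable weights in the two projections (biases omitted)."""
        return 2 * self.hidden_size * self.bottleneck_size


adapter = BottleneckAdapter(hidden_size=8, bottleneck_size=2)
hidden = [0.5] * 8
output = adapter.forward(hidden)
```

In this sketch only the two small projection matrices would be trained while the transformer weights stay frozen, which is what makes adapter-style fine-tuning parameter-efficient.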

Published In

Neural Computing and Applications, Volume 35, Issue 8
Mar 2023
709 pages
ISSN:0941-0643
EISSN:1433-3058
Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 15 November 2022
Accepted: 11 October 2022
Received: 05 March 2022

Author Tags

  1. Text classification
  2. Author profiling
  3. Dialect identification
  4. Arabic language

Qualifiers

  • Research-article
