research-article

A transformer fine-tuning strategy for text dialect identification

Authors:

Mohammad Ali Humayun,

Abdullah Alourani,

Pg Emeroylariffion Abas

Neural Computing and Applications, Volume 35, Issue 8

Pages 6115 - 6124

https://doi.org/10.1007/s00521-022-07944-5

Published: 15 November 2022

Abstract

Online medical consultation can significantly improve the efficiency of primary health care. Many online medical question–answer services have recently been developed that connect patients with relevant medical consultants based on their questions. Given the linguistic variety in these questions, identifying a patient's social background can improve the referral system by selecting a medical consultant of similar social origin for more efficient communication. This paper proposes a novel fine-tuning strategy for pre-trained transformers to identify the social origin of text authors. When fused with the existing adapter model, the proposed strategy achieves an overall accuracy of 53.96% on the Arabic dialect identification task using the Nuanced Arabic Dialect Identification (NADI) dataset. This accuracy is 0.54% higher than the previous best on the same dataset, which demonstrates the utility of custom fine-tuning strategies for pre-trained transformer models.


Published In

Neural Computing and Applications Volume 35, Issue 8

Mar 2023

709 pages

ISSN:0941-0643

EISSN:1433-3058

© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 15 November 2022

Accepted: 11 October 2022

Received: 05 March 2022
