A BERT-Based Two-Stage Model for Chinese Chengyu Recommendation

Published: 12 August 2021

Abstract

In Chinese, Chengyu are fixed phrases consisting of four characters. As a type of idiom, their meanings usually cannot be derived from their component characters. In this article, we study the task of recommending a Chengyu given a textual context. Observing some limitations of existing work, we propose a two-stage model: in the first stage, we re-train a Chinese BERT model by masking out Chengyu in a large Chinese corpus with wide Chengyu coverage; in the second stage, we fine-tune the re-trained, Chengyu-oriented BERT on a specific Chengyu recommendation dataset. We evaluate this method on the ChID and CCT datasets and find that it achieves state-of-the-art performance on both. Ablation studies show that both training stages are critical to the performance gain.
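
The sketch below illustrates the two-stage recipe described in the abstract, using the HuggingFace transformers library. The bert-base-chinese checkpoint, the whole-idiom masking helper, the "####" blank placeholder, and the character-level candidate-scoring heuristic are our own illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch of the two-stage recipe (not the authors' code).
# Assumptions: HuggingFace `transformers`, the `bert-base-chinese` checkpoint,
# a "####" placeholder marking the Chengyu blank, character-level scoring.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()


def mask_chengyu(text: str, chengyu: str):
    """Stage 1: mask every character of a Chengyu occurrence for MLM re-training.

    Returns (input_ids, labels); labels are -100 everywhere except the idiom
    positions, so only the masked-out idiom contributes to the MLM loss.
    """
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    labels = torch.full_like(input_ids, -100)
    idiom_ids = tokenizer.convert_tokens_to_ids(list(chengyu))
    n = len(idiom_ids)
    # Locate the idiom's character positions and replace them with [MASK].
    for start in range(input_ids.size(1) - n + 1):
        if input_ids[0, start:start + n].tolist() == idiom_ids:
            labels[0, start:start + n] = input_ids[0, start:start + n]
            input_ids[0, start:start + n] = tokenizer.mask_token_id
            break
    return input_ids, labels


def score_candidate(context: str, candidate: str) -> float:
    """Stage 2 (cloze-style recommendation): score a candidate Chengyu as the
    summed log-probability of its characters at the masked blank positions."""
    text = context.replace("####", tokenizer.mask_token * len(candidate))
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        log_probs = model(**enc).logits.log_softmax(dim=-1)
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    cand_ids = tokenizer.convert_tokens_to_ids(list(candidate))
    return sum(log_probs[0, p, i].item() for p, i in zip(mask_pos, cand_ids))


# Recommend the best-fitting Chengyu for a blanked context.
context = "他做事一向####，从不拖泥带水。"
candidates = ["雷厉风行", "画蛇添足", "守株待兔"]
print(max(candidates, key=lambda c: score_candidate(context, c)))
```

In this sketch the four masked positions are scored independently in a single forward pass; a fine-tuned, Chengyu-oriented BERT as described in the abstract would make these joint character predictions far more reliable than the off-the-shelf checkpoint used here.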



Index Terms

  1. A BERT-Based Two-Stage Model for Chinese Chengyu Recommendation

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 6
    November 2021
    439 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3476127

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 August 2021
    Accepted: 01 February 2021
    Revised: 01 January 2021
    Received: 01 March 2020
    Published in TALLIP Volume 20, Issue 6


    Author Tags

    1. Question answering
    2. Chengyu recommendation
    3. Idiom understanding

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Research Foundation, Singapore
    • International Research Centres in Singapore Funding Initiative


    Cited By

    • (2024) Semantics of Multiword Expressions in Transformer-Based Models: A Survey. Transactions of the Association for Computational Linguistics 12, 593–612. https://doi.org/10.1162/tacl_a_00657 (30-Apr-2024)
    • (2023) Retrospective Multi-granularity Fusion Network for Chinese Idiom Cloze-style Reading Comprehension. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 7, 1–20. https://doi.org/10.1145/3603370 (20-Jul-2023)
    • (2023) Text Polishing with Chinese Idiom: Task, Datasets and Pre-trained Baselines. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 6, 1–24. https://doi.org/10.1145/3593806 (19-Jun-2023)
    • (2023) Geometry Arithmetic Problem Recommendation Based on Scene-Enhanced BERT. In 2023 International Conference on Intelligent Education and Intelligent Research (IEIR), 1–6. https://doi.org/10.1109/IEIR59294.2023.10391228 (5-Nov-2023)
    • (2023) A Prompt-Based Representation Individual Enhancement Method for Chinese Idiom Reading Comprehension. In Database Systems for Advanced Applications, 682–698. https://doi.org/10.1007/978-3-031-30675-4_50 (15-Apr-2023)
    • (2022) Design of Brand Business Model Based on Big Data and Internet of Things Technology Application. Computational Intelligence and Neuroscience 2022. https://doi.org/10.1155/2022/9189805 (1-Jan-2022)
    • (2022) A Chinese Named Entity Recognition Model of Maintenance Records for Power Primary Equipment Based on Progressive Multitype Feature Fusion. Complexity 2022, 1–11. https://doi.org/10.1155/2022/8114217 (7-Feb-2022)
    • (2022) Waveform Feature Extraction of Intelligent Singing Skills under the Background of Internet of Things. Mobile Information Systems 2022. https://doi.org/10.1155/2022/4638801 (28-Jun-2022)
