Abstract
We present a set of language resources and tools—a morphological parser, a morphological disambiguator, and a text corpus—for exploiting Turkish morphology in natural language processing applications. The morphological parser is a state-of-the-art finite-state transducer-based implementation of Turkish morphology. The disambiguator is based on the averaged perceptron algorithm and has the best accuracy reported for Turkish in the literature. The text corpus has been compiled from the web and contains about 500 million tokens. This is the largest Turkish web corpus published.
Notes
Personal communication.
All resources are available at http://www.cmpe.boun.edu.tr/~hasim.
Acknowledgments
This work was supported by the Boğaziçi University Research Fund under grant numbers 06A102 and 08M103, the Scientific and Technological Research Council of Turkey (TÜBİTAK) under grant number 107E261, and the Turkish State Planning Organization (DPT) under TAM Project number 2007K120610. Murat Saraçlar is supported by the TUBA-GEBIP award. Haşim Sak is supported by TÜBİTAK BİDEB 2211. The authors would like to thank Kemal Oflazer and Deniz Yüret for the disambiguation data set.
Cite this article
Sak, H., Güngör, T. & Saraçlar, M. Resources for Turkish morphological processing. Lang Resources & Evaluation 45, 249–261 (2011). https://doi.org/10.1007/s10579-010-9128-6
DOI: https://doi.org/10.1007/s10579-010-9128-6