Resources for Turkish morphological processing

Language Resources and Evaluation

Abstract

We present a set of language resources and tools—a morphological parser, a morphological disambiguator, and a text corpus—for exploiting Turkish morphology in natural language processing applications. The morphological parser is a state-of-the-art finite-state transducer-based implementation of Turkish morphology. The disambiguator is based on the averaged perceptron algorithm and has the best accuracy reported for Turkish in the literature. The text corpus has been compiled from the web and contains about 500 million tokens. This is the largest Turkish web corpus published.
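As a rough illustration of the averaged perceptron idea behind the disambiguator, the following minimal sketch ranks candidate morphological analyses of an ambiguous word. It is not the authors' implementation: the feature strings, the toy training pairs, and the function names `train_averaged_perceptron` and `predict` are all assumptions made for this example. The ambiguity modeled is the word "evin", which can be read as "your house" (P2sg+Nom) or "of the house" (Pnon+Gen).

```python
def score(weights, feats):
    """Sum the weights of the features that fire for one candidate analysis."""
    return sum(weights.get(f, 0.0) for f in feats)


def predict(weights, candidates):
    """Return the index of the highest-scoring candidate (first one on ties)."""
    return max(range(len(candidates)), key=lambda i: score(weights, candidates[i]))


def train_averaged_perceptron(examples, epochs=5):
    """Train a ranking perceptron and return the average of its weight vectors.

    Each example is (candidates, gold_index); each candidate is a list of
    feature strings. Averaging is done naively by summing the full weight
    vector after every training step, which is simple but not efficient.
    """
    weights = {}
    totals = {}
    steps = 0
    for _ in range(epochs):
        for candidates, gold in examples:
            guess = predict(weights, candidates)
            if guess != gold:
                # Reward the features of the correct analysis...
                for f in candidates[gold]:
                    weights[f] = weights.get(f, 0.0) + 1.0
                # ...and penalize the features of the wrongly chosen one.
                for f in candidates[guess]:
                    weights[f] = weights.get(f, 0.0) - 1.0
            steps += 1
            for f, w in weights.items():
                totals[f] = totals.get(f, 0.0) + w
    return {f: t / steps for f, t in totals.items()}


# Hypothetical toy data: the context feature pairs the candidate tag
# with the part of speech of the following word.
examples = [
    # "evin kapısı" (the door of the house): genitive reading is correct.
    ([["tag=P2sg+Nom", "next=Noun|tag=P2sg+Nom"],
      ["tag=Pnon+Gen", "next=Noun|tag=Pnon+Gen"]], 1),
    # "evin yandı" (your house burned): possessive reading is correct.
    ([["tag=P2sg+Nom", "next=Verb|tag=P2sg+Nom"],
      ["tag=Pnon+Gen", "next=Verb|tag=Pnon+Gen"]], 0),
]

avg_weights = train_averaged_perceptron(examples)
for candidates, gold in examples:
    print(predict(avg_weights, candidates), gold)
```

Averaging the weight vectors over all training steps, rather than keeping only the final vector, tends to reduce the influence of late, noisy updates; a real disambiguator would decode over whole sentences with a much richer feature set.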

Notes

  1. Personal communication.

  2. All resources are available at http://www.cmpe.boun.edu.tr/~hasim.

Acknowledgments

This work was supported by the Boğaziçi University Research Fund under grant numbers 06A102 and 08M103, the Scientific and Technological Research Council of Turkey (TÜBİTAK) under grant number 107E261, and the Turkish State Planning Organization (DPT) under TAM Project number 2007K120610. Murat Saraçlar is supported by the TUBA-GEBIP award. Haşim Sak is supported by TÜBİTAK BİDEB 2211. The authors would like to thank Kemal Oflazer and Deniz Yüret for the disambiguation data set.

Author information

Corresponding author

Correspondence to Haşim Sak.

About this article

Cite this article

Sak, H., Güngör, T. & Saraçlar, M. Resources for Turkish morphological processing. Lang Resources & Evaluation 45, 249–261 (2011). https://doi.org/10.1007/s10579-010-9128-6

  • DOI: https://doi.org/10.1007/s10579-010-9128-6
