Abstract
We describe several experiments aimed at automatically identifying idiomatic expressions in written text. We explore two approaches to the task: (1) idiom recognition as outlier detection and (2) supervised classification of sentences. For outlier detection we apply principal component analysis. Because detecting idioms as lexical outliers does not exploit class label information, in the subsequent experiments we use linear discriminant analysis to obtain a discriminant subspace and then apply a three-nearest-neighbor classifier in that subspace to measure classification accuracy. We discuss the pros and cons of each approach. Both approaches are more general than previous idiom detection algorithms: they do not rely on target idiom types, lexicons, or large manually annotated corpora, and they do not restrict the search space to a particular type of linguistic construction.
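The two approaches named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Everything concrete in it is an assumption made for illustration: the two-dimensional toy vectors standing in for sentence representations, the use of squared residual from the leading principal component as the outlier score, power iteration as the way to find that component, and the two-class Fisher form of linear discriminant analysis. Only the Python standard library is used.

```python
import math

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def leading_component(rows, iters=200):
    """Mean and leading principal direction, found by power iteration
    on the sample covariance matrix (assumes the start vector is not
    orthogonal to that direction)."""
    mu = mean_vector(rows)
    centered = [[x - m for x, m in zip(r, mu)] for r in rows]
    d = len(mu)
    cov = [[sum(r[i] * r[j] for r in centered) / (len(rows) - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [dot(cov[i], v) for i in range(d)]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return mu, v

def outlier_scores(rows, mu, v):
    """Squared distance of each point from the leading principal axis."""
    scores = []
    for r in rows:
        c = [x - m for x, m in zip(r, mu)]
        scores.append(dot(c, c) - dot(c, v) ** 2)
    return scores

def fisher_direction(x0, x1):
    """Two-class Fisher discriminant for 2-D data: w = Sw^{-1} (m1 - m0)."""
    m0, m1 = mean_vector(x0), mean_vector(x1)
    s = [[0.0, 0.0], [0.0, 0.0]]  # pooled within-class scatter
    for rows, m in ((x0, m0), (x1, m1)):
        for r in rows:
            c = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += c[i] * c[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [(s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
            (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det]

def knn3(train_z, train_y, z):
    """Majority vote among the three nearest projected training points."""
    nearest = sorted(zip(train_z, train_y), key=lambda p: abs(p[0] - z))[:3]
    return 1 if sum(y for _, y in nearest) >= 2 else 0

# Approach 1: unsupervised, no labels. The last toy point lies off the main trend.
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1], [3.0, 0.0]]
mu, v = leading_component(points)
scores = outlier_scores(points, mu, v)
print("most outlying index:", scores.index(max(scores)))

# Approach 2: supervised. Hypothetical labels: 0 = literal, 1 = idiomatic.
literal = [[1.0, 2.0], [2.0, 3.1], [3.0, 3.9], [4.0, 5.2]]
idiomatic = [[1.0, 0.1], [2.0, 0.9], [3.0, 2.1], [4.0, 2.9]]
w = fisher_direction(literal, idiomatic)
train_z = [dot(w, x) for x in literal + idiomatic]
train_y = [0] * len(literal) + [1] * len(idiomatic)
pred_literal = knn3(train_z, train_y, dot(w, [2.5, 3.4]))
pred_idiomatic = knn3(train_z, train_y, dot(w, [2.5, 1.6]))
print("predictions:", pred_literal, pred_idiomatic)
```

The contrast the abstract draws shows up in the function signatures: the outlier scorer never sees a label, while the discriminant direction is computed from the class means and within-class scatter, so the nearest-neighbor vote runs in a subspace chosen to separate the classes.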
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Feldman, A., Peng, J. (2013). Automatic Detection of Idiomatic Clauses. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_35
DOI: https://doi.org/10.1007/978-3-642-37247-6_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37246-9
Online ISBN: 978-3-642-37247-6
eBook Packages: Computer Science, Computer Science (R0)