Retrieval in text collections with historic spelling using linguistic and spelling variants

Published: 18 June 2007 Publication History

Abstract

We present a new approach to retrieving texts with non-standard spelling, a problem that matters for historic texts in languages such as English or German. This paper describes the overall architecture of our system and then evaluates it. Given a search term as a lemma, we use a dictionary of contemporary German to find all inflected and derived forms of that lemma. We then apply transformation rules, derived from training data, to generate historic spelling variants. The evaluation measures the resulting retrieval quality, and the experimental results show that the approach substantially improves retrieval quality for historic collections.
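
The two-stage expansion described in the abstract (lemma to inflected forms, then forms to historic spelling variants) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy lexicon stands in for the contemporary German dictionary, the two rewrite rules are hand-written rather than derived from training data, and the function names are hypothetical. It is not the authors' implementation.

```python
from itertools import combinations

# Toy stand-in for a contemporary-German dictionary (assumption):
# lemma -> inflected and derived forms.
TOY_LEXICON = {
    "teil": ["teil", "teile", "teilen", "teils"],
}

# Hypothetical rewrite rules (modern substring -> historic substring).
# The paper derives its rules from training data; these are hand-picked.
TOY_RULES = [("t", "th"), ("ei", "ey")]


def match_sites(word, rules):
    """List every (start, end, replacement) where a rule matches the word."""
    sites = []
    for modern, historic in rules:
        start = word.find(modern)
        while start != -1:
            sites.append((start, start + len(modern), historic))
            start = word.find(modern, start + 1)
    return sorted(sites)


def historic_variants(word, rules, max_rules=2):
    """Apply up to max_rules non-overlapping rule matches to the modern word."""
    sites = match_sites(word, rules)
    variants = {word}
    for k in range(1, max_rules + 1):
        for combo in combinations(sites, k):
            # Sites are sorted by start, so checking neighbours is enough.
            if any(a[1] > b[0] for a, b in zip(combo, combo[1:])):
                continue
            pieces, pos = [], 0
            for start, end, historic in combo:
                pieces.append(word[pos:start])
                pieces.append(historic)
                pos = end
            pieces.append(word[pos:])
            variants.add("".join(pieces))
    return variants


def expand_query(lemma, lexicon=TOY_LEXICON, rules=TOY_RULES):
    """Expand a lemma into all forms and their candidate historic spellings."""
    terms = set()
    for form in lexicon.get(lemma, [lemma]):
        terms |= historic_variants(form, rules)
    return sorted(terms)
```

Calling `expand_query("teil")` yields the modern forms together with candidates such as `theil` and `theyl`, which could then be combined into a single disjunctive query. Matching rules only against the modern form, rather than re-applying them to generated variants, keeps a rule like `t -> th` from firing on its own output.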

    Published In

    JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
    June 2007
    534 pages
    ISBN:9781595936448
    DOI:10.1145/1255175

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. historic documents
    2. rule-based search
    3. spelling variants

    Qualifiers

    • Article

    Conference

    JCDL07
    JCDL07: Joint Conference on Digital Libraries
    June 18 - 23, 2007
    Vancouver, BC, Canada

    Acceptance Rates

    Overall Acceptance Rate 415 of 1,482 submissions, 28%

    Cited By

    • (2018) Cross-Lingual and Cross-Chronological Information Access to Multilingual Historical Documents. Multilingualism and Bilingualism. DOI: 10.5772/intechopen.72421. Online publication date: 30-May-2018
    • (2018) Lemmatization for Ancient Languages: Rules or Neural Networks? Artificial Intelligence and Natural Language, pp. 35-47. DOI: 10.1007/978-3-030-01204-5_4. Online publication date: 27-Sep-2018
    • (2017) Journeys of the Past. Proceedings of the 11th Workshop on Geographic Information Retrieval, pp. 1-10. DOI: 10.1145/3155902.3155906. Online publication date: 30-Nov-2017
    • (2016) A depth-first branch-and-bound algorithm for geocoding historic itinerary tables. Proceedings of the 10th Workshop on Geographic Information Retrieval, pp. 1-10. DOI: 10.1145/3003464.3003467. Online publication date: 31-Oct-2016
    • (2016) Accounting for Language Changes Over Time in Document Similarity Search. ACM Transactions on Information Systems, 35(1), pp. 1-26. DOI: 10.1145/2934671. Online publication date: 3-Sep-2016
    • (2016) Information retrieval from historical newspaper collections in highly inflectional languages. Journal of the Association for Information Science and Technology, 67(12), pp. 2928-2946. DOI: 10.1002/asi.23379. Online publication date: 1-Dec-2016
    • (2015) Geocoding place names from historic route descriptions. Proceedings of the 9th Workshop on Geographic Information Retrieval, pp. 1-2. DOI: 10.1145/2837689.2837698. Online publication date: 26-Nov-2015
    • (2015) Visions and open challenges for a knowledge-based culturomics. International Journal on Digital Libraries, 15(2-4), pp. 169-187. DOI: 10.1007/s00799-015-0139-1. Online publication date: 1-Apr-2015
    • (2013) On the applicability of word sense discrimination on 201 years of modern English. International Journal on Digital Libraries, 13(3-4), pp. 135-153. DOI: 10.1007/s00799-013-0105-8. Online publication date: 1-Sep-2013
    • (2012) Developing a Digital Library of Historical Records in Traditional Mongolian Script. International Journal of Digital Library Systems, 3(1), pp. 33-52. DOI: 10.4018/jdls.2012010103. Online publication date: 1-Jan-2012
