research-article

Information asymmetry in Wikipedia across different languages: A statistical analysis

Published: 07 February 2022

Abstract

Wikipedia is the largest web-based open encyclopedia, covering more than 300 languages. Its language editions differ significantly in their information coverage. In this article, we compare the information coverage of English Wikipedia (the most exhaustive edition) with that of Wikipedias in eight other widely spoken languages: Arabic, German, Hindi, Korean, Portuguese, Russian, Spanish, and Turkish. We analyze how the editions vary in the number of topics covered and in the amount of information provided about each topic. Further, as a step toward bridging the information gap, we present WikiCompare, a browser plugin that gives Wikipedia readers a comprehensive overview of a topic by incorporating missing information from the corresponding Wikipedia pages in other languages.

Cited By

  • (2023) Automatic Quality Assessment of Wikipedia Articles—A Systematic Literature Review. ACM Computing Surveys, 56:4 (1-37). DOI: 10.1145/3625286. Online publication date: 10-Nov-2023
  • (2023) Detecting Cross-Lingual Information Gaps in Wikipedia. Companion Proceedings of the ACM Web Conference 2023 (581-585). DOI: 10.1145/3543873.3587539. Online publication date: 30-Apr-2023

Published In

Journal of the Association for Information Science and Technology, Volume 73, Issue 3
March 2022
147 pages
ISSN: 2330-1635
EISSN: 2330-1643
DOI: 10.1002/asi.v73.3

Publisher

John Wiley & Sons, Inc.

United States
