Peer-to-peer information retrieval using shared-content clustering

Authors:

Udi Weinsberg

Knowledge and Information Systems, Volume 39, Issue 2

Pages 383 - 408

https://doi.org/10.1007/s10115-013-0619-9

Published: 01 May 2014

Abstract

Peer-to-peer (p2p) networks are used by millions of people to search for and download content. Clustering algorithms have recently been shown to help users find content in large networks. Yet many of these algorithms overlook the fact that p2p networks follow graph models with a power-law node degree distribution. This paper studies the clusters obtained when clustering algorithms are applied to power-law graphs, and their applicability for finding content. Driven by the observed deficiencies, a simple yet efficient clustering algorithm is proposed, which targets a relaxed optimization of a minimal distance distribution of each cluster with a size-balancing scheme. A comparative analysis using a song-similarity graph collected from 1.2 million Gnutella users reveals that commonly used efficiency measures often overlook search and recommendation applicability issues and give the wrong impression that the resulting clusters are well suited for these tasks. We show that the proposed algorithm performs well on various measures that are well suited for the domain.

Cited By

Franchi E, Poggi A, Tomaiuolo M (2016) Blogracy. International Journal of Distributed Systems and Technologies 7(2):37-56. doi:10.4018/IJDST.2016040103. Online publication date: 1-Apr-2016
https://dl.acm.org/doi/10.4018/IJDST.2016040103

Index Terms

Peer-to-peer information retrieval using shared-content clustering
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information systems applications
    1. Data mining
      1. Clustering

Index terms have been assigned to the content through auto-classification.

Published In

Knowledge and Information Systems Volume 39, Issue 2

May 2014

245 pages

ISSN:0219-1377

Copyright © 2014 Springer-Verlag London.

Publisher

Springer-Verlag

Berlin, Heidelberg
