Peer-to-peer information retrieval using shared-content clustering

Published: 01 May 2014

Abstract

Peer-to-peer (p2p) networks are used by millions of people to search for and download content. Clustering algorithms have recently been shown to help users find content in large networks. Yet many of these algorithms overlook the fact that p2p networks follow graph models with a power-law node degree distribution. This paper studies the clusters obtained when clustering algorithms are applied to power-law graphs, and their applicability for finding content. Driven by the observed deficiencies, a simple yet efficient clustering algorithm is proposed, which targets a relaxed optimization of a minimal distance distribution of each cluster together with a size-balancing scheme. A comparative analysis using a song-similarity graph collected from 1.2 million Gnutella users reveals that commonly used efficiency measures often overlook search and recommendation applicability issues and give the wrong impression that the resulting clusters are well suited to these tasks. We show that the proposed algorithm performs well on various measures that are well suited to the domain.

Published In

Knowledge and Information Systems, Volume 39, Issue 2
May 2014
245 pages

Publisher

Springer-Verlag
Berlin, Heidelberg

Author Tags

1. Clustering
2. Data mining
3. Peer-to-peer
4. Recommender systems
