DOI: 10.1145/2488551.2488579
research-article

Efficient parallel construction of suffix trees for genomes larger than main memory

Published: 15 September 2013

Abstract

The construction of suffix trees for very long sequences is essential for many applications and plays a central role in bioinformatics. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. The methodologies required to analyze these data have also become increasingly complex, requiring fast queries across multiple genomes. In this paper we present Parallel Continuous Flow (PCF), a parallel suffix tree construction method suitable for very long strings. We tested our method by constructing the suffix tree of the entire human genome, about 3 GB. We show that PCF scales gracefully as the size of the input string grows. Our method achieves an efficiency of 90% with 36 processors and 55% with 172 processors, and it can index the human genome in 7 minutes using 172 nodes.
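
To put the efficiency figures in perspective, the short sketch below converts them into speedups using the standard definition of parallel efficiency, E = S / p. The abstract does not say which baseline the efficiencies were measured against, so the function names and the "implied single-processor time" are illustrative assumptions, not results reported by the authors.

```python
# Illustrative arithmetic only: assumes the usual definition E = S / p,
# where S is speedup over some baseline run and p is the processor count.
# The baseline used in the paper is not stated in the abstract.

def speedup(efficiency: float, processors: int) -> float:
    """Speedup implied by E = S / p, i.e. S = E * p."""
    return efficiency * processors


def implied_baseline_minutes(parallel_minutes: float,
                             efficiency: float,
                             processors: int) -> float:
    """Baseline running time implied by a parallel time and an efficiency.

    Hypothetical helper: it treats 'nodes' and 'processors' as the same
    count, which the abstract does not confirm.
    """
    return parallel_minutes * speedup(efficiency, processors)


if __name__ == "__main__":
    # Efficiency figures quoted in the abstract.
    for p, e in [(36, 0.90), (172, 0.55)]:
        print(f"{p} processors at {e:.0%} efficiency: "
              f"speedup of about {speedup(e, p):.1f}x")

    # 7 minutes on 172 nodes, under the assumptions stated above.
    estimate = implied_baseline_minutes(7, 0.55, 172)
    print(f"Implied baseline time: about {estimate / 60:.0f} hours")
```

Under these assumptions the quoted efficiencies correspond to speedups of roughly 32x on 36 processors and 95x on 172 processors, so the 7-minute run would stand in for a baseline on the order of hours.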


Published In

EuroMPI '13: Proceedings of the 20th European MPI Users' Group Meeting
September 2013
289 pages
ISBN:9781450319034
DOI:10.1145/2488551

Sponsors

  • ARCOS: Computer Architecture and Technology Area, Universidad Carlos III de Madrid

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. parallel algorithms
  2. suffix tree
  3. whole genome indexing

Qualifiers

  • Research-article

Conference

EuroMPI '13
Sponsor:
  • ARCOS
EuroMPI '13: 20th European MPI Users' Group Meeting
September 15 - 18, 2013
Madrid, Spain

Acceptance Rates

EuroMPI '13 Paper Acceptance Rate 22 of 47 submissions, 47%;
Overall Acceptance Rate 66 of 139 submissions, 47%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 2
  • Downloads (last 6 weeks): 0
Reflects downloads up to 17 Oct 2024

Citations

Cited By

  • (2024) An Average-Case Efficient Two-Stage Algorithm for Enumerating All Longest Common Substrings of Minimum Length $k$ Between Genome Pairs. 2024 IEEE 12th International Conference on Healthcare Informatics (ICHI), pages 93-102. DOI: 10.1109/ICHI61247.2024.00020. Online publication date: 3-Jun-2024
  • (2023) Constructing Generalized Suffix Trees on Distributed Parallel Platforms. Cybernetics and Systems Analysis, 59:1, pages 49-60. DOI: 10.1007/s10559-023-00541-x. Online publication date: 22-Feb-2023
  • (2022) Lempel-Ziv-Welch (LZW) based Horizontally Scalable Route Prediction. 2022 International Conference on Futuristic Technologies (INCOFT), pages 1-6. DOI: 10.1109/INCOFT55651.2022.10094463. Online publication date: 25-Nov-2022
  • (2019) Distributed enhanced suffix arrays. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-17. DOI: 10.1145/3295500.3356211. Online publication date: 17-Nov-2019
  • (2019) DGST. Parallel Computing, 87:C, pages 87-102. DOI: 10.1016/j.parco.2019.06.002. Online publication date: 1-Sep-2019
  • (2018) Sliding Suffix Tree. Algorithms, 11:8, article 118. DOI: 10.3390/a11080118. Online publication date: 3-Aug-2018
  • (2017) Horizontally scalable probabilistic generalized suffix tree (PGST) based route prediction using map data and GPS traces. Journal of Big Data, 4:1. DOI: 10.1186/s40537-017-0085-4. Online publication date: 19-Jul-2017
  • (2017) Parallel Construction of Suffix Trees and the All-Nearest-Smaller-Values Problem. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 12-21. DOI: 10.1109/IPDPS.2017.62. Online publication date: May-2017
  • (2017) Shared-Memory Parallelism Can Be Simple, Fast, and Scalable. Online publication date: 9-Jun-2017
  • (2015) Parallel distributed memory construction of suffix and longest common prefix arrays. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-10. DOI: 10.1145/2807591.2807609. Online publication date: 15-Nov-2015
