Efficient parallel construction of suffix trees for genomes larger than main memory

Authors:

Montse Farreras

EuroMPI '13: Proceedings of the 20th European MPI Users' Group Meeting

Pages 211 - 216

https://doi.org/10.1145/2488551.2488579

Published: 15 September 2013

Abstract

The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in bioinformatics. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. The methodologies required to analyze these data have also become increasingly complex, requiring fast queries to multiple genomes. In this paper we present Parallel Continuous Flow (PCF), a parallel suffix tree construction method that is suitable for very long strings. We tested our method by constructing the suffix tree of the entire human genome, about 3 GB. We show that PCF scales gracefully as the size of the input string grows. Our method achieves an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 nodes.
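
To put the reported efficiency figures in perspective, the short sketch below converts them into implied speedups. It assumes the standard definition of parallel efficiency, E = S / p (speedup divided by processor count), which the abstract does not spell out; the function name and variable names are illustrative and do not come from the paper.

```python
def speedup_from_efficiency(efficiency: float, processors: int) -> float:
    """Speedup implied by the assumed definition E = S / p, i.e. S = E * p."""
    return efficiency * processors


# (efficiency, processor count) pairs as reported in the abstract.
reported = [(0.90, 36), (0.55, 172)]

# Map each processor count to the speedup it would imply.
implied_speedups = {p: speedup_from_efficiency(e, p) for e, p in reported}

for p, s in implied_speedups.items():
    print(f"{p} processors: implied speedup of about {s:.1f}x")
```

Under that assumption, 36 processors correspond to a speedup of roughly 32x and 172 processors to roughly 95x, so the absolute speedup keeps growing even as efficiency drops.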

Published In

EuroMPI '13: Proceedings of the 20th European MPI Users' Group Meeting

September 2013

289 pages

ISBN:9781450319034

DOI:10.1145/2488551

General Chair: Jack Dongarra, University of Tennessee

Program Chairs: Javier Garcia Blas, University Carlos III, Spain; Jesus Carretero, University Carlos III, Spain

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ARCOS: Computer Architecture and Technology Area, Universidad Carlos III de Madrid

In-Cooperation

SIGHPC: ACM Special Interest Group on High Performance Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Ministerio de Ciencia e Innovación

Conference

EuroMPI '13

Sponsor:

ARCOS

EuroMPI '13: 20th European MPI Users' Group Meeting

September 15 - 18, 2013

Madrid, Spain

Acceptance Rates

EuroMPI '13 Paper Acceptance Rate 22 of 47 submissions, 47%;

Overall Acceptance Rate 66 of 139 submissions, 47%

Article Metrics

Total Citations: 10
Total Downloads: 139
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 17 Oct 2024

Cited By

Prosperi M, Marini S, Boucher C (2024). An Average-Case Efficient Two-Stage Algorithm for Enumerating All Longest Common Substrings of Minimum Length $k$ Between Genome Pairs. 2024 IEEE 12th International Conference on Healthcare Informatics (ICHI), pages 93-102. https://doi.org/10.1109/ICHI61247.2024.00020. Online publication date: 3-Jun-2024.

Hlybovets A, Didenko V (2023). Constructing Generalized Suffix Trees on Distributed Parallel Platforms. Cybernetics and Systems Analysis, 59(1):49-60. https://doi.org/10.1007/s10559-023-00541-x. Online publication date: 22-Feb-2023.

Chaturvedi S, Nagpal D, Tiwari V (2022). Lempel-Ziv-Welch (LZW) based Horizontally Scalable Route Prediction. 2022 International Conference on Futuristic Technologies (INCOFT), pages 1-6. https://doi.org/10.1109/INCOFT55651.2022.10094463. Online publication date: 25-Nov-2022.

Flick P, Aluru S, Taufer M, Balaji P, Peña A (2019). Distributed enhanced suffix arrays. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-17. https://doi.org/10.1145/3295500.3356211. Online publication date: 17-Nov-2019.

Zhu G, Guo C, Lu L, Huang Z, Yuan C, Gu R, Huang Y (2019). DGST. Parallel Computing, 87(C):87-102. https://doi.org/10.1016/j.parco.2019.06.002. Online publication date: 1-Sep-2019.

Brodnik A, Jekovec M (2018). Sliding Suffix Tree. Algorithms, 11(8):118. https://doi.org/10.3390/a11080118. Online publication date: 3-Aug-2018.

Tiwari V, Arya A (2017). Horizontally scalable probabilistic generalized suffix tree (PGST) based route prediction using map data and GPS traces. Journal of Big Data, 4(1). https://doi.org/10.1186/s40537-017-0085-4. Online publication date: 19-Jul-2017.

Flick P, Aluru S (2017). Parallel Construction of Suffix Trees and the All-Nearest-Smaller-Values Problem. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 12-21. https://doi.org/10.1109/IPDPS.2017.62. Online publication date: May-2017.

Shun J (2017). Shared-Memory Parallelism Can Be Simple, Fast, and Scalable. Online publication date: 9-Jun-2017.

Flick P, Aluru S, Kern J, Vetter J (2015). Parallel distributed memory construction of suffix and longest common prefix arrays. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-10. https://doi.org/10.1145/2807591.2807609. Online publication date: 15-Nov-2015.
