skip to main content
10.1145/1559845.1559931acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Serial and parallel methods for i/o efficient suffix tree construction

Published: 29 June 2009 Publication History

Abstract

Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we realize a scalable suffix tree construction algorithm.
To deal with the aforementioned challenge, the past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting -- for e.g., indexing the entire Human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this paper, first, we empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation -- to glean reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality properties of existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, second, we show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be re-architected to build a suffix tree in a tiled fashion, allowing the implementation to maintain a constant working set size and fixed memory footprint when indexing strings of any size. Third, we show how improved locality of reference coupled with effective collective communication facilitates an efficient parallelization on massively parallel systems like the IBM Blue Gene/L. Finally, we empirically show that the proposed approach affords improvements of several orders of magnitude when indexing large strings. Furthermore, we demonstrate that the proposed parallelization is scalable and allows one to index the entire Human genome on a 1024 processor system in under 15 minutes.

References

[1]
S. Bedathur and J. Haritsa. Engineering a fast online persistent suffix tree construction. In Proceedings of the IEEE International Conference on Data Engineering 2004.
[2]
S. Bedathur and J. Haritsa. Search optimized suffix tree storage for biological applications. In Proceedings of the IEEE International Conference on High Performance Computing 2005.
[3]
N. Bray, I. Dubchak, and L. Pachter. AVID:A global alignment program. Genome Research 13(1), 2003.
[4]
A. Brown. Constructing genome scale suffix trees. In Proceedings of the Asia-Pacific Bioinformatics Conference 2004.
[5]
M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation, 1994.
[6]
A. Carvalho, A. Freitas, A. Oliveira, and M. Sagot. Efficient extraction of structured motifs using box links. In Proceedings of the 11th Conference on String Processing and Information Retrieval 2004.
[7]
W. Chang and E. Lawler. Sublinear approximate string matching and biological applications. Algorithmica 12(4/5), 1994.
[8]
C. Cheung, J. Yu, and H. Lu. Constructing suffix trees for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering 17(1), 2005.
[9]
R. Clifford and M. Sergot. Distributed and paged suffix trees for large genetic databases. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching 2003.
[10]
A. Delcher, S. Kasif, R. Fleischmann, J. Peterson, O. White, and S. Salzberg. Alignment of whole genomes. Nucleic Acids Res. 27(11), 1999.
[11]
A. Delcher, A. Phillippy, J. Carlton, and S. Salzberg. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30(1), 2002.
[12]
M. Farach-Colton, P. Ferragina, and S. Muthukrishnan. Overcoming the memory bottleneck in suffix tree construction. In Proceedings of the Annual Symposium on Foundations of Computer Science 1998.
[13]
D. Gus field. Algorithms on strings, trees, and sequences: Computer science and computational biology Cambridge University Press, Cambridge, 1997.
[14]
D. Gus Field and J. Stoye. Linear time algorithms for Finding and representing all the tandem repeats in a string. Journal of Computer and System Sciences 69(4), 2004.
[15]
E. Hunt, M. Atkinson, and R. Irving. A database index to large biological sequences. In Proceedings of 27th International Conference on Very Large Databases 2001.
[16]
R. Japp. The top-compressed suffix tree:A disk resident index for large sequences. In Proceedings of the Bioinformatics Workshop at the 21st Annual British National Conference on Databases 2004.
[17]
S. Kurtz, J. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. Reputer: The manifold applications of repeat analysis on a genome scale. Nucleic Acids Res. 29, 2001.
[18]
S. Kurtz, A. Phillippy, A. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. Salzberg. Versatile and open software for comparing large genomes. Genome Bio. 5(R12), 2004.
[19]
E. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 1976.
[20]
C. Meek, J. Patel, and S. Kasetty. Oasis: An online and accurate technique for local-alignment searches on biological sequences. In Proceedings of 29th International Conference on Very Large Databases 2003.
[21]
NCBI. Public collections of dna and rna sequence reach 100 gigabases. http://www.nlm.nih.gov/news/press_releases/dna_rna_100_gig.html.
[22]
B. Phoophakdee and M. Zaki. Genome-scale disk-based suffix tree indexing. In Proceedings of the ACM International Conference on Management of Data 2007.
[23]
B. Phoophakdee and M. Zaki. Trellis+: An effective approach for indexing massive sequences. In Proceedings of the Pacific Symposium on Biocomputing 2008.
[24]
K. Schurmann and J. Stoye. suffix tree construction and storage with limited main memory. Technical report, Universitat Bielefeld, 2003.
[25]
Y. Tian, S. Tata, R. Hankins, and J. Patel. Practical methods for constructing suffix trees. VLDB Journal 14(3), 2005.
[26]
E. Ukkonen. Constructing suffix trees on-line in linear time. In Proceedings of the IFIP 12th Work Computer Congress on Algorithms, Software, Architecture: Information Processing 1992.
[27]
P. Weiner. Linear pattern matching algorithms. In Proceedings of 14th Annual Symposium on Switch and Automata Theory 1973.
[28]
D. Yankov, E. Keogh, and U. Rebbapragada. Disk-aware discord discovery:Finding unusual time series in tera-byte sized datasets. In Proceedings of the IEEE International Conference on Data Mining 2007.
[29]
O. Zamir and O. Etzioni. Web document clustering: a feasibility demonstration. In Proceedings of 21th International Conference on Research and Development in Information Retrieval 1998.

Cited By

View all
  • (2022)Full-Text (Substring) Indexes in External Memory10.1007/978-3-031-01885-5Online publication date: 2-Mar-2022
  • (2020)Suffix rankProceedings of the VLDB Endowment10.14778/3407790.340786113:12(2787-2800)Online publication date: 14-Sep-2020
  • (2020)Parallel Generalized Suffix Tree Construction for Genomic DataAlgorithms for Computational Biology10.1007/978-3-030-42266-0_1(3-15)Online publication date: 13-Apr-2020
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
June 2009
1168 pages
ISBN:9781605585512
DOI:10.1145/1559845
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 June 2009

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. disk-based
  2. external memory
  3. genome indexing
  4. parallel
  5. sequence indexing
  6. suffix tree

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '09
Sponsor:
SIGMOD/PODS '09: International Conference on Management of Data
June 29 - July 2, 2009
Rhode Island, Providence, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5
  • Downloads (Last 6 weeks)1
Reflects downloads up to 17 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2022)Full-Text (Substring) Indexes in External Memory10.1007/978-3-031-01885-5Online publication date: 2-Mar-2022
  • (2020)Suffix rankProceedings of the VLDB Endowment10.14778/3407790.340786113:12(2787-2800)Online publication date: 14-Sep-2020
  • (2020)Parallel Generalized Suffix Tree Construction for Genomic DataAlgorithms for Computational Biology10.1007/978-3-030-42266-0_1(3-15)Online publication date: 13-Apr-2020
  • (2019)Distributed enhanced suffix arraysProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3295500.3356211(1-17)Online publication date: 17-Nov-2019
  • (2019)DGSTParallel Computing10.1016/j.parco.2019.06.00287:C(87-102)Online publication date: 1-Sep-2019
  • (2017)Parallel Construction of Suffix Trees and the All-Nearest-Smaller-Values Problem2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS.2017.62(12-21)Online publication date: May-2017
  • (2017)A Distributed Approach for Constructing Generalized Suffix Tree on Spark by Using Optimized Elastic Range Algorithm2017 Fifth International Conference on Advanced Cloud and Big Data (CBD)10.1109/CBD.2017.28(117-122)Online publication date: Aug-2017
  • (2015)Suffix Tree Constructing Algorithm for Datasets with Discrete ContentsApplied Mechanics and Materials10.4028/www.scientific.net/AMM.733.867733(867-870)Online publication date: Feb-2015
  • (2015)Parallel distributed memory construction of suffix and longest common prefix arraysProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/2807591.2807609(1-10)Online publication date: 15-Nov-2015
  • (2015)String AlgorithmsDistributed and Sequential Algorithms for Bioinformatics10.1007/978-3-319-24966-7_5(81-110)Online publication date: 1-Nov-2015
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media