A Novel Indexing Method for Efficient Sequence Matching in Large DNA Database Environment

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3518)

Abstract

In molecular biology, DNA sequence matching is one of the most crucial operations. Since DNA databases contain a huge volume of sequences, fast indexes are essential for processing DNA sequence matching efficiently. In this paper, we first point out the problems of the suffix tree, an index structure widely used for DNA sequence matching, in terms of storage overhead, search performance, and difficulty of seamless integration with a DBMS. Then, we propose a new index structure that resolves these problems. The proposed index structure consists of two parts: the primary part represents the trie as a binary bit string without any pointers, and the secondary part speeds up access to the leaf nodes of the trie that must be visited for post-processing. We also suggest efficient algorithms for DNA sequence matching based on this index. To verify the superiority of the proposed approach, we evaluate its performance through a series of experiments. The results reveal that the proposed approach, which requires less storage space, can be a few orders of magnitude faster than the suffix tree.
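
To make the idea of a pointer-free trie concrete, the following is a minimal sketch of one well-known way to store a trie over the DNA alphabet as a bit string. It is not the encoding used in the paper, whose details the abstract does not give. The assumptions here are: nodes are laid out in breadth-first order, each node occupies 4 bits (one per symbol in A, C, G, T), and a child is located by counting the set bits before it. The names `build_bit_trie` and `contains_prefix` are hypothetical.

```python
from collections import deque

ALPHABET = "ACGT"  # assumed symbol order; one bit per symbol in every node


def build_bit_trie(words):
    """Flatten a trie of `words` into a list of bits, 4 bits per node."""
    # Build an ordinary nested-dict trie first (illustration only).
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})

    # Emit nodes in breadth-first order. A set bit means "this child exists".
    bits = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for ch in ALPHABET:
            child = node.get(ch)
            bits.append(1 if child is not None else 0)
            if child is not None:
                queue.append(child)
    return bits


def contains_prefix(bits, query):
    """Return True if `query` is a prefix of some indexed word."""
    node = 0  # the root is node 0
    for ch in query:
        symbol = ALPHABET.find(ch)
        if symbol < 0:
            return False  # character outside the assumed alphabet
        pos = 4 * node + symbol
        if not bits[pos]:
            return False
        # The k-th set bit (counting from 1) belongs to node k, so the
        # child's number is the count of set bits up to and including pos.
        # A linear scan is used here for clarity only.
        node = sum(bits[: pos + 1])
    return True


# Index every suffix of a short sequence, as a suffix trie would.
sequence = "ACGA"
suffixes = [sequence[i:] for i in range(len(sequence))]
bits = build_bit_trie(suffixes)

print("".join(map(str, bits)))
print(contains_prefix(bits, "CGA"))
print(contains_prefix(bits, "GT"))
```

The child lookup replaces a stored pointer with a count of set bits, which is why the structure needs no pointers. A practical implementation would precompute these counts per block rather than scanning from the start of the bit string.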

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Won, JI., Yoon, JH., Park, S., Kim, SW. (2005). A Novel Indexing Method for Efficient Sequence Matching in Large DNA Database Environment. In: Ho, T.B., Cheung, D., Liu, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science, vol 3518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11430919_26

  • DOI: https://doi.org/10.1007/11430919_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26076-9

  • Online ISBN: 978-3-540-31935-1

  • eBook Packages: Computer Science, Computer Science (R0)
