Abstract
In molecular biology, DNA sequence matching is one of the most crucial operations. Because DNA databases contain a huge volume of sequences, fast indexes are essential for processing DNA sequence matching efficiently. In this paper, we first point out the problems of the suffix tree, an index structure widely used for DNA sequence matching, with respect to storage overhead, search performance, and the difficulty of seamless integration with a DBMS. We then propose a new index structure that resolves these problems. The proposed index structure consists of two parts: the primary part represents the trie as a binary bit string without any pointers, and the secondary part provides fast access to the leaf nodes of the trie that must be visited for post-processing. We also suggest efficient algorithms for DNA sequence matching based on this index. To verify the superiority of the proposed approach, we evaluate its performance through a series of experiments. The results reveal that the proposed approach, while requiring less storage space, can be a few orders of magnitude faster than the suffix tree.
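The abstract does not specify the bit-level layout of the primary part, so the following is only a minimal sketch of one generic way a trie can be stored as a bit string without pointers. It assumes a level-order layout with four bits per node (one per symbol of A, C, G, T) and a fixed truncation depth. The names `build_trie`, `encode_bits`, and `contains` are hypothetical, and the secondary part for leaf access is not modeled.

```python
from collections import deque

ALPHABET = "ACGT"  # assumed symbol order; each node gets one bit per symbol


def build_trie(sequence, depth):
    """Build a nested-dict trie of every suffix of `sequence`, truncated to `depth`."""
    root = {}
    for start in range(len(sequence)):
        node = root
        for ch in sequence[start:start + depth]:
            node = node.setdefault(ch, {})
    return root


def encode_bits(root):
    """Flatten the trie in level order: 4 bits per node, 1 where a child exists."""
    bits = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for ch in ALPHABET:
            child = node.get(ch)
            bits.append(0 if child is None else 1)
            if child is not None:
                queue.append(child)
    return bits


def contains(bits, pattern):
    """Walk the bit string without pointers.

    Nodes are numbered in level order with the root as node 0, so the child
    reached through the k-th set bit (counting from 1) is node k.
    """
    node = 0
    for ch in pattern:
        pos = 4 * node + ALPHABET.index(ch)
        if pos >= len(bits) or bits[pos] == 0:
            return False
        # Rank query: the number of 1s up to and including pos is the child's index.
        node = sum(bits[:pos + 1])
    return True


bits = encode_bits(build_trie("ACGTAC", depth=3))
print(contains(bits, "GTA"), contains(bits, "CA"))
```

The rank query is computed here with a linear scan for clarity; a practical structure would precompute rank information so that each step takes constant time.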
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Won, JI., Yoon, JH., Park, S., Kim, SW. (2005). A Novel Indexing Method for Efficient Sequence Matching in Large DNA Database Environment. In: Ho, T.B., Cheung, D., Liu, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science, vol 3518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11430919_26
DOI: https://doi.org/10.1007/11430919_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26076-9
Online ISBN: 978-3-540-31935-1
eBook Packages: Computer Science (R0)