A Novel Indexing Method for Efficient Sequence Matching in Large DNA Database Environment

Won, Jung-Im; Yoon, Jee-Hee; Park, Sanghyun; Kim, Sang-Wook

doi:10.1007/11430919_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3518)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

In molecular biology, DNA sequence matching is one of the most crucial operations. Because DNA databases contain a huge volume of sequences, fast indexes are essential for processing DNA sequence matching efficiently. In this paper, we first point out the problems of the suffix tree, an index structure widely used for DNA sequence matching, with respect to storage overhead, search performance, and the difficulty of seamless integration with a DBMS. We then propose a new index structure that resolves these problems. The proposed index structure consists of two parts: the primary part realizes the trie as a binary bit-string representation without any pointers, and the secondary part speeds up access to the leaf nodes of the trie that must be visited during post-processing. We also suggest efficient algorithms for DNA sequence matching based on this index. To verify the superiority of the proposed approach, we evaluate its performance through a series of experiments. The results reveal that the proposed approach, while requiring less storage space, can be a few orders of magnitude faster than the suffix tree.
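
The abstract does not give the bit-string encoding, so the following is only a minimal sketch of one common way to store a trie without pointers, not the authors' actual structure. It assumes fixed-length windows over the alphabet A, C, G, T: each internal node is written in level order as four bits (one per base), a child is located by counting the 1-bits up to its bit, and a separate list stands in for the secondary part that maps leaves to sequence positions. The names `build_index`, `search`, and `depth` are hypothetical.

```python
ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def build_index(sequence, depth):
    """Index every window of `depth` bases as a pointerless, level-order trie."""
    # Collect the start positions of each distinct window.
    positions = {}
    for start in range(len(sequence) - depth + 1):
        positions.setdefault(sequence[start:start + depth], []).append(start)
    windows = sorted(positions)

    # Primary part: 4 bits per internal node, nodes written level by level.
    bits = []
    level = [""]  # prefixes present at the current level, in sorted order
    for d in range(depth):
        next_level = sorted({w[:d + 1] for w in windows})
        present = set(next_level)
        for prefix in level:
            for base in ALPHABET:
                bits.append(1 if prefix + base in present else 0)
        level = next_level

    # Secondary part (illustrative): leaf number -> positions in the sequence.
    leaves = [positions[w] for w in windows]
    return bits, leaves


def search(bits, leaves, query):
    """Return start positions of `query`; its length must equal the index depth."""
    node = 0  # the root is node 0 in level order
    for base in query:
        j = 4 * node + ALPHABET.index(base)
        if not bits[j]:
            return []  # no such branch in the trie
        # The k-th 1-bit in the bit string belongs to node number k.
        # A real index would use a precomputed rank table instead of sum().
        node = sum(bits[:j + 1])
    # Leaves are numbered right after the len(bits) // 4 internal nodes.
    return leaves[node - len(bits) // 4]


bits, leaves = build_index("ACGTACGA", 3)
print(search(bits, leaves, "ACG"))  # [0, 4]
print(search(bits, leaves, "AAA"))  # []
```

In this sketch the whole trie for the example occupies 36 bits plus the leaf list, which illustrates why dropping pointers reduces storage; the linear-time `sum` is kept only for brevity.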

Author information

Authors and Affiliations

Department of Computer Science, Yonsei University, Korea
Jung-Im Won & Sanghyun Park
Division of Information Engineering and Telecommunications, Hallym University, Korea
Jee-Hee Yoon
College of Information and Communications, Hanyang University, Korea
Sang-Wook Kim

Editor information

Editors and Affiliations

Japan Advanced Institute of Science and Technology, Asahidai 1-1, 923-12292, Nomi, Japan
Tu Bao Ho
University of Hong Kong, Pokfulam Road, Hong Kong, China
David Cheung
Department of Computer Science and Engineering, Arizona State University, Tempe, Arizona, USA
Huan Liu

About this paper

Cite this paper

Won, JI., Yoon, JH., Park, S., Kim, SW. (2005). A Novel Indexing Method for Efficient Sequence Matching in Large DNA Database Environment. In: Ho, T.B., Cheung, D., Liu, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science, vol 3518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11430919_26

DOI: https://doi.org/10.1007/11430919_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26076-9
Online ISBN: 978-3-540-31935-1
eBook Packages: Computer Science, Computer Science (R0)
