DOI: 10.5555/1315451.1315529
Article

OASIS: an online and accurate technique for local-alignment searches on biological sequences

Published: 09 September 2003

Abstract

A common query against large protein and gene sequence data sets is to locate targets that are similar to an input query sequence. Popular search tools, such as BLAST, employ heuristics to speed up such searches. However, these heuristics can miss targets, which in many cases is undesirable. The alternative is to use an accurate algorithm, such as the Smith-Waterman (S-W) algorithm, but accurate algorithms are computationally very expensive, which limits their use in practice. This paper takes on the challenge of designing an algorithm for local-alignment searches that is both accurate and efficient.
To meet this goal, we propose a novel search algorithm called OASIS. The algorithm employs a dynamic programming A*-search driven by a suffix-tree index built on the input data set. We experimentally evaluate OASIS and demonstrate that for an important class of searches, those with short query sequences, OASIS is more than an order of magnitude faster than S-W. In addition, the speed of OASIS is comparable to that of BLAST. Furthermore, OASIS returns results in decreasing order of matching score, making it possible to use OASIS in an online setting. Consequently, we believe that it may now be practically feasible to query large biological sequence data sets using an accurate local-alignment search algorithm.
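
For orientation, the exhaustive S-W baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the OASIS algorithm and not the authors' code: the function names `smith_waterman_score` and `search`, the match/mismatch scores, and the linear gap penalty are all assumptions chosen for brevity (real protein searches typically use a substitution matrix). It scores a query against every target with the standard local-alignment recurrence and only then sorts by score, so no result is available until every target has been scanned.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local-alignment score between query and target.

    Scoring parameters are illustrative assumptions (linear gap penalty).
    Only two rows of the dynamic programming table are kept in memory.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)  # 0 restarts the alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def search(query, targets):
    """Score every target exhaustively, then rank by decreasing score."""
    scored = [(smith_waterman_score(query, t), t) for t in targets]
    return sorted(scored, key=lambda pair: -pair[0])
```

The work in this sketch grows with the query length times the total length of all targets, which is the cost the abstract describes as limiting the practical use of accurate algorithms.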

Published In

VLDB '03: Proceedings of the 29th international conference on Very large data bases - Volume 29
September 2003
1134 pages

Sponsors

  • VLDB Endowment: Very Large Database Endowment

Publisher

VLDB Endowment

Conference

VLDB '03
Sponsor:
  • VLDB Endowment
VLDB '03: Very large data bases
September 9 - 12, 2003
Berlin, Germany
