CA2357263A1 - New methods for faster and more sensitive homology search in dna sequences - Google Patents

New methods for faster and more sensitive homology search in dna sequences

Info

Publication number
CA2357263A1
CA2357263A1 CA 2357263 CA2357263A CA2357263A1 CA 2357263 A1 CA2357263 A1 CA 2357263A1 CA 2357263 CA2357263 CA 2357263 CA 2357263 A CA2357263 A CA 2357263A CA 2357263 A1 CA2357263 A1 CA 2357263A1
Authority
CA
Grant status
Application
Patent type
Prior art keywords
method
model
spaced
patternhunter
hit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA 2357263
Other languages
French (fr)
Inventor
Ming Li
Bin Ma
John Tromp
Original Assignee
BIOINFORMATICS SOLUTIONS INC.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F19/00Digital computing or data processing equipment or methods, specially adapted for specific applications
    • G06F19/10Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology
    • G06F19/22Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology for sequence comparison involving nucleotides or amino acids, e.g. homology search, motif or Single-Nucleotide Polymorphism [SNP] discovery or sequence alignment
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F19/00Digital computing or data processing equipment or methods, specially adapted for specific applications
    • G06F19/10Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology
    • G06F19/24Bioinformatics, i.e. methods or systems for genetic or protein-related data processing in computational molecular biology for machine learning, data mining or biostatistics, e.g. pattern finding, knowledge discovery, rule extraction, correlation, clustering or classification

Description

New Methods For Faster And More Sensitive Homology Search in USIA Sequences Ming Li, Bin Ma, John Tromp Bioinformatics Solutions Inc., 145 Columbia Street West, Waterloo, Ont N2L 3L2, Canada Title: New Methods For Faster And More Sensi- methods to find homologies (or approximately re tive Homology Search in DNA Sequences peating patterns) within one or between two DNA
sequences.
Applicant: Bioinformatics Solutions Inc.
Inventors:
Ming Li 155 Hillcrest Ave, Apt. 1215 Missis-sauga, Ont. L5B 3Z2, Canada Bin Ma Bioinformatics Solutions Inc., 145 Columbia Street West, Unit 2B, Water-loo, Ont N2L 3L2, Canada .lohn Tromp Bioinformatics Solutions Inc.
145 Columbia Street West, Unit 2B, Wa-terloo, Ont N2L 3L2, Canada Our Reference: 08-Our F=ile No.: spec.001 Date Printed: August 28, 2001 1 Introduction The present invention relates generally to the biotechnology subfield of bioinformatics, which studies methods of processing and analyzing ge-nomic and proteomic information. The field of bioinformatics is at the intersection of computer science' and molecular biology.
More specifically, the present invention provides a new method and software that improves current Background of the Invention For the first time in our natural history, we have available (or soon will) complete genomic se-quences of H. Sapiens, C. elegans, A. thaliana, D.
melanogaster, M. musculus, S. pombe, S. cere-visiae, rice, dozens of prokaryote genomes, and hundreds of virus genomes. See (2, 3~ for the ini-tial sequences of the human genome. However, a lot of important information in this enormous and exponentially growing gold mine will go to waste without the proper tools to mine it.
One class of crucial tools is homology search pro-grams for finding similar regions within one or be-tween two DNA sequences.
Genomics studies routinely depend on such ho-urology search tools. Existing search tools already prove inadequate to handle the amount of biologi-cal sequences currently available, as shown in Fig-ures 7, 4, 5, 6. Mlore sensitive and more efficient homology search tools are urgently needed.
The main purpose of this invention is to provide a faster and more sensitive method for finding ho-mologies in one or between two DNA sequences.
This method works especially well for very long se-quences, such as complete genomes.

Many algorithms and programs have been devel-oped for the task. These include FASTA [9J, SIM t (13J, the Blast family (1, 4, 6, 7, 5J, SENSEI [8J, s MUMmer (10J, QUASAR (11J, and REPuter (12J.
Given two long DNA sequences, exhaustively comparing all bases against all bases is well-known to be too slow. Two lines of approach lead to improvements. The first is exemplified by Blast, t which is used routinely by thousands of scientists. s This approach finds short exact "seed" matches r (hits), which are then extended into longer align- i ments. However, when comparing two very long sequences, FASTA, SIM, Blastn (BL2SEQ), WU-Blast, and Psi-Blast run very slow and need large amounts of memory. SENSEI is somewhat faster and uses much less memory than the above pro-grams, but is currently limited to ungapped align-ments. MegaBlast runs quite efficiently with its default gap scores and large seed length of 28 but turns out to have worse output quality and doesn't scale as well to huge sequences. This class of the programs, typically represented by Blast, all de-pend on the strategy of finding short seed matches which are then extended. We will refer to this class , of programs as Blast-type programs. Blast-type , programs exhibit a tradeoff between sensitivity and speed according to the chosen seed size.
Another line of approach, exemplified by MUM-mer, QUASAR and REPuter, uses suffix trees. Suf fix trees suffer from two problems: They are meant to deal with precise matches and are limited to comparison of highly similar sequences [10, 11, 12J.
They are very awkward in handling mismatches.
The second problem with suffix trees is that they have an intrinsic large space requirement. Due to these obstacles, it is not expected that this line of approach would lead to practical homology soft-ware with quality comparable to Blast-type algo-rith ms.

The key to improved homology search is to refine the Blast-type approach to improve sensitivity and speed at the same same.
This invention introduces a novel seed model that simultaneously inr_reases sensitivity and search speed. In addition, this invention also introduces new methods of building gapped alignments. This invention has been implemented in the portable Java program "PatternHunter". At Blast levels of sensitivity, it is able to find homologies between se-quences as large as human chromosomes, in mere hours on a desktop. This by far exceeds the power and quality of competing programs.
On a modern desktop, PatternHunter's running time ranges from seconds for prokaryotic genomes to minutes for arabidopsis chromosomes to hours for human chromosomes, with very modest mem-ory use, and at provably higher sensitivity than the default Blastn.
One particular application of this task is in com-parative genomics where large genomes or chromo-somes such as the human one (2, 3~ need to be com-pared. Another applications is cross species com-parison to assist the sequence assembly in shotgun sequencing. For example, BSI recently has been involved in the moused genome project that uses PatternHunter to find all the homologies between 16 million reads {of about 500 base pairs each) of mouse genome and the 3 gigabases of the human genome.

2 Detailed Description of Pre-ferred Embodiments of the Invention 2.1 Claim l: Super Seeds For Homol-ogy Search.
A dilemma for a Blast type of search is that large seeds lose distant homologies while small ones cre-ates too many random hits which slow down the computation. This invention introduces a new idea that allows PatternHunter to have a higher prob-ability of a hit in a homologaus region, while hav- , ing a lower expected number of random hits, thus , improving sensitivity while increasing speed at the same time.
i Blast looks for matches of k (default k = 11 in Blastn and k = 28 in MegaBlast) consecutive let-ters as seeds. Instead, this invention proposes to , use non-consecutive, or 'spaced' k letters as seeds. i Furthermore, this invention proposes to use spaced seeds that are optimized for maximum sensitivity.
We observe that randomly generated seeds also I
achieve reasonable sensitivity much higher than the consecutive seed. This invention also covers the use of randomly generated spaced seed for the pur-pose of obtaining high sensitivity for the purpose of homology search. Call the relative positions of the k letters a model, and k its weigf~t.
This seemingly simple change has a surprisingly large effect on sensitivity. An appropriately chosen model can have a significantly higher probability of having at (east one hit in a homologous region, compared to Blast's consecutive seed model, even while having a lower expected number of hits. For example, in a region of length 64 with 70% identity, I
in a one:-million-run simulation, Blast's consecutive weight :11 model has a 0.34 probability of having at least one hit in the range, while a nonconsecutive model of the same weight has a 0.466 probability of getting a hit, see Figure 1. On the other hand, the expected number of hits in that region by the Blast consecutive model is 1.07, while the noncon-secutive model expects 0.93 hits, about 14% less.
For convenience, denote a model by a 0-1 string, where the 1-positions represent required matches, while the Os are "don't cares". For example, if a weight 6 model 11101.11 is used, then actgact , v.s. acttact is a seed match, as well as actgact v.s. actgact. So for example, Blast uses models of the form l~. This invention proposes to use optimal seed models that maximize sensitivity while reducing random hits.
To evaluate a model, we compute its probabil-ity of generating a hit in a fixed length region of given similarity, by dynamic programming. Fig-ures 1, 3, and 2 compare a nonconsecutive models with Blast's consecutive models. For each similar-ity percentage shown on the x-axis, the percentage of length-64 regions acquiring at least 1 hit is plot-ted on the y-axis as the sensitivity at that similarity level in Figure 1.
. ' , ., ; ;, __ 9.9 , ",."., o.s r a o-s a °' 0.3 o.z ,;
o., __ o ~........ . _.~,_ ~ . . ___ 0.2 0.7 0.1 0.5 0.9 0.7 0.8 0.9 , GYMiflly over x,161 Figure 1: 1-hit performance of weight 11 spaced model versus weight 1:l and 10 consecutive mod-els, coordinates in logrithmic scale.
SENSEI uses a default seed size of 8; Fig-ure 2 compares its sensitivity with that of a spaced weight 9 model.
0.9 O.A
0.7 0.0 i 0.5 D.d 0.3 0, 0.2 0.3 O.d 0.5 O.G 0.7 0.8 0.9 1 iKIWerlfy Figure 2: 1-hit performance of weight 8 consecu-tive model versus weight 9 nonconsecutive model Theoretically the expected number of hits in a region can be easily calculated as in the following Lemma.
Lemma 1 The expected number of hits of a weight W, length M model within a length L re-gion of similarity 0 < p < 1, is (L - M + 1)pj't'.
Proof. The expected number of hits is the sum, over the (L - M + 1) possible positions of fitting the model within the region, of the probability of W specific matches, the latter being pi'i'. o Comments:
~ By Lemma 1, for a region length of 64, Blast seed of length 11, the expected num-ber of hits of a nonconsecutive seed of length 18 and weight 11 is about 14% lower than Blast, speeding up hit processing by the same amount (this is offset by the longer time needed to lookup a spaced seed). On the other hand, observing Figure 1, the spaced model al-ways has a significantly higher probability of at least one hit.

~ It has been brought to the inventors' at-tention, that a related but conceptually dif ferent approach of locally-sensitive hashing (LSH) (14J has been applied to ungapped ho-mology search in (15J. LSH is a random hashing/projection technique unsuitable for gapped homologies. In (15J, in each of hun-dreds of iterations, a newly chosen random hash function is applied to every region of a fixed size (of about 100), and regions mapping to the same value are fully compared. Simi-lar overlapping regions on the same diagonal are then merged into ungapped alignments.
Unlike Blast, a long ungapped alignment can only be found if the regions found to be sim-ilar cover its whole length. Retrospectively, our carefully chosen deterministic spaced seed model maximizes the chance of any HSP to contain at least one seed, while minimizing random hits. Experiments show that SENSEI
(8J (which is also limited to ungapped align-ment), at its default size 8 seed, is faster than LSH.
~ Several programs" including SENSEI, Exoner-ate, and Blastn, allow a mismatch in a con-secutive length k-seed matches. This idea was well-known before' current invention. The dif-ference between this old idea and the current invention is clear: the current invention maxi-mizes the sensitivity with an optimized spaced seed, while the previous approaches of SEN-SEI, Exonerate, Blastn and Buhler's LSH do not.
The current invention covers the use of one or more optimized as well as not-optimized pre-computed or randomly generated spaced seed model for the purpose of improving sensitivity of the search.

Claim 2. Double, 'I'riple, or k Hits Us-ing Spaced Model In order to improve selectivity, this invention pro-poses to apply the multiple hit method using the spaced model. 1.e., hits of the spaced seed are only extended if multiple ones occur close together on a single diagonal.
The idea of double hits is not new. The current 1.4 version of Blast triggers an extension if two dis-joint hits are found on the same diagonal within a certain distance (6~. The increased selectivity more than offsets the loss in sensitivity, so that it can use a smaller weight model and still generate fewer extensions than an equally sensitive 1-hit model of larger weight.
What is new here is that with spaced models, hits are no longer required to be disjoint in order to gain a lot of sensitivity. Figure 3 compares the sensitivity of a double hit spaced weight 11 model against single hit weight 11 and 12 consecutive models.
2nu mGmsotoi:~,p~f............
r.
W trir,ftttt -o.a i o.s s _ 0.4 0.2 0 . _-T--...--0.2 O.J 0.1 0.5 O.G 0.7 O.s 0.9 t cimilarily Figure 3: 2-hit performance of weight 11 spaced model versus single hit weight 11 and 12 consec-utive models.
Remarkably, Figures 1, 3, 2 show that the steeper curve of the spaced seed model has smaller hit probability in low similarity regions, with respect to the closest consecutive model in terms of sensi-tivity. This phenomenon further reduces unwanted hits in low-similarity regions. In fact, Figures 1 and 2 show that we can use a spaced model of weight 9 to replace a consecutive weight 8 model, gaining sensitivity above 64% similarity, or use a weight 11 spaced model to replace a weight 10 consecutive model gaining sensitivity above 60%. According to Lemma 1, the weight W~ spaced model has only a fraction p of the hits of the weight (W-1) consec-utive model over all p similarity regions. In the ad-mittedly artificial case where all such regions have 60% similarity and length 64, the length 18 weight spaced model has only 52% of the hits of the equally sensitive weight 11 consecutive model.
Claim 3. PatternHunter Method Steps PatternHunter was implemented in Java using the spaced seed model and various algorithmic im-provements using advanced data structures. Its key steps and inventions are decribed in the following.
Hit Generation PatternHunter uses a method for generating hits comparable to MegaBlast. For each position in one sequence, compute an index from fitting the model at that particular position. This index is 2*weight bits long (2 bits per base). Then do a lookup in a big table which gives the first position in the other sequence where the model matches.
This gives the first hit. Subsequent hits are found using another table, which for each position gives the next position where the model matches. This table requires one int (4 bytes) per base.
For each hit, PatternHunter looks up its diago-nal (position in second sequence minus positioin in first sequence) in another hashtable, the hit table, to find the rightmost matched position on that di-agonal. If this position is to the right of the hit then the hit is ignored as being part of an already found match.
If the double hit option is chosen then in absence of a recent hit on the same diagonal, the new one is merely recorded.
Alternatively, PatternHunter can trigger exten-sion on multiple hits on nearby diagonals, rather than on the exact same one. This is implemented efficiently by an additional 'banded' hit table, that stores hits by the band they occur in, each band consisting of some number R of consecutive diag-onals. Integer-dividing the diagonal of a hit by R
gives its band index. If there are 1~ hits all within R
diagonals of each other, then they necessarily occur within 2 adjacent bands, and this can be immedi-ately checked with the banded hit table.
Hit Extension Next this hit is extended in a greedy fashion to the left and right, stopping when the score drops by a certain amount. If the resulting segment pair has a score below a certain minimum, then it is ignored, else this is a Highscoring Segment Pair (HSP). The position of the last comparison, which reached the dropoff score, is stored in the hit table, so that future equivalent hits within this HSP can be recognized as redundant.
Gapping extension using shorter spaced seed models and the current HSP tree To find the best way to extend a HSP to the left across gaps, PatternHunter algorithms tries all can-didates from a diagonal-sorted set of recently found HSP, after adding to this set some new HSPs found by local hit generation. A variation of a red-black c tree is used to implement the set of HSPs sorted by diagonal. HSPs are inserted in the tree once an optimal gapped alignment to its left is found, and retired from the tree' once newly generated HSPs are too far beyond its right endpoint to make use of it. Retired alignments are put into a priority queue according to their scores.
The local hit generation finds triple hits of a smaller spaced model, such as 1101 or a similar bit string, in a limited length region to the left of the HSP and stores them in the tree if they have a certain minimum length.
For each candidate HSP to gap a newly found HSP to, the gapping cost is computed as the sum of the gap open plus gap extension penalties plus the cost of adjusting either HSP in size to make a perfect fit. From this data the best HSP, if any, to link to, is chosen and used to compute the optimal partial alignment scare. Overlapping alignments are not reported.
By default, PatternHunter allows a maximum gap length of 256, which can be done quite ef-ficiently with its diagonal ordered tree of recent HSPs, and often can be seen to make it use a single alignment where other programs output two sepa-rate ones.
The Achievement Of This Inven-tion Several test runs of PatternHunter with compari-son to other programs are reported here in order to demonstrate the power of this invention. Since the Blast family, especially the newly improved Blastn, is the industry standard, and widely recognized for its sensitivity (Blastn, SENSEI) and speed (Blastn, MegaBlast), comparison will be limited to these programs. All experiments are performed on a 700 MHz Pentium III PC with lGbyte of memory. The table in Figure 7 compares PatternHunter with the latest versions of Blastn and MegaBlast, down-loaded from the NCBI website. All programs were run without filtering (bl2seq option -F F) to en-sure identical input to the actual matching engines.
The table in Figure 8 compares PatternHunter with SENSEI; note that SENSEI, as currently available, does not do any gapped alignments. One may suspect that PatternHunter sacrifices quality for speed. Figures 4, 5, ~ show the opposite. In Figure 4, MegaBlast using seed weight 28 (MB28) misses over 700 high scoring alignments. Using the same parameters, PatternHunter outputs better results than Blastn, is 20 times faster and uses one tenth the memory, Figure 5. Notice the quick growth of Blastn/MegaBlast time/space requirements, indi-cating poor scalability. Only MegaBlast (MB28) at its default affine gap costs allowed further the comparison, without running out of memory, but with vastly inferior output quality compared to Pat-ternHunter (PH2), which uses only one fifth the time and one quarter the space, Figure 6. While MegaBlast is designed for high speed on highly sim-ilar sequences and Blastn for sensitivity, Pattern-Hunter simultaneously exceeds Blastn in sensitiv-ity, MegaBlast in speed {on long sequences), and both in memory use. Written in Java, it runs any genome anywhere.
Conclusion While particular embodiments of the present in-vention have been shown and described, it is clear that changes and modifications may be made to such embodiments without departing from the true scope and spirit of the invention. It is also clear that the present invention also applies to homol-ogy search in protein {amino acid) sequences.

.Y~.~ _.__T~__. , ' F,H~...

41!!I
~\
100 i.~\
w ,o ____. ~ --,o ,ao ,ooo ,ooa>
e~Yymxr~l rank Figure 4: Input: H. influenza and E. coli. Score is plotted as a function of the rank of the align-ment, with both axes logarithmic. MegaBlast (M828) misses over 700 alignments of score at least 100. M811 is MegaBlast with seed size 11 (50 times slower and 10 times more memory use than PH2), indicating the missed alignments by MB28 are mainly due too seed size.
The method steps of the invention may be em-bodied in sets of executable machine codes stored in a variety of formats such as object code or source code. Such code is described generically herein as programming code, or a computer program for sim-plicity. Clearly, the exe<:utable machine code may be integrated with the code of other programs, im-plemented as subroutines, by external program calls or by other techniques as known in the art.
The embodiment of 'the invention may be exe-cuted by a computer processor or similar device pro-grammed in the manner of method steps, or may be executed by an electronic system which is pro-vided with means for executing these steps. Simi-larly, an electronic memory medium such as com-puter diskettes, CD-Roms, Random Access Mem-ory (RAM), Read Only Memory (ROM) or simi-lar computer software storage media known in the art, may be programmed to execute such method ,ooo ,oo Figure 5: Input: H. in, f luenza and E. coli. Pat-ternHunter produces better quality output than Blastn while running 20 times faster.
steps. As well, electronic signals representing these method steps may also be transmitted via a com-munication network.
The invention could for example be applied to personal computers, super computers, main frames, application service providers (ASPs), In-ternet servers, smart terminals or personal digital assistants. Again, such implementations would be clear to one skilled in the art, and do not take away from the invention.

,o ,o ,oo ,ooo ,oooo ,ooooo ~ ------,- _ uza ---.: .,n ., ,oooo ~-__.;__~_,.._v ,aoo ,ao - -. . . -. - .. .. -, ,o ,oo ,ooo ,ooao ,ooooo Figure 6: Input: A. thaliana chr 2 and chr 4. PatternHunter (PH2) outscores MegaBlast in one sixth of the time and one quarter the mem-ory. Both programs used MegaBlast's non-affine gap costs (with gapopen 0, gapextend -7, match 2, and mismatch -6) to avoid MegaBlast from running out of memory. For comparison we also show the curve for MegaBlast with its default low complexity filtering on, which decreases its runtime more than sixfold to 3305 seconds.
..,\

References (1~ S.F. Altschul, W. Gish, W. Miller, E. My-ers, D.J. Lipman, Basic local alignment search tool. J. Mol. 8iol., 215, 403-410 (1990).
(2~ International Human Genome Sequencing Consortium, Initial sequencing and analysis of the human genome. Nature 409, 860-921 {2001 ).
(3~ J.C. Venter et al, The sequence of the human genome. Science 291, 1304 {2001).
(4] W. Gish, WU-Blast website:
http://blast.wustl.edu.

(5] T.A. Tatusova, T.L. Madden, Blast 2 se-quences - a new tool for comparing protein and nucleotide sequences. FEMS Microbiol.
Lett. 174, 247-250 {1999).
(6] S.F. Altschul, et al. Gapped Blast and Psi- ( Blast: a new generation of protein database search programs. Nucleic Acids Research 25, 3389-3402 (1997).
(7] Z. Zhang, S. Schwartz, L. Wagner, W. Miller, A greedy Algorithm for Aligning DNA Se-quences. J. Comp. 8iol., ?:1-2, 203-214 (2000).
(8] D. States, SENSEI website:
http: //statesla b.wustl.ed a /software/sensei /
(9] D.J. Lipman, W.R. Pearson, Rapid and sensi-tive protein similarity searches. Science, 227, 1435-1441 (1985).
(10] A.l.. Delcher, S. Kasif, R.D. Fleischmann, J.
Peterson, 0. White, and S.L. Salzberg, Align-ment of whole genomes. Nucleic Acids Re-search 27:11, 2369-2376 (1999)..
(11] S. Burkhardt, A. Crauser, H-P. Lenhof, E. Ri-vals, P. Ferragina, M. Vingron, q-gram based database searching using a suffix array. 3rd Ann. International Conference on Computa-tional Molecular Biology, Lyon 11-14 April 1999.
(12] S. Kurtz, C. Schleiermacher, REPuter - Fast computation of maximal repeats in com-plete genomes. Bioinformatics, 15:5, 426-427 ( 1999).
(13] X. Huang and W. Miller, A Time-efficient, Linear-Space Local Similarity Algorithm. Ad-vances in Applied Mathematics 12, 337-357 (1991).

(14) P. Indyk, R. Motwani, Approximate nearest neighbors: tovuards removing the curse of di-mensionality. Proc. 30th .Ann ACM Symp.
Theory Comput., 1998, Dallas, TX.
(15) J. Buhler, Efficient large-scale sequence com-parison by locality-sensitive hashing. Bioinfor-matics, 1?, 419-428 (2001).

Seqll SizeSeq2 SizePH PH2 MB28 Blastn M. pneumoniae828KM. genitaliurn589KlOs/65M 4s/4$NI ls/88M 47s/451~I

E. calf 4.7MH. influenza1.8M34s/78M 14s/68M 5s/561M 716s/158NI

A.thaliana 19.6MA.thaliana 17.5M5020s/279M498s/231D-421720s/1087Mo0 chr 2 chr 4 H. Sapiens 35M H. Sapiens 26.2M14512s/419M5250s/417M- 00 00 chr 22 chr 21 Figure 7: Performance Comparison: If not specified, all with match 1, mismatch -1, gap open -5, ;;ap extension -1. PH denotes PatternHunter with seed weight 11, PH2 denotes same with double lit model (sensitivity similar to Blast's single hit size 11 seed, Figure 3) MB28 denotes MegaBla,st with default; seed size 28, and default affine gap penalties. Blastn (via BL2SE(a) uses default seed size 11. Table entries under PH, PH2, MB28 and Blagtn indicate time (seconds) and space (megabytes) used; oo means out of memory or segmentation fault.
Seql Size Seq2 SizePH(9) PH(11) SENSEI

E. colt 4.7M H. influenza1.8M279s/671V134s/78M 780s/64NI

A.thaliana 19.6MA.thaliana 17.5M677m/282M84m/279M781m/415M
chr 2 chr 4 Figure 8: F'atternHunter with seed weights 9,:11, 1-hit model vs SENSEI's weight 8 seed. SEN-SEI only does ungapped alignments. PatternHunter's weight 9 spaced seed has higher single-hit sensitivity i;han SENSEI's 8 as shown in Figure 2.

Claims (23)

1. The use of one or more, pre-computed or ran-domly generated, spaced seed model, opti-mized or not optimized, for the purpose of increasing sensitivity of the homology search.
2. The use of multiple-hit extension, together with spaced seed model to increase selectiv-ity while keeping relatively high sensitivity in homology search.
3. The use of a banded hit table to efficiently find multiple hits on nearby diagonals.
4. The gapped extension technique using local hit generation with a very short spaced seed and multiple-hit extension. (Currently Pat-ternHunter a spaced seed model of weight 3, with 3 hits triggering an extension.)
5. The use of an ordered tree data structure to store recently found HSPs sorted by diagonal, allowing for fast lookup of nearby HSPs which in turn allows optimal assembly of HSPs into alignments by dynamic programming.
6. PatternHunter can trigger extension not only on multiple hits on the same diagonal, but also on multiple hits on nearby diagonals.
7. The removal of "stale" HSPs from the or-dered tree data structure to keep its size small.
HSPs are considered stale when their ending position is more than some threshold away from the current seeding position.
8. A method of comparing character strings for a best match comprising the steps of: using a k non-consecutive characters as a seed.
9. A method of comparing a query sequence to a target sequence.
10. A method of sequence comparison comprising:
a deterministic-spaced seed model.
11. The method of previous item wherein said spaced model is optimized for maximum sen-sitivity.
12. The method of searching wherein bits of a spaced seed are only extended if multiple hits occur close together on the same diagonal or close.
13. A method of sequence comparison comprising:
a spaced seed model.
14. A method of finding matches of a sequence comprising the step s of: comparing a por-tion of a desired sequence to the contents of a sequence database.
15. Let us collectively call all above methods as PatternHunter technology. We not only claim PatternHunter technology for DNA sequences (which also include genomes, chromosomes, RNA sequences, cDNAs, short and long frag-ments), but also claim this method for ho-mology search in Protein (amino acid) se-quences. We claim a method for comparing protein sequences using PatternHunter tech-nology where the spaced seeds need to be shortened for sensitivity in the amino acid level.
16. We claim a method for finding similarity for any character string such as text documents, internet files, computer programs or image data, using the PatternHunter technology.
Clearly PatternHunter technology extends to comparing any character strings beyong DNA
or protein sequences.
17. A method of internet searching.
18. A method of data mining.
19. A method of detecting plagiarised documents.
20. A method of searching for matches between two text documents.
21. A system for implementing the method of any one of above claims.
22. An apparatus, especially a computer hardware circuit that implements the spaced model and the PatternHunter technology, for executing the method of any one of above claims.
23. A membory medium stroing code executable to perform the method of any of above claims.
CA 2357263 2001-09-07 2001-09-07 New methods for faster and more sensitive homology search in dna sequences Abandoned CA2357263A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA 2357263 CA2357263A1 (en) 2001-09-07 2001-09-07 New methods for faster and more sensitive homology search in dna sequences

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CA 2357263 CA2357263A1 (en) 2001-09-07 2001-09-07 New methods for faster and more sensitive homology search in dna sequences
US10236339 US20030120429A1 (en) 2001-09-07 2002-09-06 Method and system for faster and more sensitive homology searching
CA 2402048 CA2402048A1 (en) 2001-09-07 2002-09-09 Method and system for faster and more sensitive homology searching
US11561327 US9652586B2 (en) 2001-09-07 2006-11-17 Method and system for faster and more sensitive homology searching

Publications (1)

Publication Number Publication Date
CA2357263A1 true true CA2357263A1 (en) 2003-03-07

Family

ID=4169970

Family Applications (1)

Application Number Title Priority Date Filing Date
CA 2357263 Abandoned CA2357263A1 (en) 2001-09-07 2001-09-07 New methods for faster and more sensitive homology search in dna sequences

Country Status (2)

Country Link
US (2) US20030120429A1 (en)
CA (1) CA2357263A1 (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011137368A3 (en) * 2010-04-30 2012-02-16 Life Technologies Corporation Systems and methods for analyzing nucleic acid sequences
US9268903B2 (en) 2010-07-06 2016-02-23 Life Technologies Corporation Systems and methods for sequence data alignment quality assessment
CN102453751A (en) * 2010-10-19 2012-05-16 鼎生科技(北京)有限公司 Method for DNA sequencer to reattach short sequence to genome
US20120203792A1 (en) * 2011-02-01 2012-08-09 Life Technologies Corporation Systems and methods for mapping sequence reads
US9276911B2 (en) * 2011-05-13 2016-03-01 Indiana University Research & Technology Corporation Secure and scalable mapping of human sequencing reads on hybrid clouds
US9201916B2 (en) * 2012-06-13 2015-12-01 Infosys Limited Method, system, and computer-readable medium for providing a scalable bio-informatics sequence search on cloud
US9792405B2 (en) 2013-01-17 2017-10-17 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US20170270245A1 (en) 2016-01-11 2017-09-21 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing
GB201508851D0 (en) * 2013-01-17 2015-07-01 Edico Genome Corp Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
WO2014186604A1 (en) * 2013-05-15 2014-11-20 Edico Genome Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US10068054B2 (en) 2013-01-17 2018-09-04 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US9679104B2 (en) 2013-01-17 2017-06-13 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US20140280317A1 (en) * 2013-03-15 2014-09-18 University Of Florida Research Foundation, Incorporated Efficient publish/subscribe systems
US9697327B2 (en) 2014-02-24 2017-07-04 Edico Genome Corporation Dynamic genome reference generation for improved NGS accuracy and reproducibility
US10006910B2 (en) 2014-12-18 2018-06-26 Agilome, Inc. Chemically-sensitive field effect transistors, systems, and methods for manufacturing and using the same
US9857328B2 (en) 2014-12-18 2018-01-02 Agilome, Inc. Chemically-sensitive field effect transistors, systems and methods for manufacturing and using the same
US9618474B2 (en) 2014-12-18 2017-04-11 Edico Genome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US10020300B2 (en) 2014-12-18 2018-07-10 Agilome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US9859394B2 (en) 2014-12-18 2018-01-02 Agilome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
WO2016154154A3 (en) 2015-03-23 2018-04-26 Edico Genome Corporation Method and system for genomic visualization
US10068183B1 (en) 2017-02-23 2018-09-04 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on a quantum processing platform

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6892139B2 (en) * 1999-01-29 2005-05-10 The Regents Of The University Of California Determining the functions and interactions of proteins by comparative analysis

Also Published As

Publication number Publication date Type
US9652586B2 (en) 2017-05-16 grant
US20070088510A1 (en) 2007-04-19 application
US20030120429A1 (en) 2003-06-26 application

Similar Documents

Publication Publication Date Title
Bayer et al. Scalable, behavior-based malware clustering.
Simpson et al. ABySS: a parallel assembler for short read sequence data
Promponas et al. CAST: an iterative algorithm for the complexity analysis of sequence tracts
Peterson et al. Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species
Rost Protein secondary structure prediction continues to rise
Udall et al. A global assembly of cotton ESTs
Brudno et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA
Gish et al. Identification of protein coding regions by database similarity search
Chevreux et al. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs
Schröder et al. SHREC: a short-read error correction method
Eddy Profile hidden Markov models.
Yang et al. A survey of error-correction methods for next-generation sequencing
Rigoutsos et al. Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm.
Boratyn et al. Domain enhanced lookup time accelerated BLAST
Price et al. De novo identification of repeat families in large genomes
Chou et al. DNA sequence quality trimming and vector removal
Pavesi et al. An algorithm for finding signals of unknown length in DNA sequences
Lomsadze et al. Gene identification in novel eukaryotic genomes by self-training algorithm
Wasmuth et al. prot4EST: translating expressed sequence tags from neglected genomes
Uzilov et al. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change
Brent Genome annotation past, present, and future: how to define an ORF at each locus
A. Schäffer et al. IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices
Saha et al. Empirical comparison of ab initio repeat finding programs
Orengo et al. Bioinformatics: genes, proteins and computers
Bystroff et al. HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins1

Legal Events

Date Code Title Description
FZDE Dead