WO2003081507A1 - Method for filtering nucleic or proteic sequence data - Google Patents

Method for filtering nucleic or proteic sequence data Download PDF

Info

Publication number
WO2003081507A1
WO2003081507A1 PCT/IB2002/002019 IB0202019W WO03081507A1 WO 2003081507 A1 WO2003081507 A1 WO 2003081507A1 IB 0202019 W IB0202019 W IB 0202019W WO 03081507 A1 WO03081507 A1 WO 03081507A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
sequences
gram
count
hht
Prior art date
Application number
PCT/IB2002/002019
Other languages
French (fr)
Inventor
Eric Glemet
Original Assignee
Gene-It
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gene-It filed Critical Gene-It
Priority to AU2002309085A priority Critical patent/AU2002309085A1/en
Priority to PCT/IB2002/002019 priority patent/WO2003081507A1/en
Publication of WO2003081507A1 publication Critical patent/WO2003081507A1/en

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search

Definitions

  • the present invention generally relates to nucleic and proteic sequence comparison and classification systems and methods. Background of the invention a) Generalities '
  • Nucleic and proteic sequences comparison and classification is one of the oldest and more persistent problem in computational biology. Many metrics and similarities have been defined according to the biological underlying problem.
  • the Initial Clustering Similarity is the basic similarity used in most systems and methods performing initial clustering of sequences.
  • Figure 1 of the appended drawings shows five ESTs Si to S 5 derived from a messenger RNA sequence designated as mRNA, with supposedly overlapping regions OR1 to OR4 shown in dark.
  • two ESTs Si and S 2 are considered similar according to the ICS criterion if they share a window of length L with at least a percentage P of identical members.
  • the window length is generally chosen equal to about 150 and the percentage P equal to about 96%.
  • the user can adapt the length L and the percentage P as a function of the particularities of the set of sequences to be processed.
  • d2_cluster A Validated Method for Clustering EST and Full- Length cDNA Sequences
  • this known method has a number of drawbacks, and the teachings of this paper have a number of limitations.
  • this method is limited to the comparison between sequences inside a single EST set. This means that it cannot be used for detecting similarities between sequences belonging to two different sets, which is a strong limitation in view of the current expectations from the market.
  • the d2 algorithm is devoted to compute a measure between two biological sequences and is hardly related to a similarity detection based on a window of length L and a percentage of identities P.
  • the one skilled in the art would not be capable of practically implementing the d2 algorithm for performing (L, P) comparisons.
  • this known method use of a heavy indexing structure, i.e. a suffix array, which requires to be preprocessed on the whole starting database and cannot be practically performed "on the run".
  • a suffix array is valid only as long as the database contents does not change, which is not realistic: as this heavy preprocessing would have to be done each time the database is changed, then the computer system would spend most of its processing time, or in any case a huge part of its resources, for building an updated suffix array.
  • Another drawback of this known system is that, when it is applied to a very long sequence (typically a genome) split into smaller blocks, then a predetermined, longest possible overlap between blocks has to be provided, so as to avoid in any case that an actually pertinent sequence might remain undetected as being split among two consecutive blocks (this will be explained in greater detail in the description).
  • a first object of the present invention is to provide a method for detecting similar sequences of characters such as nucleic or proteic sequences which can operate on a sequence-against-sequence basis while benefiting from significant gain in terms of processing time and required memory space based on a block processing of a plurality of sequences.
  • Another object of the present invention is to provide a method that does not rely upon a preprocessed heavy suffix array or equivalent, i.e. that can be directly implemented on any selected set of sequences, and even on a global database, provided that enough working memory space is available, and that can be run efficiently even if changes occur in the contents of the database.
  • Another object of the present invention is to allow a practical and efficient implementation of a clustering method involving a similarity determination based on the existence in a sequence pair of a common window of length L with a percentage P of identical characters.
  • Still another object of the present invention is to decrease as far as possible the number of required elementary filtering operations by benefiting from the properties of cluster merging.
  • Another object of the present • invention when dealing with long sequences (such as genomic sequences) split into blocks, is to allow an adjustment of the overlap between blocks to decrease the number of elementary filtering operations to what is strictly needed and to avoid double matching report when there is only one pertinent sub-sequence in the long sequence.
  • the present invention provides a method for filtering a plurality of nucleic or proteic sequences Ei, ..., E e (SetE) and/or Si, ..., S s (SetS) for the purpose of subsequent sequence similarity determination, each position in each sequence belonging to a predetermined alphabet __, and said sequences being stored in a computer memory, the method comprising the following steps:
  • steps (e) to (i) are performed for a plurality of second sequences (S y ) with use of the same count sets (HHT[i]) as constructed and stored in step (d), so that the processing resources and memory space used for count sets HHT[i] are saved.
  • said second sequence S y belongs to a set of sequences Si, ..., S s (SetS) which is identical to said set of first sequences Ei, ..., E e
  • steps (g2), (h) and (i) are omitted for a first sequence E z which is identical to the second sequence S y and for any first sequence E z which has already been used as second sequence S y in steps (g2), (h) and (i). This avoids useless work consisting in filtering a given sequence against itself of against another sequence when said another sequence has already been filtered against said given sequence.
  • the present invention provides a method for determining the similarities between a plurality of nucleic or proteic sequences Ei, ..., E e (SetE) and/or Si, ..., S s (SetS), each position in each sequence belonging to a predetermined alphabet __, and said sequences being stored in a computer memory, the method comprising the following steps:
  • steps (e) to (j) are performed for a plurality of second sequences (S y ) with use of the same count sets (HHT[i]) as constructed and stored in step (d).
  • said similarity determination comprises the following steps:
  • said plurality of said first sequences E z may be constituted by overlapping blocks B j of the long sequence G, and wherein, after at least two steps (i) have accepted as candidates a given second sequence S and at least two first sequences B j , B j+lj ... which are contiguous blocks in long sequence G, step (j) is performed on the second sequence S and to a supersequence SB constituted by the first sequences B j , B j+ i, ... re-concatenated with elimination of overlap L.
  • the method comprises an initial step in which a user selects a sub-sequence length L for similarity determination, and a step of constructing said overlapping blocks with an overlap length L equal to the sub-sequence length.
  • the present invention provides a method for clustering a plurality of nucleic or proteic sequences Ei, ..., E e (SetE) and/or Si, ..., S s (SetS), each position in each sequence belonging to a predetermined alphabet _ , said sequences being stored in a computer memory, the method comprising the following steps: (a) determining a word length q having a value substantially lower than the number of characters in said sequences, so as to define from this word length and from the alphabet __ a set of possible words (q-grams) Wj having length q,
  • step (d) constructing and storing in said working memory, in association with each q-gram w,, a respective count set HHT[i] containing, for each member of the count set, an identifier z of a first sequence E z having at least one occurrence of the q-gram and an first occurrence count k(i,z) of the q-gram in sequence E z , as determined in step (c),
  • Steps (e) to (k) are preferably performed for a plurality of second sequences (S y ) with use of the same count sets (HHT[i]) as constructed and stored in step (d), so that the processing resources and memory space used for count sets HHT[i] are saved.
  • this method additionally comprises the following steps: (1) identifying possible other sequences which where previously sharing a common cluster attribute with either of the sequences determined as similar, and
  • the method further comprises, before step (j), the steps of checking whether the similarity candidates have a same cluster attribute, if not, performing step (j), if so, skipping step (j) and passing to the next first sequence E z or second sequence S y .
  • step (j) the steps of checking whether the similarity candidates have a same cluster attribute, if not, performing step (j), if so, skipping step (j) and passing to the next first sequence E z or second sequence S y .
  • Figure 1 illustrates a plurality of sequences derived from a mRNA strand and showing similarity windows
  • Figure 2 illustrates in greater detail the existence of a similarity window between two sequences
  • Figure 3 shows the structure and contents of an exemplary hash table used in the filtering method of the present invention
  • Figures 4a and 4b illustrate a chunk mode of filtering according to the present invention, in bank-against-bank and in bank-against-itself modes, respectively,
  • Figure 5 shows the structure and contents of a huge hash table used in the filtering method of the present invention
  • Figure 6 shows the main steps of the filtering method of the present invention based on the tables shown in Figure 3 and Figure 5,
  • FIG. 7 illustrates the basic operation of the present invention for clustering sequences
  • Figure 8 illustrates the cutting of a long sequence into blocks to allow filtering according to the present invention
  • FIG. 9 illustrates an optional step of the present invention in the context of Figure 8.
  • Figure 10 illustrates overlap issues when cutting a long sequence.
  • Si and S 2 be two sequences of characters, each character belonging to an alphabet __.
  • the Hamming distance between Si and S 2 denoted H(S ⁇ , S 2 ), is the minimal number of substitutions needed to transform Si into S 2 .
  • G q (S) is the set of q-grams (words of length q) of S defined as :
  • G q (S) ⁇ W J - Si...s i+q _ ⁇ , 1 ⁇ i ⁇ n-q-1 ⁇
  • this set is empty when n ⁇ q, i.e. when the sequence length is shorter than the q-gram length.
  • ICS(S ⁇ , S 2 ) The Initial Clustering Similarity between two sequences Si and S 2 , denoted ICS(S ⁇ , S 2 ), is defined as :
  • ICS L (S 1 , S2) min ⁇ (H(w ⁇ , w 2 )
  • a suitable algorithm is required for verifying whether two ESTs Si and S 2 are k-similar with respect to a given window size L.
  • a suitable filtering method is based on q-grams filtering.
  • a q-gram is a merely a sequence or "word" of length q.
  • the filtering method consists in counting the number of common q-grams between two sequence sequences, i.e. the number of q- grams that are present in both sequences.
  • N q is determined using the following lemma :
  • Si and S 2 be k-similar sequences, i.e.
  • a hash table HT q (S) associates to every q- grams in a sequence S its number of occurrences, i.e. its multiplicity, in said sequence.
  • the value q is selected small enough to consider a complete hash table without redundancy.
  • the complexity of such hash table is of the order 0(
  • the size of the computer memory for storing such hash table is equal to ⁇ q , where ⁇ is the size of the alphabet _ .
  • the filtering operation on two sequences Si and S 2 is done by the following steps :
  • an initialization of the hash table for sequence Si is done for once, at the beginning of the whole process, with an order of complexity equal to 0( ⁇ q ) ; a second table I q (S ⁇ ) also having an order of complexity 0( ⁇ q ) is also initialized; this table is intended to store a list of indices of the hash table HT q (S ⁇ ), i.e. the positions in table HT q (S ⁇ ) where there is a strictly positive count value;
  • sequence Si is parsed and all its q-grams are considered in sequence ; each time a new q-gram, hashed to index i, is discovered, index i is added in the list of positions I q (S ⁇ ) ; the contents of the cell of the hash table corresponding to position i, denoted HT q (S ⁇ )[i], is incremented so as to indicate that another such q-gram has been identified in sequence Si ; it should be noted here that this step (ii) has an order of complexity of 0(
  • Steps (i) and (ii) are repeated for the second sequence S 2 , so as to obtain
  • HT q (S 2 ) storing the multiplicities of the q-grams of s 2 ;
  • - a table I q (S 2 ) that stores the marked indices in HT q (S 2 ) ;
  • a variable count of a counter is then initialized to 0 ; this variable count aims at storing the number of common q-grams between sequences Si and S 2 ; then, for each index i contained in I q (S ⁇ ) which is also present in I q (S 2 ), the variable count is incremented with the smallest of values HT q (S])[i] and HT q (S 2 )[i] ; at the end of this process, the value of count is the number of common q-grams between sequences Si and S 2 ; the time complexity of this step is equal to 0(
  • the above filter steps have a time and space complexity 0( ⁇ q ) for the pre-processing time and a complexity 0(
  • FIG. 3 A simplified example of the above filtering process is shown in Figure 3 of the appended drawings.
  • An exemplary EST sequence Si is equal to "ACGTACACG".
  • the value of q is 2. All sixteen possible q-grams are shown in the Figure.
  • the contents of the one-dimension table HT 2 (S ⁇ ) is also shown, as well as the contents of table l 2 (S ⁇ ).
  • This is a variable length, one-dimension table indicating as a list of the positions of hash table HT 2 (S ⁇ ) having a non-zero value. For instance, the contents of memory cell HT 2 (S ⁇ )[5] is "1" in the shown example.
  • a first "basic” approach to this problem would be to use the above- described filter in sequence for each pair (S;, E j ).
  • K s
  • PF (e.
  • + K e ) e.K s + s.K e at running time, using each time 0( ⁇ q ) memory space.
  • An improvement to this basic approach would consist in performing the hash processing for once on each sequence E j of sequence set SetE, and then to compare each S; of sequence set SetS against it (or conversely hashing each S ; for once, and comparing each E j against it).
  • An application of the present invention is the clustering of a large bank of ESTs against itself or against another bank of ESTs. Typically, such sets can be as large as 3 billions of sequences.
  • the above-described improvement would indeed accelerate pairwise comparisons, but its worst case complexity PF would also be its average complexity, since every new sequence Sj is hashed every time, which costs 0(jSi
  • an improvement according to an aspect of the present invention consists in performing filtering between an EST Sj belonging to a sequence set SetS containing s ESTs Si, S 2 , S 3 , ..., S s and a preprocessed set SetE containing e ESTs Ei, E_, E 3 , ..., E e .
  • the set SetE is called a "chunk" in the following.
  • Figure 4a diagrammatically illustrates the whole comparison workflow, where it can be seen that a first bank Bi of sequences Ej, E 2 , ... E e is subdivided into a plurality of chunks SetEj, SetE 2 , etc., while each sequence Si, S 2 , ..., S s in a second bank B 2 is individually used as a query sequence against each chunk, as shown.
  • the multiplicities of q-grams of the sequences contained in a given chunk SetE are practically determined at the chunk level by building in a working memory a Huge Hash Table HHT.
  • hash table associates to each possible q-gram a list of the ESTs contained in SetE in which this q-gram appears, together with its number of occurrences (i.e. its multiplicity) inside each concerned sequence.
  • the size of this table HHT is equal at most to e. ⁇ q , where e is the number of sequences in chunk SetE.
  • a sequence S is compared to the contents of the HHT table by using a one-dimension table D having a size e, where e is the number of sequences in chunk SetE, each cell being initialized to 0 everywhere.
  • the sequence S is parsed and the hash table HT q (S) and index table I q (S) as described in the foregoing are built.
  • the process goes through the list of couples of values (z, k(i, z)) contained in a portion HHT[i] of table HHT, where z is the order the sequence in chunk SetE (i.e.
  • the loading of cells D[z] in table D is incremential, i.e. as soon as a new value is to be loaded in a given cell D[z], it is added to the value previously stored in the same cell.
  • each value D[z] in table D represents the number of common q-grams between sequences S and E z , for all values of z comprised between 1 and e.
  • table HHT for the chunk SetE appears in Figure 5.
  • a memory cell HHT[i, z] is filled in table portion HHT[i] each time a sequence E z having at least one such 2-gram inside it.
  • the 2-gram "AA” is found three times in sequence E2 and once in sequence E4, so that two corresponding cells are filled under the "AA" 2-gram position.
  • sequences S1-S4 are identical to sequences E1-E4, respectively, which reflects a "bank-against-itself filtering.
  • table HHT has been built as described above
  • sequence S j is compared to table HHT in Step 1.
  • hash table HT (S ⁇ ) and index table I 2 (S ⁇ ) are built.
  • the sequence Si being the same as for the example shown in Figure 3, these tables are identical to those shown in Figure 3.
  • table D will have the following contents: ⁇ X, 2, 2, 3 ⁇ (X means that the contents is not significant, as it would correspond to a sequence-against-itself filtering).
  • S 3 E 3 is then used as query sequence in step 3 and will lead to a table D having the following contents: ⁇ X, X, X, 1 ⁇ .
  • Figure 6 also illustrates an application of the above-described filtering process to the clustering of sequences Si, ..., S 4 here again considered as identical to sequences Ei, ..., E 4 . It is supposed that each of these sequences originally belongs to a different original cluster, respectively CLj, CL 2 , CL and CL 4 , as illustrated in Figure 6.
  • N q i.e. the threshold for the filtering process
  • every cell of every table D which will contain a number at least equal to 2 will reveal that the considered sequence couple passes the filter.
  • the k-similarity determination process is based on the well known "BLAST” method, which is described for instance in the paper "Basic Local Alignment Search Tool", cf. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool", Journal of Molecular Biology, 215:403-410, 1990).
  • the present invention preferably uses a slightly modified implementation of BLAST method which is known per se: practically, the process comprises performing the BLAST method among the sequence pairs which have passed the q-gram filtering, so as to provide a fast search of the longest identity regions shared by each sequence pair. A mere checking of the BLAST output is then performed so as to isolate, in such regions, at least one region which satisfies the ICS determination with given parameters L and P (cf. supra).
  • Figure 7 of the appended drawings illustrates the global mechanism for the comparison of two ESTs S and E belonging respectively to two different sets of sequences SetS and SetE, where each set initially forms a respective cluster of sequences.
  • sequences S and E are subjected to a clustering process in two main steps:
  • a so-called “single link” clustering is performed, i.e. if two ESTs already belong to the same cluster by transitivity, the actual similarity test between both is not performed.
  • the present invention can easily be adapted to map ESTs on a genome G.
  • the basic principle is to split the genome G in blocks having a given length d that overlap with o bases, and to consider all these blocks as the queries for the filtering process.
  • Figure 8 shows such a split chromosome.
  • the overlap is chosen at least equal to the window length L selected by the operator at the beginning of the process and used in the FWSB k- similarity process. This will be further explained in the following.
  • the genome G is firstly split into blocks Bi, ..., B m of length d (except last block B m which may have a different length, d being chosen greater than the length of the longest EST in the set of sequences SetS to which genome G is to be compared, with an overlap of L bases, L being the length of the similarity window;
  • a huge hash table HHT is then built for the set of blocks Bi, ..., B m obtained by this process; - all the ESTs of SetS are individually compared to the blocks Bi, ..., B m by parsing table HHT as described above; - however, an important feature of the inventive process is that, if a given EST Sj of SetS passes through the filter with at least two contiguous blocks, e.g. with three blocks B j , B j+ i and B j+2 , then the k-similarity determination is performed between EST Si and a "superblock" SB which is the concatenation said contiguous blocks, i.e. B j , B j -n and B j+2 in the present species, with elimination of the overlap(s);
  • a genome can be split into blocks "on the run" with an overlap preferably equal to the window length L selected by the user as a parameter, which allows different, optimal genome splittings depending on the value of L; - secondly, by concatenating again the blocks which have passed the filtering process with a given query sequence, as said blocks initially were, erroneous double matchings are avoided, as matching portions located at the overlap between two blocks will be merged again into a single matching portion.
  • filtering and similarity processes can be easily parallelized in a multi-processor environment, whether in a shared or distributed memory architecture.
  • Table II indicates the speed-up achieved by parallel processing in the practical example of a set of 100000 ESTs compared against itself, with each EST having a length greater than 800 and the average length being 927.
  • the computer environment was an AlphaServer ES40/466 with four 666 MHz processors, as available at Compaq.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides filtering of nucleic or proteic sequences for the purpose of similarity determination. The subsequent similarity determination is used e.g. for sequence clustering.

Description

METHOD FOR FILTERING NUCLEIC OR PROTEIC SEQUENCE DATA
The present invention generally relates to nucleic and proteic sequence comparison and classification systems and methods. Background of the invention a) Generalities '
Nucleic and proteic sequences comparison and classification is one of the oldest and more persistent problem in computational biology. Many metrics and similarities have been defined according to the biological underlying problem.
The initial comparison of expressed sequence tag (hereinafter ""EST" or "sequence" or "sequence data") sequences for the purpose of generating "clusters" of sequences having a given degree of similarity is based on the fact that such sequences are suppose to strongly overlap.
The Initial Clustering Similarity (ICS) is the basic similarity used in most systems and methods performing initial clustering of sequences. Figure 1 of the appended drawings shows five ESTs Si to S5 derived from a messenger RNA sequence designated as mRNA, with supposedly overlapping regions OR1 to OR4 shown in dark. Turning now to Figure 2, two ESTs Si and S2 are considered similar according to the ICS criterion if they share a window of length L with at least a percentage P of identical members. Typically, if ESTs having each a length about 300- 400 nucleic base pairs is considered, the window length is generally chosen equal to about 150 and the percentage P equal to about 96%. In most known systems, the user can adapt the length L and the percentage P as a function of the particularities of the set of sequences to be processed.
A previous EST clustering method known as "d2" is the subject matter of paper "d2_cluster: A Validated Method for Clustering EST and Full- Length cDNA Sequences", J.Burke et al., Genome Research, 9(11):1135-1142, 1999).
However, this known method has a number of drawbacks, and the teachings of this paper have a number of limitations. First of all, this method is limited to the comparison between sequences inside a single EST set. This means that it cannot be used for detecting similarities between sequences belonging to two different sets, which is a strong limitation in view of the current expectations from the market.
In addition, this prior document is far from being clear concerning how to determine whether two ESTs are similar or not. More particularly, the authors of this document assert that the method is capable of verifying a possible overlap of length L with a certain percentage of identities P using the d2 method, and refer in this regard to a prior paper "Biological evaluation of d2, an algorithm for high-performance sequence comparison". . Hide et al. Journal of Computational Biology, 1 :199- 215, 1994.
However, in this prior paper, the d2 algorithm is devoted to compute a measure between two biological sequences and is hardly related to a similarity detection based on a window of length L and a percentage of identities P. As a consequence, the one skilled in the art would not be capable of practically implementing the d2 algorithm for performing (L, P) comparisons.
Last but not least, it can be in any case understood from the d2 prior art that the actual complexity of the computations will proportional to the square of the sum of all the lengths of the sequences contained in the starting set, which leads to impracticable execution times even with state of the art computer systems .
Therefore, these prior art references do not provide or even suggest a practical (L, P) similarity determination.
Besides, and turning now to issues of computing capabilities, attempts have been made recently for decreasing the computing work required for pair-to-pair comparison of sequences. The paper "q-gram Based Database Searching Using a Suffix Array (QUASAR)", S. Burkhardt et al. RECOMB 99, Third Annual International Conference on Computational Molecular Biology discloses a method for determining similarity between a query sequence and sequences contained in a database based on the principle of counting occurrences of short words (hereinafter "q-grams", i.e. words having a length of q characters) in the query sequence and in the database, and detecting similarity if the count of q-grams in common exceeds a limit. This known method, although it allows much faster processing and requires less working memory than other algorithms such as basic local alignment algorithms, however has a number of drawbacks.
First of all, this known method use of a heavy indexing structure, i.e. a suffix array, which requires to be preprocessed on the whole starting database and cannot be practically performed "on the run". This means that the suffix array is valid only as long as the database contents does not change, which is not realistic: as this heavy preprocessing would have to be done each time the database is changed, then the computer system would spend most of its processing time, or in any case a huge part of its resources, for building an updated suffix array.
Another drawback of this known system is that, when it is applied to a very long sequence (typically a genome) split into smaller blocks, then a predetermined, longest possible overlap between blocks has to be provided, so as to avoid in any case that an actually pertinent sequence might remain undetected as being split among two consecutive blocks (this will be explained in greater detail in the description).
However, the systematic use of a long overlap, and in particular an overlap much longer than necessary, will result in: (a) performing many unnecessary q-gram count comparison against blocks, (b) generating two different matchings while there is only one occurrence of a pertinent sequence, as this pertinent sequence, if contained in the overlap, will appear in two different blocks. Summary of the invention
A first object of the present invention is to provide a method for detecting similar sequences of characters such as nucleic or proteic sequences which can operate on a sequence-against-sequence basis while benefiting from significant gain in terms of processing time and required memory space based on a block processing of a plurality of sequences.
Another object of the present invention is to provide a method that does not rely upon a preprocessed heavy suffix array or equivalent, i.e. that can be directly implemented on any selected set of sequences, and even on a global database, provided that enough working memory space is available, and that can be run efficiently even if changes occur in the contents of the database.
Another object of the present invention is to allow a practical and efficient implementation of a clustering method involving a similarity determination based on the existence in a sequence pair of a common window of length L with a percentage P of identical characters.
Still another object of the present invention, is to decrease as far as possible the number of required elementary filtering operations by benefiting from the properties of cluster merging.
Another object of the present invention, when dealing with long sequences (such as genomic sequences) split into blocks, is to allow an adjustment of the overlap between blocks to decrease the number of elementary filtering operations to what is strictly needed and to avoid double matching report when there is only one pertinent sub-sequence in the long sequence.
According to a first aspect, the present invention provides a method for filtering a plurality of nucleic or proteic sequences Ei, ..., Ee (SetE) and/or Si, ..., Ss (SetS) for the purpose of subsequent sequence similarity determination, each position in each sequence belonging to a predetermined alphabet __, and said sequences being stored in a computer memory, the method comprising the following steps:
(a) determining a word length q having a value substantially lower than the number of characters in said sequences, so as to define from this word length and from the alphabet __ a set of possible words (q-grams) wj having length q,
(b) storing in a working memory a plurality of first sequences
{Ez},
(c) for each q-gram Wj and for each sequence Ez, counting the number of occurrences of the q-gram in the sequence, (d) constructing and storing in said working memory, in association with each q-gram wi5 a respective count set HHT[i] containing, for each member of the count set, an identifier z of a first sequence Ez having at least one occurrence of the q-gram and an first occurrence count k(i,z) of the q-gram in sequence Ez, as determined in step (c),
(e) counting the number of occurrences of each q-gram wj in a second sequence Sy for generating second occurrence counts HTq(Sy)[i], (f) storing in the working memory the second occurrence counts HTq(Sy)[i] as determined in step (e) as well as a set Iq(Sy) of identifiers i of q-grams W; for which the counts HTq(Sy)[i] are strictly positive,
(g) for each q-gram identifier i contained in set Iq(Sy): (gl) reading the count set HHT[i] constructed in step (d) in relation with the q-gram Wj,
(g2) for each sequence Ez identified in count set HHT[i], determining the minimum value between the corresponding first occurrence count k(i,z) and the second occurrence count HTq(Sy)[i],
(h) for each sequence Ez, determining and storing a common q- gram score D[z] with said second sequence Sy based on the minimum values determined in step (g2) for said first sequence, and
(i) rejecting sequence Ez and sequence Sy as similarity candidates for each sequence Ez for which score D[z] is lower than a given threshold value.
Advantageously, steps (e) to (i) are performed for a plurality of second sequences (Sy) with use of the same count sets (HHT[i]) as constructed and stored in step (d), so that the processing resources and memory space used for count sets HHT[i] are saved.
Preferably, when said second sequence Sy belongs to a set of sequences Si, ..., Ss (SetS) which is identical to said set of first sequences Ei, ..., Ee
(SetE), then steps (g2), (h) and (i) are omitted for a first sequence Ez which is identical to the second sequence Sy and for any first sequence Ez which has already been used as second sequence Sy in steps (g2), (h) and (i). This avoids useless work consisting in filtering a given sequence against itself of against another sequence when said another sequence has already been filtered against said given sequence.
According to a second aspect, the present invention provides a method for determining the similarities between a plurality of nucleic or proteic sequences Ei, ..., Ee (SetE) and/or Si, ..., Ss (SetS), each position in each sequence belonging to a predetermined alphabet __, and said sequences being stored in a computer memory, the method comprising the following steps:
(a) determining a word length q having a value substantially lower than the number of characters in said sequences, so as to define from this word length and from the alphabet __ a set of possible words (q-grams) W; having length q, (b) storing in a working memory a plurality of first sequences
(E ,
(c) for each q-gram Wj and for each sequence Ez, counting the number of occurrences of the q-gram in the sequence,
(d) constructing and storing in said working memory, in association with each q-gram Wj, a respective count set HHT[i] containing, for each member of the count set, an identifier z of a first sequence Ez having at least one occurrence of the q-gram and an first occurrence count k(i,z) of the q-gram in sequence Ez, as determined in step (c), (e) counting the number of occurrences of each q-gram Wj in a second sequence Sy for generating second occurrence counts HTq(Sy)[i], (f) storing in the working memory the second occurrence counts HTq(Sy)[i] as determined in step (e) as well as a set Iq(Sy) of identifiers i of q-grams w; for which the counts HTq(Sy)[i] are strictly positive,
(g) for each q-gram identifier i contained in set Iq(Sy): (gl) reading the count set HHT[i] constructed in step (d) in relation with the q-gram W;,
(g2) for each sequence Ez identified in count set HHT[i], determining the minimum value between the corresponding first occurrence count k(i,z) and the second occurrence count HTq(Sy)[i],
(h) for each sequence Ez, determining and storing a common q- gram score D[z] with said second sequence Sy based on the minimum values determined in step (g2) for said first sequence,
(i) accepting sequence Ez and sequence Sy as similarity candidates for each sequence Ez for which the common q-gram score D[z] is higher than a given threshold value, and
(j) for each accepted pair of sequences (Ez, Sy), performing a similarity determination.
It is also preferred here that steps (e) to (j) are performed for a plurality of second sequences (Sy) with use of the same count sets (HHT[i]) as constructed and stored in step (d).
In a preferred embodiment, said similarity determination comprises the following steps:
(k) determining a sub-sequence length L lower than the lengths of the sequences Ez, Sy, (1) determining a percentage P of identities,
(m) determining whether the two sequences Ez, Sy share a common sub-sequence having length L and having a percentage P of identical characters.
In the case of very long sequences such as genomes, said plurality of said first sequences Ez may be constituted by overlapping blocks Bj of the long sequence G, and wherein, after at least two steps (i) have accepted as candidates a given second sequence S and at least two first sequences Bj, Bj+lj ... which are contiguous blocks in long sequence G, step (j) is performed on the second sequence S and to a supersequence SB constituted by the first sequences Bj, Bj+i, ... re-concatenated with elimination of overlap L.
According to another preferred feature, the method comprises an initial step in which a user selects a sub-sequence length L for similarity determination, and a step of constructing said overlapping blocks with an overlap length L equal to the sub-sequence length.
According to a further aspect, the present invention provides a method for clustering a plurality of nucleic or proteic sequences Ei, ..., Ee (SetE) and/or Si, ..., Ss (SetS), each position in each sequence belonging to a predetermined alphabet _ , said sequences being stored in a computer memory, the method comprising the following steps: (a) determining a word length q having a value substantially lower than the number of characters in said sequences, so as to define from this word length and from the alphabet __ a set of possible words (q-grams) Wj having length q,
(b) storing in a working memory a plurality of first sequences
(E2), (c) for each q-gram w; and for each sequence Ez, counting the number of occurrences of the q-gram in the sequence,
(d) constructing and storing in said working memory, in association with each q-gram w,, a respective count set HHT[i] containing, for each member of the count set, an identifier z of a first sequence Ez having at least one occurrence of the q-gram and an first occurrence count k(i,z) of the q-gram in sequence Ez, as determined in step (c),
(e) counting the number of occurrences of each q-gram Wj in a second sequence Sy for generating second occurrence counts HTq(Sy)[i], (f) storing in the working memory the second occurrence counts
HTq(Sy)[i] as determined in step (e) as well as a set Iq(Sy) of identifiers i of q-grams wj for which the counts HTq(Sy)[i] are strictly positive, (g) for each q-gram identifier i contained in set Iq(Sy):
(gl) reading the count set HHT[i] constructed in step (d) in relation with the q-gram $,
(g2) for each sequence Ez identified in count set HHT[i], determining the minimum value between the corresponding first occurrence count k(i,z) and the second occurrence count
HTq(Sy)[i], (h) for each sequence Ez, determining and storing a common q- gram score D[z] with said second sequence Sy based on the minimum values determined in step (g2) for said first sequence, (i) accepting sequence Ez and sequence Sy as similarity candidates for each sequence Ez for which said common word count is higher than a given threshold value,
(j) for each accepted pair of sequences (Ez, Sy), performing a similarity determination, and
(k) for each pair of sequences (Ez, Sy) determined as similar, attaching to said sequences a common cluster attribute.
Steps (e) to (k) are preferably performed for a plurality of second sequences (Sy) with use of the same count sets (HHT[i]) as constructed and stored in step (d), so that the processing resources and memory space used for count sets HHT[i] are saved.
Advantageously, this method additionally comprises the following steps: (1) identifying possible other sequences which where previously sharing a common cluster attribute with either of the sequences determined as similar, and
(m) attaching to said other sequences said common cluster attribute. further comprising the following steps:
In addition, the method further comprises, before step (j), the steps of checking whether the similarity candidates have a same cluster attribute, if not, performing step (j), if so, skipping step (j) and passing to the next first sequence Ez or second sequence Sy. This avoid useless similarity determination e.g. when the purpose of the method is to merge two clusters as soon as two sequences respectively belonging to these two clusters are reported as similar.
Brief description of the drawings
Other aims, aspects and advantages of the present invention will appear more clearly from the following detailed description of a preferred embodiment thereof, given by way of example only and made with reference to the appended drawings, in which:
Figure 1 illustrates a plurality of sequences derived from a mRNA strand and showing similarity windows,
Figure 2 illustrates in greater detail the existence of a similarity window between two sequences,
Figure 3 shows the structure and contents of an exemplary hash table used in the filtering method of the present invention,
Figures 4a and 4b illustrate a chunk mode of filtering according to the present invention, in bank-against-bank and in bank-against-itself modes, respectively,
Figure 5 shows the structure and contents of a huge hash table used in the filtering method of the present invention, and Figure 6 shows the main steps of the filtering method of the present invention based on the tables shown in Figure 3 and Figure 5,
Figure 7 illustrates the basic operation of the present invention for clustering sequences,
Figure 8 illustrates the cutting of a long sequence into blocks to allow filtering according to the present invention,
Figure 9 illustrates an optional step of the present invention in the context of Figure 8, and
Figure 10 illustrates overlap issues when cutting a long sequence.
Detailed description of the preferred embodiment
a) Definitions
First of all, a more formal definition of the above-mentioned Initial Clustering Similarity (ICS) between two sequences of characters such as two ESTs SI and S2 will be provided. The so-called "Hamming distance" between two sequences will first be defined, and then the similarity measure based on such distance. - Definition 1 : Hamming distance
Let Si and S2 be two sequences of characters, each character belonging to an alphabet __. The Hamming distance between Si and S2, denoted H(Sι, S2), is the minimal number of substitutions needed to transform Si into S2.
- Definition 2 : Gq(S)
Let S be a sequence of n characters SιS2s3....sn belonging to __. Gq(S) is the set of q-grams (words of length q) of S defined as :
Gq(S) = {WJ - Si...si+q_ι, 1 < i < n-q-1 }
(Of course, this set is empty when n < q, i.e. when the sequence length is shorter than the q-gram length.
- Definition 3 : Initial Clustering Similarity
The Initial Clustering Similarity between two sequences Si and S2, denoted ICS(Sι, S2), is defined as :
ICSL(S 1 , S2) = min{(H(wι, w2) | Wj e GL(Sι) and w2 e GL(S2), ∞}
- Definition 4 : k-similarity
Two ESTs SI and S2 are k-similar if :
ICSL(S,, S_) ≤ k
Two ESTs are similar with a percentage P of identities if : ICSL(S,, S2) < I L*(100 - P)/100|
b) Similarity determination
A suitable algorithm is required for verifying whether two ESTs Si and S2 are k-similar with respect to a given window size L.
The theoretical complexity of this problem is undetermined. To the knowledge of the applicant, the only complexity bound concerns the particular case when the length L is equal to the length |Sι| of sequence Si or to the length |S2| of sequence S2. The reason is that, in such case, the algorithm merely has to search one sequence (the shortest) in the other sequence (the longest), allowing at most k mismatches. It has been shown in literature that the worst case complexity in such case is :
O(n^klogk)
From a practical point of view, it may be appropriate to use an algorithm that has a bad worst case complexity but a good complexity for average cases. Moreover, as will be described later, the fact that set of a plurality of sequences Si, S2, ..., Sn is to be compared with another set of a plurality of sequences has to be taken into account when designing the algorithm.
c) Common q-grams filtering Verifying the similarity between two ESTs is expensive in complexity, both theoretically and practically. To decrease the number of elementary operations, a filtering operation is first performed on the two ESTs to determine that they could verify the similarity. In such case, a possibly significant decrease in complexity can be obtained, and relies on (a) the efficiency of the filter and (b) the good practical time complexity of the filtering algorithm.
When the above-mentioned percentage P is large (typically greater than 90 %, which covers the large majority of cases of interest), a suitable filtering method is based on q-grams filtering. As mentioned hereinabove, a q-gram is a merely a sequence or "word" of length q.
Practically, the filtering method consists in counting the number of common q-grams between two sequence sequences, i.e. the number of q- grams that are present in both sequences.
An important relation between filtering and k-similarity determination is that two sequences that could be k-similar necessarily share at least Nq q-grams, where Nq is determined using the following lemma :
Lemma 1
Let Si and S2 be k-similar sequences, i.e.
ICSL(S,, S2) < k In such circumstances, sequences Si and S2 share at least Nq q-grams, where Nq = (L + 1 - (k + l).q) The filter will therefore be useful each time Nq is a positive number. If q is fixed, this means that :
(L + l - (k + l)*q) > 0 i.e.: k < (L + 1 - q)/q i.e.:
P > (100*(L + l)*(q - 1))/L*q according to Definition 4.
Practically, if q-grams having a length q of 10 characters are considered, and if L is large, then the above inequation leads to P > 90 %, a condition for which the filter will be efficient for all practical cases.
d) Multiplicities
In order to filter the two ESTs Si and S2, their common q-grams are counted by using hash tables. A hash table HTq(S) associates to every q- grams in a sequence S its number of occurrences, i.e. its multiplicity, in said sequence. The value q is selected small enough to consider a complete hash table without redundancy.
The complexity of such hash table is of the order 0(|S|), where |S| is the length of sequence S, considering that the hash function is incremental and computable in 0(1) for each new character of S. The size of the computer memory for storing such hash table is equal to σq, where σ is the size of the alphabet _ . According to the present invention, the filtering operation on two sequences Si and S2 is done by the following steps :
(i) an initialization of the hash table for sequence Si is done for once, at the beginning of the whole process, with an order of complexity equal to 0(σq) ; a second table Iq(Sι) also having an order of complexity 0(σq) is also initialized; this table is intended to store a list of indices of the hash table HTq(Sι), i.e. the positions in table HTq(Sι) where there is a strictly positive count value;
(ii) sequence Si is parsed and all its q-grams are considered in sequence ; each time a new q-gram, hashed to index i, is discovered, index i is added in the list of positions Iq(Sι) ; the contents of the cell of the hash table corresponding to position i, denoted HTq(Sι)[i], is incremented so as to indicate that another such q-gram has been identified in sequence Si ; it should be noted here that this step (ii) has an order of complexity of 0(|S 11) ;
(iii) Steps (i) and (ii) are repeated for the second sequence S2, so as to obtain
- a hash table HTq(S2) storing the multiplicities of the q-grams of s2 ; - a table Iq(S2) that stores the marked indices in HTq(S2) ;
(iv) a variable count of a counter is then initialized to 0 ; this variable count aims at storing the number of common q-grams between sequences Si and S2 ; then, for each index i contained in Iq(Sι) which is also present in Iq(S2), the variable count is incremented with the smallest of values HTq(S])[i] and HTq(S2)[i] ; at the end of this process, the value of count is the number of common q-grams between sequences Si and S2; the time complexity of this step is equal to 0(|S1|); (v) cells HTq(Sl)[iJ of table HTq(Sι) are then individually reinitialized, using the list of positions contained in position table Iq(Sι); it should be noted here that this avoids going though the whole table HTq(Sl), as only the cells whose contents had been changed from zero to any positive number are reinitialized; the time complexity of this operation is at most 0(|S1|) ; the same selective reinitialization is performed on table HTq(S2) using position table Iq(S2), with a time complexity at most equal to 0(|S2|).
Regarding overall complexity, the above filter steps have a time and space complexity 0(σq) for the pre-processing time and a complexity 0(|S1| + |S2|) for the main processing.
A simplified example of the above filtering process is shown in Figure 3 of the appended drawings. An exemplary EST sequence Si is equal to "ACGTACACG". The value of q is 2. All sixteen possible q-grams are shown in the Figure. The contents of the one-dimension table HT2(Sι) is also shown, as well as the contents of table l2(Sι). This is a variable length, one-dimension table indicating as a list of the positions of hash table HT2(Sι) having a non-zero value. For instance, the contents of memory cell HT2(Sι)[5] is "1" in the shown example.
e) Pairwise filtering
Although sequence-to-sequence filtering has been described in the foregoing, practical applications will involve comparison of a first set of sequences SetS = {Si, S , S , ..., S with a second set of sequences SetE = {Ei, E , E3, ..., Ee}, where each sequence Sj of sequence set SetS will have to be compared with each sequence Ej of sequence set SetE (pairwise comparison).
A first "basic" approach to this problem would be to use the above- described filter in sequence for each pair (S;, Ej).
Considering the sum of lengths of sequences Si belonging to SetS, denoted Ks : Ks = |Sl| + |S2| + ... + |Ss| and the sum of lengths of sequences Ej belonging to SetE, denoted Ke :
Ke = |El| + |E2| + ... + |Es| the "basic" approach as explained above would have an order of complexity equal to 0(σq) regarding pre-processing time, and a time and the following space complexity PF:
PF = (e.|Sl| + Ke) + (e.|S2| + Ke) + ... + (e.|Ss| + Ke) = e.Ks + s.Ke at running time, using each time 0(σq) memory space.
An improvement to this basic approach would consist in performing the hash processing for once on each sequence Ej of sequence set SetE, and then to compare each S; of sequence set SetS against it (or conversely hashing each S; for once, and comparing each Ej against it).
In such case, it can be understood that the complexity would become: PF* = Ke + (e.|Sl| + e.|S2| + ... + e.|Ss|) = e.Ks + Ke at running time, still requiring 0(σq) memory space. It should be noted here that, if SetE is the same as SetS, which is typically the case for bank-to-itself sequence clustering, then the whole filtering complexity would become:
Figure imgf000024_0001
at running time, still requiring 0(σq) memory space.
f) Considering sequence blocks
An application of the present invention is the clustering of a large bank of ESTs against itself or against another bank of ESTs. Typically, such sets can be as large as 3 billions of sequences. The above-described improvement would indeed accelerate pairwise comparisons, but its worst case complexity PF would also be its average complexity, since every new sequence Sj is hashed every time, which costs 0(jSi|) in terms of complexity.
In order to fasten the whole process on average, an improvement according to an aspect of the present invention consists in performing filtering between an EST Sj belonging to a sequence set SetS containing s ESTs Si, S2, S3, ..., Ss and a preprocessed set SetE containing e ESTs Ei, E_, E3, ..., Ee. The set SetE is called a "chunk" in the following.
The underlying idea is to become less dependent of the real size of the ESTs of and to depend more on the number of common q-grams, that is much smaller on average. Figure 4a diagrammatically illustrates the whole comparison workflow, where it can be seen that a first bank Bi of sequences Ej, E2, ... Ee is subdivided into a plurality of chunks SetEj, SetE2, etc., while each sequence Si, S2, ..., Ss in a second bank B2 is individually used as a query sequence against each chunk, as shown.
If banks B] and B2 and each comprise are the same sequences Ei, E2, ... Ee, then the symmetric character of q-gram filtering as described above is such that a comparison triangle is sufficient, as diagrammatically shown in Figure 4b.
In order to achieve this aim, the multiplicities of q-grams of the sequences contained in a given chunk SetE are practically determined at the chunk level by building in a working memory a Huge Hash Table HHT. As will be illustrated in the following, such hash table associates to each possible q-gram a list of the ESTs contained in SetE in which this q-gram appears, together with its number of occurrences (i.e. its multiplicity) inside each concerned sequence. The size of this table HHT is equal at most to e.σq, where e is the number of sequences in chunk SetE.
A sequence S is compared to the contents of the HHT table by using a one-dimension table D having a size e, where e is the number of sequences in chunk SetE, each cell being initialized to 0 everywhere. Firstly, the sequence S is parsed and the hash table HTq(S) and index table Iq(S) as described in the foregoing are built. Then, for each index i in table Iq(S), which designates a given q-gram wi; the process goes through the list of couples of values (z, k(i, z)) contained in a portion HHT[i] of table HHT, where z is the order the sequence in chunk SetE (i.e. it designates a particular sequence Ez among sequences Ei, E2, ..., Ee) and k(i, z) is the number of times the considered q-gram W; appears in sequence Ez. For each value of z, it is determined which of the values stored in cell HTq(S)[i] and in cell HHT[i, z] is the smallest one, and this value is stored as D[z] in a corresponding cell of table D which is allocated to sequence Ez. It should be noted here that the structures of tables HHT, HTq(S) and Iq(S) is such that all counts which have appeared to be zero (absence of considered q-gram) do not need to be accessed and processed, which significantly contributes to efficiency.
The loading of cells D[z] in table D is incremential, i.e. as soon as a new value is to be loaded in a given cell D[z], it is added to the value previously stored in the same cell.
At the end of the whole process, each value D[z] in table D represents the number of common q-grams between sequences S and Ez, for all values of z comprised between 1 and e.
Then, all sequences Ez that show a value D[z] which is greater than a given threshold Nq are selected for subsequent actual comparison with sequence S (such as k-similarity determination as described in the foregoing). Figure 5 illustrates in a simplified manner the above process. A huge hash table HHT is build for a set of sequences SetE = {Ei, E2, E3, E4} having the following contents:
E1 = "ACGTACACG"
E2 = "AAGTAAAGGT
E3 = "CCGCCA"
E4 = "AAGTACTCG"
The value e is therefore equal to 4 and as in the previous example, q is set to 2.
The contents of table HHT for the chunk SetE appears in Figure 5. For each 2-gram W;, a memory cell HHT[i, z] is filled in table portion HHT[i] each time a sequence Ez having at least one such 2-gram inside it. For instance, the 2-gram "AA" is found three times in sequence E2 and once in sequence E4, so that two corresponding cells are filled under the "AA" 2-gram position. Also particularly enhanced in Figure 5 are: - the fact that cell HHT[i, z] for i = 7 and z = 4 contains 4 as value of z and 1 as value of k(i, z); - the three cells contained in portion HHT[i] for i = 12, with the last cell containing a value of z equal to 4 and a value of k(i, z) equal to 1.
Turning now to Figure 6, a set of sequences SetS = {SI, S2, S3, S4} is provided. It the present example, sequences S1-S4 are identical to sequences E1-E4, respectively, which reflects a "bank-against-itself filtering. After table HHT has been built as described above, sequence Sj is compared to table HHT in Step 1. For this purpose, hash table HT (Sι) and index table I2(Sι) are built. The sequence Si being the same as for the example shown in Figure 3, these tables are identical to those shown in Figure 3.
In addition, a table D having an initial contents of {0, 0, 0, 0} is created.
The process goes as follows: - as first index i in table I2(Sι) is "2", portion HHT[2] of table HHT is parsed, but sequence Ei is skipped (since Ej is the same as Si, and since a sequence-against-itself filtering has no interest);
- sequence E4 (z = 4) is identified there as having one (k(2, 4) = 1) occurrence of 2-gram w2 (being "AC"); - a determination is made of the minimum value between HT2(Sι)[2] (value 3) and HHT[2, 4] (value 1), so that D[4] in table D is loaded with value 1;
- as the end of portion HHT[2] in table HHT has been reached, the next index (i = 5) in table I2(Sι) is read; - portion HHT[5] (corresponding to 2-gram w5 = "CA") in table HHT is then parsed; sequence Ei being skipped, a sequence E3 is identified there as having one occurrence of this 2-gram; the minimum value between HT2(Sι)[5] (value 1) and HHT[5, 3] (value 1), i.e. 1, is loaded as D[3] in table D; - the process is repeated and the contents of cells D[2], D[3] an D[4] are progressively incremented with the minimum values determined as just explained. At the end of the process, it is easy to understand that table D will have the following contents: {X, 2, 2, 3} (X means that the contents is not significant, as it would correspond to a sequence-against-itself filtering).
By the same routine, S2 = E2 is used as query sequence in Step 2 and will lead to a table D having the following contents: {X, X, 0, 4} (the first X means that the contents is not useful, as Si (=Eι) had already been considered against E2 (=S2) in the first run, and the second X means that S2 (=E2) does not have to be considered against itself).
S3 = E3 is then used as query sequence in step 3 and will lead to a table D having the following contents: {X, X, X, 1 }.
Finally, it is useless to use S4 = E4 as query sequence since it has already been considered against all other sequences.
Figure 6 also illustrates an application of the above-described filtering process to the clustering of sequences Si, ..., S4 here again considered as identical to sequences Ei, ..., E4. It is supposed that each of these sequences originally belongs to a different original cluster, respectively CLj, CL2, CL and CL4, as illustrated in Figure 6.
First of all, it should be noted that the above filtering process is followed by a k-similarity determination process designated as "FWSB", having as parameters L=5 (window length) and k=l (maximal number of substitutions) - cf. supra. A practical implementation of this process will be described at a later stage.
From the above-described Lemma 1, Nq, i.e. the threshold for the filtering process, can be computed:
N2 = (5 + l - (l + l).2) = 2
As a consequence, every cell of every table D which will contain a number at least equal to 2 will reveal that the considered sequence couple passes the filter.
The table D as obtained above with sequence Si indicates that Sι=Eι passes the filter with all three sequences E2, E3 and E4, which implies that all couples (Si, E2), (Si, E3) and (Si, E4) will proceed through the k- similarity determination.
Figure 6 shows that only (Si, E2) and (Si, E4) pass the k-similarity determination, so that, sequences Sι=Eι, E2 and E4 now belong to the same cluster CLι24.
The table D obtained with query sequence S2 indicates that couple (S , E4) passes the filter. It is now assumed that the k-similarity determination process on this couple determines that k-similarity is achieved. However, since S ==E2 and E4 already belonged to the same cluster (cf. supra), there is no cluster rearrangement required. Finally, table D obtained with query sequence S3 reveals that sequence pair (S3, E4) does not pass the filter. Therefore, this pair is not directed to the k-similarity determination process process, and the cluster arrangement is left as such.
g) k-similarity determination process
In a practical embodiment of the present invention the k-similarity determination process is based on the well known "BLAST" method, which is described for instance in the paper "Basic Local Alignment Search Tool", cf. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool", Journal of Molecular Biology, 215:403-410, 1990).
More particularly, the present invention preferably uses a slightly modified implementation of BLAST method which is known per se: practically, the process comprises performing the BLAST method among the sequence pairs which have passed the q-gram filtering, so as to provide a fast search of the longest identity regions shared by each sequence pair. A mere checking of the BLAST output is then performed so as to isolate, in such regions, at least one region which satisfies the ICS determination with given parameters L and P (cf. supra).
h) Numerical values To get an idea of the gain which can be expected with a filtering according to the present invention, Table 1 hereinbelow gives numerical values for sets of sequences SetS and SetE each having e = s = 100,000 sequences, when the size n of each sequence is respectively 200, 300, 500 and 1,000. The expected filtering time in a symmetric Bernoulli model (on nucleic sequences, i.e. with an alphabet size σ equal to 4) has been computed for:
(a) a single chunk, assuming that the memory space is large enough for the whole hash table HHT (column "BF" in Table 1);
(b) two chunks of 50,000 ESTs each (column "BF2");
(c) four chunks of 25,000 ESTs each (column "BF4").
Finally, the computation of PF*/BF in the rightmost column of Table 1 gives an estimation of the expected gain with a filtering according to the present invention.
Figure imgf000032_0001
i) Single link clustering
Figure 7 of the appended drawings illustrates the global mechanism for the comparison of two ESTs S and E belonging respectively to two different sets of sequences SetS and SetE, where each set initially forms a respective cluster of sequences.
The sequences S and E are subjected to a clustering process in two main steps:
- first the sequences S and E are subjected to q-gram filtering by using chunks as described in the foregoing;
- when the two ESTs pass trough the filter, i.e. the count for (E, S) is at least equal to Nq, they are compared with each other e.g. with the above- described FWSB process so as to determine whether they are k-similar. If so, the process considers that sequences S and E should belong to the same cluster, so that clusters SetS and SetE are merged together into a cluster SetX.
The links between the ESTs belonging to such a resulting cluster (SetE SetS) should now be considered. With the systematic approach as described in the foregoing, according to which every sequence is tested against each other one, all the ESTs inside said resulting cluster have been tested against the others for similarity, and certain couples have been declared similar while other couples have been declared as not similar. This results from the transitivity implied by cluster merging.
In the meantime, the following observations can be made: (a) a user is generally interested in some specific clusters, but not all clusters: for instance, a user will merge his own (proprietary) sequences with public ones and be interested in the clusters where at least one of its proprietary sequence appears, (b) studying a cluster in depth requires to re-compute links between all ESTs in the set with a tighter or stricter similarity notion and a fine overlap of weights. In consequence, it is not necessary to compute all the similarity links between the ESTs of the same cluster.
According to an improvement of the present invention, a so-called "single link" clustering is performed, i.e. if two ESTs already belong to the same cluster by transitivity, the actual similarity test between both is not performed.
More precisely, a so-called "Union-Find algorithm" as disclosed e.g. in the paper "Efficiency of a good but non linear set union algorithm, R.E. Tarjan, Journal of the ACM, 22(2), pp. 215-225, 1975, is used for testing whether two sequences are already in the same cluster or not before performing the time and memory space consuming comparison of the two sequences by the FWSB similarity process as described in the foregoing.
It could be demonstrated that the Union-Find algorithm is not linear, but in practice is very close to being linear. From there, it could be shown that the tests for determining whether two sequences belong to the same cluster (i.e. whether they have the same cluster attribute) requires at most an additional time or the order of log* (n).
For instance, and turning back to Figure 6, this test implies that the FWSB process on the pair (S2, E4) will not be performed as S2(=E ) and E4 have previously been grouped into the same cluster. k) Mapping sequences on genomes
The present invention can easily be adapted to map ESTs on a genome G. The basic principle is to split the genome G in blocks having a given length d that overlap with o bases, and to consider all these blocks as the queries for the filtering process. Figure 8 shows such a split chromosome. In order to avoid the risk of undetected k-similarities, the overlap is chosen at least equal to the window length L selected by the operator at the beginning of the process and used in the FWSB k- similarity process. This will be further explained in the following.
At this stage, this approach is similar to the QUASAR prior art mentioned in the introductory part of the present application.
The process according to the present invention is as follows:
- as shown in Figure 8, the genome G is firstly split into blocks Bi, ..., Bm of length d (except last block Bm which may have a different length, d being chosen greater than the length of the longest EST in the set of sequences SetS to which genome G is to be compared, with an overlap of L bases, L being the length of the similarity window;
- a huge hash table HHT is then built for the set of blocks Bi, ..., Bm obtained by this process; - all the ESTs of SetS are individually compared to the blocks Bi, ..., Bm by parsing table HHT as described above; - however, an important feature of the inventive process is that, if a given EST Sj of SetS passes through the filter with at least two contiguous blocks, e.g. with three blocks Bj, Bj+i and Bj+2, then the k-similarity determination is performed between EST Si and a "superblock" SB which is the concatenation said contiguous blocks, i.e. Bj, Bj-n and Bj+2 in the present species, with elimination of the overlap(s);
- all the longest matching positions for which the k-similarity between Sj and the superblock SB are reported.
This is illustrated in Figure 9 of the drawings.
Turning back to the above-mentioned overlap of length L between to consecutive blocks, it has the purpose of avoiding the situation shown in Figure 10 of the drawings, where an overlap o smaller than L has been selected. In such a situation, the occurrence of a sequence lying at the overlap between two consecutive blocks Bk, Bk+i and similar to a query sequence Sk could be missed since none of the blocks B or Bk+ι would have enough q-grams in common with the query sequence to pass through the filter.
The flexibility of the above split-and-reconcatenate approach derives from the use of hash tables, which strongly simplifies the whole process as will be explained in the following:
- first of all, since there is no pre-processing required, a genome can be split into blocks "on the run" with an overlap preferably equal to the window length L selected by the user as a parameter, which allows different, optimal genome splittings depending on the value of L; - secondly, by concatenating again the blocks which have passed the filtering process with a given query sequence, as said blocks initially were, erroneous double matchings are avoided, as matching portions located at the overlap between two blocks will be merged again into a single matching portion.
Many variants and modification can be brought by the one skilled in the art to the present invention. In particular, the filtering and similarity processes can be easily parallelized in a multi-processor environment, whether in a shared or distributed memory architecture.
The following Table II indicates the speed-up achieved by parallel processing in the practical example of a set of 100000 ESTs compared against itself, with each EST having a length greater than 800 and the average length being 927. The computer environment was an AlphaServer ES40/466 with four 666 MHz processors, as available at Compaq.
Table II
Figure imgf000037_0001
Figure imgf000038_0001

Claims

1. A method for filtering a plurality of nucleic or proteic sequences (Ej, ..., Ee; Si, ..., Ss; SetE, SetS) for the purpose of subsequent sequence similarity determination, each position in each sequence belonging to a predetermined alphabet (_ ), and said sequences being stored in a computer memory, the method comprising the following steps:
(a) determining a word length (q) having a value substantially lower than the number of characters in said sequences, so as to define from this word length and from the alphabet (_ ) a set of possible words (q-grams) (w;) having length q,
(b) storing in a working memory a plurality of first sequences
({Ez}), (c) for each q-gram and for each sequence (Ez), counting the number of occurrences of the q-gram in the sequence,
(d) constructing and storing in said working memory, in association with each q-gram (Wj), a respective count set (HHT[i]) containing, for each member of the count set, an identifier (z) of a first sequence (Ez) having at least one occurrence of the q-gram and an first occurrence count (k(i,z)) of the q-gram in sequence (Ez), as determined in step (c),
(e) counting the number of occurrences of each q-gram (Wj) in a second sequence (Sy) for generating second occurrence counts (HTq(Sy)[iJ),
(f) storing in the working memory the second occurrence counts (HTq(Sy)[i]) as determined in step (e) as well as a set (Iq(Sy)) of identifiers (i) of q-grams (wj) for which the counts (HTq(Sy)[i]) are strictly positive,
(g) for each q-gram identifier (i) contained in set (Iq(Sy)):
(gl) reading the count set (HHT[i]) constructed in step (d) in relation with the q-gram (w;),
(g2) for each sequence (Ez) identified in count set (HHT[i]), determining the minimum value between the corresponding first occurrence count (k(i,z)) and the second occurrence count
(HTq(Sy)[i]), (h) for each sequence (Ez), determining and storing a common q- gram score (D[zj) with said second sequence (Sy) based on the minimum values determined in step (g2) for said first sequence, and
(i) rejecting sequence (Ez) and sequence (Sy) as similarity candidates for each sequence (Ez) for which score (D[z]) is lower than a given threshold value.
2. A method according to claim 1, wherein steps (e) to (i) are performed for a plurality of second sequences (Sy) with use of the same . count sets (HHT[i]) as constructed and stored in step (d).
3. A method according to claim 2, in which said second sequence • (Sy) belongs to a set of sequences (Si, ..., Ss; SetS) which is identical to said set of first sequences (Ei, ..., Ee; SetE), wherein steps (g2), (h) and (i) are omitted for a first sequence (Ez) which is identical to the second sequence (Sy) and for any first sequence (Ez) which has already been used as second sequence (Sy) in steps (g2), (h) and (i).
4. A method for determining the similarities between a plurality of nucleic or proteic sequences (Ei, ..., Ee; Si, ..., Ss; SetE, SetS), each position in each sequence belonging to a predetermined alphabet (_ ), and said sequences being stored jn a computer memory, the method comprising the following steps:
(a) determining a word length (q) having a value substantially lower than the number of characters in said sequences, so as to define from this word length and from the alphabet (__) a set of possible words
(q-grams) (w;) having length q, (b) storing in a working memory a plurality of first sequences
({Ez}),
(c) for each q-gram and for each sequence (Ez), counting the number of occurrences of the q-gram in the sequence,
(d) constructing and storing in said working memory, in association with each q-gram (WJ), a respective count set (HHT[ij) containing, for each member of the count set, an identifier (z) of a first sequence (Ez) having at least one occurrence of the q-gram and an first occurrence count (k(i,z)) of the q-gram in sequence (Ez), as determined in step (c), (e) counting the number of occurrences of each q-gram (w;) in a second sequence (Sy) for generating second occurrence counts (HTq(Sy)[i])5
(f) storing in the working memory the second occurrence counts (HTq(Sy)[i]) as determined in step (e) as well as a set (Iq(Sy)) of identifiers (i) of q-grams (WJ) for which the counts (HTq(Sy)[i]) are strictly positive,
(g) for each q-gram identifier (i) contained in set (Iq(Sy)): (gl) reading the count set (HHT[i]) constructed in step (d) in relation with the q-gram (Wj),
(g2) for each sequence (Ez) identified in count set (HHT[ij), determining the minimum value between the corresponding first occurrence count (k(i,z)) and the second occurrence count
(HTq(Sy)[i]),
(h) for each sequence (Ez), determining and storing a common q- gram score (D[z]) with said second sequence (Sy) based on the minimum values determined in step (g2) for said first sequence, (i) accepting sequence (Ez) and sequence (Sy) as similarity candidates for each sequence (Ez) for which the common q-gram score (D[z]) is higher than a given threshold value,
(j) for each accepted pair of sequences (Ez, Sy), performing a similarity determination.
5. A method according to claim 4, wherein steps (e) to (j) are performed for a plurality of second sequences (Sy) with use of the same count sets (HHT[i]) as constructed and stored in step (d).
6. A method according to claim 4 or 5, wherein said similarity determination comprises the following steps:
(k) determining a sub-sequence length (L) lower than the lengths of the sequences (Ez, Sy),
(1) determining a percentage (P) of identities, (m) determining whether the two sequences (Ez, Sy) share a common sub-sequence having length (L) and having a percentage (P) of identical characters.
7. A method according to claim 4 to 6, wherein said plurality of said first sequences (Ez) are constituted by overlapping blocks (Bj) of a long sequence such as a genome (G), and wherein, after at least two steps (i) have accepted as candidates a given second sequence (S) and at least two first sequences (Bj, Bj+j, ...) which are contiguous blocks in long sequence (G), step (j) is performed on the second sequence (S) and to a supersequence (SB) constituted by the first sequences (Bj, Bj+i, ...) re- concatenated with elimination of overlap (L).
8. A method according to claims 6 and 7 taken in combination, further comprising an initial step in which a user selects a sub-sequence length (L) for similarity determination, and a step of constructing said overlapping blocks with an overlap length (L) equal to the sub-sequence length.
9. A method for clustering a plurality of nucleic or proteic sequences (Ei, ..., Ee; Si, ..., Ss; SetE, SetS), each position in each sequence belonging to a predetermined alphabet (_ ), said sequences being stored in a computer memory, the method comprising the following steps:
(a) determining a word length (q) having a value substantially lower than the number of characters in said sequences, so as to define from this word length and from the alphabet (∑) a set of possible words
(q-grams) (Wj) having length q, (b) storing in a working memory a plurality of first sequences
({Ez}); (c) for each q-gram and for each sequence (Ez), counting the number of occurrences of the q-gram in the sequence,
(d) constructing and storing in said working memory, in association with each q-gram (w,), a respective count set (HHT[i]) containing, for each member of the count set, an identifier (z) of a first sequence (Ez) having at least one occurrence of the q-gram and an first occurrence count (k(i,z)) of the q-gram in sequence (Ez), as determined in step (c),
(e) counting the number of occurrences of each q-gram (w;) in a second sequence (Sy) for generating second occurrence counts
(HTq(Sy)[i]),
(f) storing in the working memory the second occurrence counts (HTq(Sy)[i]) as determined in step (e) as well as a set (Iq(Sy)) of identifiers (i) of q-grams (Wj) for which the counts (HTq(Sy)[i]) are strictly positive,
(g) for each q-gram identifier (i) contained in set (Iq(Sy)):
(gl) reading the count set (HHT[i]) constructed in step (d) in relation with the q-gram (Wj),
(g2) for each sequence (Ez) identified in count set (HHT[i]), determining the minimum value between the corresponding first occurrence count (k(i,z)) and the second occurrence count
Figure imgf000044_0001
(h) for each sequence (Ez), determining and storing a common q- gram score (D[zj) with said second sequence (Sy) based on the minimum values determined in step (g2) for said first sequence, (i) accepting sequence (Ez) and sequence (Sy) as similarity candidates for each sequence (Ez) for which said common word count is higher than a given threshold value,
(j) for each accepted pair of sequences (Ez, Sy), performing a similarity determination, and
(k) for each pair of sequences (Ez, Sy) determined as similar, attaching to said sequences a common cluster attribute.
10. A method according to claim 9, wherein steps (e) to (k) are performed for a plurality of second sequences (Sy) with use of the same count sets (HHT[i]) as constructed and stored in step (d).
1 1. A method according to claim 9 or 10, further comprising the following steps: (1) identifying possible other sequences which where previously sharing a common cluster attribute with either of the sequences determined as similar, and
(m) attaching to said other sequences said common cluster attribute.
12. A method according to claim 11, further comprising, before step (j), the steps of : checking whether the similarity candidates have a same cluster attribute, if not, performing step (j), if so, skipping step (j) and passing to the next first sequence (Ez) or second sequence (Sy).
PCT/IB2002/002019 2002-03-27 2002-03-27 Method for filtering nucleic or proteic sequence data WO2003081507A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU2002309085A AU2002309085A1 (en) 2002-03-27 2002-03-27 Method for filtering nucleic or proteic sequence data
PCT/IB2002/002019 WO2003081507A1 (en) 2002-03-27 2002-03-27 Method for filtering nucleic or proteic sequence data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2002/002019 WO2003081507A1 (en) 2002-03-27 2002-03-27 Method for filtering nucleic or proteic sequence data

Publications (1)

Publication Number Publication Date
WO2003081507A1 true WO2003081507A1 (en) 2003-10-02

Family

ID=28053188

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2002/002019 WO2003081507A1 (en) 2002-03-27 2002-03-27 Method for filtering nucleic or proteic sequence data

Country Status (2)

Country Link
AU (1) AU2002309085A1 (en)
WO (1) WO2003081507A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018173042A1 (en) * 2017-03-19 2018-09-27 Ofek - Eshkolot Research And Development Ltd System and method for generating filters for k-mismatch search

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
BURKE J ET AL: "d2_cluster: a validated method for clustering EST and full-length cDNAsequences.", GENOME RESEARCH. UNITED STATES NOV 1999, vol. 9, no. 11, November 1999 (1999-11-01), pages 1135 - 1142, XP002235681, ISSN: 1088-9051, Retrieved from the Internet <URL:http://www.genome.org/cgi/reprint/9/11/1135.pdf> [retrieved on 20030319] *
BURKHARDT S ET AL: "q-gram Based Database Searching Using a Suffix Array (QUASAR)", THIRD ANNUAL INTERNATIONAL CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY RECOMB'99, 11 April 1999 (1999-04-11) - 14 April 1999 (1999-04-14), XP002193332 *
KÄRKKÄINEN J AND SUTINEN E: "Lempel-Ziv Index for q-Grams", ALGORITHMICA, vol. 21, 8 May 1998 (1998-05-08), pages 137 - 154, XP002235680, Retrieved from the Internet <URL:http://link.springer.de/link/service/journals/00453/papers/21n1p137.pdf> [retrieved on 20030321] *
LARSON P-A: "DYNAMIC HASH TABLES", COMMUNICATIONS OF THE ASSOCIATION FOR COMPUTING MACHINERY, ASSOCIATION FOR COMPUTING MACHINERY. NEW YORK, US, vol. 31, no. 4, 1 April 1988 (1988-04-01), pages 456 - 457, XP000051583, ISSN: 0001-0782 *
NAVARRO G ET AL: "Indexing Methods for Approximate String Matching", IEEE DATA ENGINEERING BULLETIN, vol. 24, no. 4, 2001, pages 19 - 27, XP002235682, Retrieved from the Internet <URL:http://www.dcc.uchile.cl/~gnavarro/ps/deb01.ps.gz> [retrieved on 20030321] *
NAVARRO G: "A Guided Tour to Approximate String Matching", ACM COMPUTING SURVEYS, vol. 33, no. 1, 1 March 2001 (2001-03-01), pages 31 - 88, XP002235679, Retrieved from the Internet <URL:http://www.csis.hku.hk/~nikos/courses/CSIS7101/papers/navarro.pdf> [retrieved on 20030320] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018173042A1 (en) * 2017-03-19 2018-09-27 Ofek - Eshkolot Research And Development Ltd System and method for generating filters for k-mismatch search
EP3602350A4 (en) * 2017-03-19 2021-01-27 Ofek Eshkolot Research And Development Ltd. System and method for generating filters for k-mismatch search
US11250064B2 (en) 2017-03-19 2022-02-15 Ofek—Eshkolot Research And Development Ltd. System and method for generating filters for K-mismatch search

Also Published As

Publication number Publication date
AU2002309085A1 (en) 2003-10-08

Similar Documents

Publication Publication Date Title
US20240153584A1 (en) Systems and methods for analyzing sequence data
US6457006B1 (en) System and method for organizing data
US7640256B2 (en) Data collection cataloguing and searching method and system
US6636849B1 (en) Data search employing metric spaces, multigrid indexes, and B-grid trees
US10521441B2 (en) System and method for approximate searching very large data
US7657506B2 (en) Methods and apparatus for automated matching and classification of data
US5701256A (en) Method and apparatus for biological sequence comparison
US8271206B2 (en) DNA sequence assembly methods of short reads
US20030033279A1 (en) Methods and apparatus for high-speed approximate sub-string searches
Giegerich et al. Efficient implementation of lazy suffix trees
Zhai et al. ATLAS: a probabilistic algorithm for high dimensional similarity search
Yanovsky ReCoil-an algorithm for compression of extremely large datasets of DNA data
US20070038381A1 (en) Efficient method for alignment of a polypeptide query against a collection of polypeptide subjects
US20030157489A1 (en) Recursive categorical sequence assembly
US8788522B2 (en) Pair character string retrieval system
US10545960B1 (en) System and method for set overlap searching of data lakes
Roberts et al. A preprocessor for shotgun assembly of large genomes
US11645292B2 (en) Efficient similarity search
Bluis et al. Nodal distance algorithm: Calculating a phylogenetic tree comparison metric
Deveci et al. Hypergraph sparsification and its application to partitioning
Olson An approximation algorithm for least median of squares regression
WO2003081507A1 (en) Method for filtering nucleic or proteic sequence data
Nicolas et al. Hardness of optimal spaced seed design
Hakata et al. Algorithms for the longest common subsequence problem for multiple strings based on geometric maxima
Aung et al. An indexing scheme for fast and accurate chemical fingerprint database searching

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP