METHOD FOR FILTERING NUCLEIC OR PROTEIC SEQUENCE DATA
The present invention generally relates to nucleic and proteic sequence comparison and classification systems and methods.
Background of the invention
a) Generalities
Nucleic and proteic sequence comparison and classification is one of the oldest and most persistent problems in computational biology. Many metrics and similarity measures have been defined according to the underlying biological problem.
The initial comparison of expressed sequence tag (hereinafter "EST" or "sequence" or "sequence data") sequences for the purpose of generating "clusters" of sequences having a given degree of similarity is based on the fact that such sequences are supposed to strongly overlap.
The Initial Clustering Similarity (ICS) is the basic similarity used in most systems and methods performing initial clustering of sequences. Figure 1 of the appended drawings shows five ESTs S1 to S5 derived from a messenger RNA sequence designated as mRNA, with supposedly overlapping regions OR1 to OR4 shown in dark. Turning now to Figure 2, two ESTs S1 and S2 are considered similar according to the ICS criterion if they share a window of length L with at least a percentage P of identical members. Typically, if ESTs each having a length of about 300-400 nucleic base pairs are considered, the window length is generally chosen equal to about 150 and the percentage P equal to about 96%. In
most known systems, the user can adapt the length L and the percentage P as a function of the particularities of the set of sequences to be processed.
A previous EST clustering method known as "d2" is the subject matter of the paper "d2_cluster: A Validated Method for Clustering EST and Full-Length cDNA Sequences", J. Burke et al., Genome Research, 9(11):1135-1142, 1999.
However, this known method has a number of drawbacks, and the teachings of this paper have a number of limitations. First of all, this method is limited to the comparison between sequences inside a single EST set. This means that it cannot be used for detecting similarities between sequences belonging to two different sets, which is a strong limitation in view of the current expectations from the market.
In addition, this prior document is far from being clear concerning how to determine whether two ESTs are similar or not. More particularly, the authors of this document assert that the method is capable of verifying a possible overlap of length L with a certain percentage of identities P using the d2 method, and refer in this regard to a prior paper "Biological evaluation of d2, an algorithm for high-performance sequence comparison", W. Hide et al., Journal of Computational Biology, 1:199-215, 1994.
However, in this prior paper, the d2 algorithm is devoted to computing a measure between two biological sequences and is hardly related to a
similarity detection based on a window of length L and a percentage of identities P. As a consequence, the one skilled in the art would not be capable of practically implementing the d2 algorithm for performing (L, P) comparisons.
Last but not least, it can in any case be understood from the d2 prior art that the actual complexity of the computations will be proportional to the square of the sum of the lengths of all the sequences contained in the starting set, which leads to impracticable execution times even with state-of-the-art computer systems.
Therefore, these prior art references do not provide or even suggest a practical (L, P) similarity determination.
Besides, and turning now to issues of computing capabilities, recent attempts have been made to decrease the computing work required for pair-to-pair comparison of sequences. The paper "q-gram Based Database Searching Using a Suffix Array (QUASAR)", S. Burkhardt et al., RECOMB 99, Third Annual International Conference on Computational Molecular Biology, discloses a method for determining similarity between a query sequence and sequences contained in a database, based on the principle of counting occurrences of short words (hereinafter "q-grams", i.e. words having a length of q characters) in the query sequence and in the database, and detecting similarity if the count of q-grams in common exceeds a limit.
This known method, although it allows much faster processing and requires less working memory than other algorithms such as basic local alignment algorithms, nevertheless has a number of drawbacks.
First of all, this known method makes use of a heavy indexing structure, i.e. a suffix array, which has to be preprocessed on the whole starting database and cannot practically be built "on the run". This means that the suffix array is valid only as long as the database contents do not change, which is not realistic: as this heavy preprocessing would have to be redone each time the database is changed, the computer system would spend most of its processing time, or in any case a huge part of its resources, building an updated suffix array.
Another drawback of this known system is that, when it is applied to a very long sequence (typically a genome) split into smaller blocks, then a predetermined, longest possible overlap between blocks has to be provided, so as to avoid in any case that an actually pertinent sequence might remain undetected as being split among two consecutive blocks (this will be explained in greater detail in the description).
However, the systematic use of a long overlap, and in particular an overlap much longer than necessary, will result in: (a) performing many unnecessary q-gram count comparisons against blocks, and (b) generating two different matches while there is only one occurrence of a pertinent sequence, as this pertinent sequence, if contained in the overlap, will appear in two different blocks.
Summary of the invention
A first object of the present invention is to provide a method for detecting similar sequences of characters, such as nucleic or proteic sequences, which can operate on a sequence-against-sequence basis while benefiting from a significant gain in terms of processing time and required memory space, based on a block processing of a plurality of sequences.
Another object of the present invention is to provide a method that does not rely upon a preprocessed heavy suffix array or equivalent, i.e. that can be directly implemented on any selected set of sequences, and even on a global database, provided that enough working memory space is available, and that can be run efficiently even if changes occur in the contents of the database.
Another object of the present invention is to allow a practical and efficient implementation of a clustering method involving a similarity determination based on the existence in a sequence pair of a common window of length L with a percentage P of identical characters.
Still another object of the present invention is to decrease as far as possible the number of required elementary filtering operations by benefiting from the properties of cluster merging.
Another object of the present invention, when dealing with long sequences (such as genomic sequences) split into blocks, is to allow an
adjustment of the overlap between blocks to decrease the number of elementary filtering operations to what is strictly needed and to avoid double matching report when there is only one pertinent sub-sequence in the long sequence.
According to a first aspect, the present invention provides a method for filtering a plurality of nucleic or proteic sequences E1, ..., Ee (SetE) and/or S1, ..., Ss (SetS) for the purpose of subsequent sequence similarity determination, each position in each sequence belonging to a predetermined alphabet Σ, and said sequences being stored in a computer memory, the method comprising the following steps:
(a) determining a word length q having a value substantially lower than the number of characters in said sequences, so as to define from this word length and from the alphabet Σ a set of possible words (q-grams) wi having length q,
(b) storing in a working memory a plurality of first sequences {Ez},
(c) for each q-gram wi and for each sequence Ez, counting the number of occurrences of the q-gram in the sequence,
(d) constructing and storing in said working memory, in association with each q-gram wi, a respective count set HHT[i] containing, for each member of the count set, an identifier z of a first sequence Ez having at least one occurrence of the q-gram and a first occurrence count k(i,z) of the q-gram in sequence Ez, as determined in step (c),
(e) counting the number of occurrences of each q-gram wi in a second sequence Sy for generating second occurrence counts HTq(Sy)[i],
(f) storing in the working memory the second occurrence counts HTq(Sy)[i] as determined in step (e) as well as a set Iq(Sy) of identifiers i of q-grams wi for which the counts HTq(Sy)[i] are strictly positive,
(g) for each q-gram identifier i contained in set Iq(Sy): (g1) reading the count set HHT[i] constructed in step (d) in relation with the q-gram wi,
(g2) for each sequence Ez identified in count set HHT[i], determining the minimum value between the corresponding first occurrence count k(i,z) and the second occurrence count HTq(Sy)[i],
(h) for each sequence Ez, determining and storing a common q-gram score D[z] with said second sequence Sy based on the minimum values determined in step (g2) for said first sequence, and
(i) rejecting sequence Ez and sequence Sy as similarity candidates for each sequence Ez for which score D[z] is lower than a given threshold value.
Advantageously, steps (e) to (i) are performed for a plurality of second sequences (Sy) with use of the same count sets (HHT[i]) as constructed and stored in step (d), so that the processing resources and memory space used for count sets HHT[i] are saved.
Preferably, when said second sequence Sy belongs to a set of sequences S1, ..., Ss (SetS) which is identical to said set of first sequences E1, ..., Ee
(SetE), then steps (g2), (h) and (i) are omitted for a first sequence Ez which is identical to the second sequence Sy and for any first sequence Ez which has already been used as second sequence Sy in steps (g2), (h) and
(i). This avoids useless work consisting in filtering a given sequence against itself or against another sequence when said another sequence has already been filtered against said given sequence.
According to a second aspect, the present invention provides a method for determining the similarities between a plurality of nucleic or proteic sequences E1, ..., Ee (SetE) and/or S1, ..., Ss (SetS), each position in each sequence belonging to a predetermined alphabet Σ, and said sequences being stored in a computer memory, the method comprising the following steps:
(a) determining a word length q having a value substantially lower than the number of characters in said sequences, so as to define from this word length and from the alphabet Σ a set of possible words (q-grams) wi having length q,
(b) storing in a working memory a plurality of first sequences {Ez},
(c) for each q-gram wi and for each sequence Ez, counting the number of occurrences of the q-gram in the sequence,
(d) constructing and storing in said working memory, in association with each q-gram wi, a respective count set HHT[i] containing, for each member of the count set, an identifier z of a first sequence Ez having at least one occurrence of the q-gram and a first occurrence count k(i,z) of the q-gram in sequence Ez, as determined in step (c),
(e) counting the number of occurrences of each q-gram wi in a second sequence Sy for generating second occurrence counts HTq(Sy)[i],
(f) storing in the working memory the second occurrence counts HTq(Sy)[i] as determined in step (e) as well as a set Iq(Sy) of identifiers i of q-grams wi for which the counts HTq(Sy)[i] are strictly positive,
(g) for each q-gram identifier i contained in set Iq(Sy): (g1) reading the count set HHT[i] constructed in step (d) in relation with the q-gram wi,
(g2) for each sequence Ez identified in count set HHT[i], determining the minimum value between the corresponding first occurrence count k(i,z) and the second occurrence count HTq(Sy)[i],
(h) for each sequence Ez, determining and storing a common q-gram score D[z] with said second sequence Sy based on the minimum values determined in step (g2) for said first sequence,
(i) accepting sequence Ez and sequence Sy as similarity candidates for each sequence Ez for which the common q-gram score D[z] is higher than a given threshold value, and
(j) for each accepted pair of sequences (Ez, Sy), performing a similarity determination.
It is also preferred here that steps (e) to (j) are performed for a plurality of second sequences (Sy) with use of the same count sets (HHT[i]) as constructed and stored in step (d).
In a preferred embodiment, said similarity determination comprises the following steps:
(k) determining a sub-sequence length L lower than the lengths of the sequences Ez, Sy,
(l) determining a percentage P of identities,
(m) determining whether the two sequences Ez, Sy share a common sub-sequence having length L and having a percentage P of identical characters.
In the case of very long sequences such as genomes, said plurality of said first sequences Ez may be constituted by overlapping blocks Bj of the long sequence G, and, after at least two steps (i) have accepted as candidates a given second sequence S and at least two first sequences Bj, Bj+1, ... which are contiguous blocks in long sequence G, step (j) is performed on the second sequence S and on a supersequence SB constituted by the first sequences Bj, Bj+1, ... re-concatenated with elimination of the overlap L.
According to another preferred feature, the method comprises an initial step in which a user selects a sub-sequence length L for similarity determination, and a step of constructing said overlapping blocks with an overlap length L equal to the sub-sequence length.
According to a further aspect, the present invention provides a method for clustering a plurality of nucleic or proteic sequences E1, ..., Ee (SetE) and/or S1, ..., Ss (SetS), each position in each sequence belonging to a predetermined alphabet Σ, said sequences being stored in a computer memory, the method comprising the following steps: (a) determining a word length q having a value substantially lower than the number of characters in said sequences, so as to define from this word length and from the alphabet Σ a set of possible words (q-grams) wi having length q,
(b) storing in a working memory a plurality of first sequences
{Ez},
(c) for each q-gram wi and for each sequence Ez, counting the number of occurrences of the q-gram in the sequence,
(d) constructing and storing in said working memory, in association with each q-gram wi, a respective count set HHT[i] containing, for each member of the count set, an identifier z of a first sequence Ez having at least one occurrence of the q-gram and a first occurrence count k(i,z) of the q-gram in sequence Ez, as determined in step (c),
(e) counting the number of occurrences of each q-gram wi in a second sequence Sy for generating second occurrence counts HTq(Sy)[i],
(f) storing in the working memory the second occurrence counts HTq(Sy)[i] as determined in step (e) as well as a set Iq(Sy) of identifiers i of q-grams wi for which the counts HTq(Sy)[i] are strictly positive,
(g) for each q-gram identifier i contained in set Iq(Sy):
(g1) reading the count set HHT[i] constructed in step (d) in relation with the q-gram wi,
(g2) for each sequence Ez identified in count set HHT[i], determining the minimum value between the corresponding first occurrence count k(i,z) and the second occurrence count
HTq(Sy)[i],
(h) for each sequence Ez, determining and storing a common q-gram score D[z] with said second sequence Sy based on the minimum values determined in step (g2) for said first sequence,
(i) accepting sequence Ez and sequence Sy as similarity candidates for each sequence Ez for which said common q-gram score D[z] is higher than a given threshold value,
(j) for each accepted pair of sequences (Ez, Sy), performing a similarity determination, and
(k) for each pair of sequences (Ez, Sy) determined as similar, attaching to said sequences a common cluster attribute.
Steps (e) to (k) are preferably performed for a plurality of second sequences (Sy) with use of the same count sets (HHT[i]) as constructed and stored in step (d), so that the processing resources and memory space used for count sets HHT[i] are saved.
Advantageously, this method additionally comprises the following steps: (l) identifying possible other sequences which were previously sharing a common cluster attribute with either of the sequences determined as similar, and
(m) attaching to said other sequences said common cluster attribute.
In addition, the method further comprises, before step (j), the steps of checking whether the similarity candidates have a same cluster attribute and, if not, performing step (j), or, if so, skipping step (j) and passing to the next first sequence Ez or second sequence Sy.
This avoids useless similarity determinations, e.g. when the purpose of the method is to merge two clusters as soon as two sequences respectively belonging to these two clusters are reported as similar.
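The patent leaves the data structure for cluster attributes open; by way of illustration only, the attribute bookkeeping of steps (k) to (m) and the pre-step-(j) check could be sketched with a union-find structure in Python (class and method names are hypothetical, not part of the claimed method):

```python
class ClusterAttributes:
    # Illustrative union-find sketch: each sequence index starts in its
    # own cluster; merging two clusters is a single parent-pointer update,
    # so all sequences previously sharing an attribute follow automatically.
    def __init__(self, n):
        self.parent = list(range(n))

    def attribute(self, z):
        # Follow parent pointers to the representative cluster attribute,
        # with path halving to keep lookups fast
        while self.parent[z] != z:
            self.parent[z] = self.parent[self.parent[z]]
            z = self.parent[z]
        return z

    def report_similar(self, z1, z2):
        # Steps (k)-(m): attach a common cluster attribute to both
        # sequences and to every sequence already sharing one with them
        a1, a2 = self.attribute(z1), self.attribute(z2)
        if a1 != a2:
            self.parent[a2] = a1

    def same_cluster(self, z1, z2):
        # Pre-step-(j) check: a similarity determination can be skipped
        # when both candidates already carry the same cluster attribute
        return self.attribute(z1) == self.attribute(z2)
```

With this choice, checking whether two candidates already share a cluster attribute is nearly constant-time, which is what makes skipping step (j) profitable.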
Brief description of the drawings
Other aims, aspects and advantages of the present invention will appear more clearly from the following detailed description of a preferred embodiment thereof, given by way of example only and made with reference to the appended drawings, in which:
Figure 1 illustrates a plurality of sequences derived from a mRNA strand and showing similarity windows,
Figure 2 illustrates in greater detail the existence of a similarity window between two sequences,
Figure 3 shows the structure and contents of an exemplary hash table used in the filtering method of the present invention,
Figures 4a and 4b illustrate a chunk mode of filtering according to the present invention, in bank-against-bank and in bank-against-itself modes, respectively,
Figure 5 shows the structure and contents of a huge hash table used in the filtering method of the present invention,
Figure 6 shows the main steps of the filtering method of the present invention based on the tables shown in Figure 3 and Figure 5,
Figure 7 illustrates the basic operation of the present invention for clustering sequences,
Figure 8 illustrates the cutting of a long sequence into blocks to allow filtering according to the present invention,
Figure 9 illustrates an optional step of the present invention in the context of Figure 8, and
Figure 10 illustrates overlap issues when cutting a long sequence.
Detailed description of the preferred embodiment
a) Definitions
First of all, a more formal definition of the above-mentioned Initial Clustering Similarity (ICS) between two sequences of characters such as two ESTs S1 and S2 will be provided. The so-called "Hamming distance" between two sequences will first be defined, and then the similarity measure based on such distance.
- Definition 1 : Hamming distance
Let S1 and S2 be two sequences of characters, each character belonging to an alphabet Σ. The Hamming distance between S1 and S2, denoted H(S1, S2), is the minimal number of substitutions needed to transform S1 into S2.
- Definition 2 : Gq(S)
Let S be a sequence of n characters s1s2s3...sn belonging to Σ. Gq(S) is the set of q-grams (words of length q) of S defined as:
Gq(S) = {wi = si...si+q-1, 1 ≤ i ≤ n-q+1}
(Of course, this set is empty when n < q, i.e. when the sequence length is shorter than the q-gram length.)
- Definition 3 : Initial Clustering Similarity
The Initial Clustering Similarity between two sequences S1 and S2, denoted ICSL(S1, S2), is defined as:
ICSL(S1, S2) = min({H(w1, w2) | w1 ∈ GL(S1) and w2 ∈ GL(S2)} ∪ {∞})
- Definition 4 : k-similarity
Two ESTs S1 and S2 are k-similar if:
ICSL(S1, S2) ≤ k
Two ESTs are similar with a percentage P of identities if:
ICSL(S1, S2) ≤ ⌊L·(100 - P)/100⌋
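Definitions 1 to 4 above can be rendered, purely for illustration, as a brute-force Python sketch (the function names are hypothetical; this naive enumeration of window pairs is not the practical algorithm of the invention, which is precisely what the later filtering sections avoid):

```python
def hamming(w1, w2):
    # Definition 1: number of substitutions between two equal-length words
    return sum(c1 != c2 for c1, c2 in zip(w1, w2))

def grams(s, L):
    # Definition 2: G_L(S), the windows (L-grams) of S, as a list
    return [s[i:i + L] for i in range(len(s) - L + 1)]

def ics(s1, s2, L):
    # Definition 3: minimum Hamming distance over all pairs of length-L
    # windows, or infinity when either sequence is shorter than L
    pairs = [hamming(w1, w2) for w1 in grams(s1, L) for w2 in grams(s2, L)]
    return min(pairs) if pairs else float("inf")

def similar_with_identity(s1, s2, L, P):
    # Definition 4: similarity with a percentage P of identities,
    # i.e. ICS_L(S1, S2) <= floor(L * (100 - P) / 100)
    return ics(s1, s2, L) <= (L * (100 - P)) // 100
```

For instance, ics("TTACGTT", "GGACGGG", 4) returns 1, since the windows "ACGT" and "ACGG" differ by a single substitution.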
b) Similarity determination
A suitable algorithm is required for verifying whether two ESTs S1 and S2 are k-similar with respect to a given window size L.
The theoretical complexity of this problem is undetermined. To the knowledge of the applicant, the only complexity bound concerns the particular case when the length L is equal to the length |S1| of sequence S1 or to the length |S2| of sequence S2. The reason is that, in such a case, the algorithm merely has to search for one sequence (the shortest) in the other sequence (the longest), allowing at most k mismatches. It has been shown in the literature that the worst case complexity in such a case is:
O(n·√(k·log k))
From a practical point of view, it may be appropriate to use an algorithm that has a bad worst case complexity but a good complexity for average cases. Moreover, as will be described later, the fact that a set of a plurality of sequences S1, S2, ..., Sn is to be compared with another set of a plurality of sequences has to be taken into account when designing the algorithm.
c) Common q-grams filtering
Verifying the similarity between two ESTs is expensive in complexity, both theoretically and practically. To decrease the number of elementary operations, a filtering operation is first performed on the two ESTs to determine whether they could verify the similarity. In such a case, a possibly significant decrease in complexity can be obtained, which relies on (a) the efficiency of the filter and (b) the good practical time complexity of the filtering algorithm.
When the above-mentioned percentage P is large (typically greater than 90%, which covers the large majority of cases of interest), a suitable filtering method is based on q-gram filtering. As mentioned hereinabove, a q-gram is merely a sequence or "word" of length q.
Practically, the filtering method consists in counting the number of common q-grams between two sequences, i.e. the number of q-grams that are present in both sequences.
An important relation between filtering and k-similarity determination is that two sequences that could be k-similar necessarily share at least Nq q-grams, where Nq is determined using the following lemma:
Lemma 1
Let S1 and S2 be k-similar sequences, i.e.:
ICSL(S1, S2) ≤ k
In such circumstances, sequences S1 and S2 share at least Nq q-grams, where:
Nq = L + 1 - (k + 1)·q
The filter will therefore be useful each time Nq is a positive number. If q is fixed, this means that:
L + 1 - (k + 1)·q > 0
i.e.:
k < (L + 1 - q)/q
i.e.:
P > 100·(L + 1)·(q - 1)/(L·q)
according to Definition 4.
Practically, if q-grams having a length q of 10 characters are considered, and if L is large, then the above inequality leads to P > 90%, a condition under which the filter will be efficient in all practical cases.
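As a quick numerical check of Lemma 1 and of the figures quoted above, the threshold computation can be sketched as follows (a sketch with illustrative function names):

```python
def k_from_percentage(L, P):
    # Definition 4: maximal number k of mismatches allowed for a
    # percentage P of identities over a window of length L
    return (L * (100 - P)) // 100

def nq_threshold(L, k, q):
    # Lemma 1: k-similar sequences share at least L + 1 - (k + 1)*q q-grams
    return L + 1 - (k + 1) * q
```

With the typical values L = 150 and P = 96% quoted in the background section, k = 6, and with q = 10 the threshold is Nq = 81, a positive number, so the filter is useful in this typical setting.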
d) Multiplicities
In order to filter the two ESTs S1 and S2, their common q-grams are counted by using hash tables. A hash table HTq(S) associates to every q-gram in a sequence S its number of occurrences, i.e. its multiplicity, in said sequence. The value q is selected small enough for a complete hash table to be stored without redundancy.
The complexity of building such a hash table is of the order O(|S|), where |S| is the length of sequence S, considering that the hash function is incremental and computable in O(1) for each new character of S. The size of the computer memory for storing such a hash table is equal to σ^q, where σ is the size of the alphabet Σ.
According to the present invention, the filtering operation on two sequences S1 and S2 is done by the following steps:
(i) an initialization of the hash table for sequence S1 is done once, at the beginning of the whole process, with an order of complexity equal to O(σ^q); a second table Iq(S1), also having an order of complexity O(σ^q), is also initialized; this table is intended to store a list of indices of the hash table HTq(S1), i.e. the positions in table HTq(S1) where there is a strictly positive count value;
(ii) sequence S1 is parsed and all its q-grams are considered in sequence; each time a new q-gram, hashed to index i, is discovered, index i is added to the list of positions Iq(S1); the contents of the cell of the hash table corresponding to position i, denoted HTq(S1)[i], is incremented so as to indicate that another such q-gram has been identified in sequence S1; it should be noted here that this step (ii) has an order of complexity of O(|S1|);
(iii) steps (i) and (ii) are repeated for the second sequence S2, so as to obtain:
- a hash table HTq(S2) storing the multiplicities of the q-grams of S2;
- a table Iq(S2) that stores the marked indices in HTq(S2);
(iv) a variable count of a counter is then initialized to 0; this variable count aims at storing the number of common q-grams between sequences S1 and S2; then, for each index i contained in Iq(S1) which is also present in Iq(S2), the variable count is incremented by the smallest of the values HTq(S1)[i] and HTq(S2)[i]; at the end of this process, the value of count is the number of common q-grams between sequences S1 and S2; the time complexity of this step is equal to O(|S1|);
(v) cells HTq(S1)[i] of table HTq(S1) are then individually reinitialized, using the list of positions contained in position table Iq(S1); it should be noted here that this avoids going through the whole table HTq(S1), as only the cells whose contents had been changed from zero to a positive number are reinitialized; the time complexity of this operation is at most O(|S1|); the same selective reinitialization is performed on table HTq(S2) using position table Iq(S2), with a time complexity at most equal to O(|S2|).
Regarding overall complexity, the above filter steps have a time and space complexity O(σ^q) for the pre-processing and a complexity O(|S1| + |S2|) for the main processing.
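The five steps above may be sketched as follows. Python dictionaries stand in for the fixed-size tables HTq(S) and the index lists Iq(S) (which makes the selective reinitialization of step (v) implicit), so this is an illustrative sketch rather than the exact memory layout described above:

```python
def count_common_qgrams(s1, s2, q):
    # Steps (i)-(iii): build, for each sequence, a table of q-gram
    # multiplicities (a dict keyed by the q-gram itself)
    def multiplicities(s):
        ht = {}
        for i in range(len(s) - q + 1):
            w = s[i:i + q]
            ht[w] = ht.get(w, 0) + 1
        return ht

    ht1, ht2 = multiplicities(s1), multiplicities(s2)
    # Step (iv): for every q-gram marked in both tables, add the
    # smaller of the two multiplicities to the common q-gram count
    return sum(min(c, ht2[w]) for w, c in ht1.items() if w in ht2)
```

Two sequences then pass the filter when this count reaches the threshold Nq of Lemma 1.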
A simplified example of the above filtering process is shown in Figure 3 of the appended drawings. An exemplary EST sequence S1 is equal to "ACGTACACG". The value of q is 2. All sixteen possible q-grams are shown in the Figure. The contents of the one-dimension table HT2(S1) are also shown, as well as the contents of table I2(S1). The latter is a variable-length, one-dimension table indicating as a list the positions of hash table HT2(S1) having a non-zero value. For instance, the contents of memory cell HT2(S1)[5] is "1" in the shown example.
e) Pairwise filtering
Although sequence-to-sequence filtering has been described in the foregoing, practical applications will involve comparison of a first set of sequences SetS = {S1, S2, S3, ..., Ss} with a second set of sequences SetE = {E1, E2, E3, ..., Ee}, where each sequence Si of sequence set SetS will have to be compared with each sequence Ej of sequence set SetE (pairwise comparison).
A first "basic" approach to this problem would be to use the above- described filter in sequence for each pair (S;, Ej).
Considering the sum of the lengths of the sequences Si belonging to SetS, denoted Ks:
Ks = |S1| + |S2| + ... + |Ss|
and the sum of the lengths of the sequences Ej belonging to SetE, denoted Ke:
Ke = |E1| + |E2| + ... + |Ee|
the "basic" approach as explained above would have an order of complexity equal to O(σ^q) regarding pre-processing time, and the following time complexity PF:
PF = (e·|S1| + Ke) + (e·|S2| + Ke) + ... + (e·|Ss| + Ke) = e·Ks + s·Ke
at running time, using each time O(σ^q) memory space.
An improvement to this basic approach would consist in performing the hash processing once on each sequence Ej of sequence set SetE, and then comparing each Si of sequence set SetS against it (or conversely hashing each Si once, and comparing each Ej against it).
In such a case, it can be understood that the complexity would become:
PF* = Ke + (e·|S1| + e·|S2| + ... + e·|Ss|) = e·Ks + Ke
at running time, still requiring O(σ^q) memory space.
It should be noted here that, if SetE is the same as SetS, which is typically the case for bank-to-itself sequence clustering, then the whole filtering complexity would become:
at running time, still requiring O(σ^q) memory space.
f) Considering sequence blocks
An application of the present invention is the clustering of a large bank of ESTs against itself or against another bank of ESTs. Typically, such sets can be as large as 3 billion sequences. The above-described improvement would indeed accelerate pairwise comparisons, but its worst case complexity would also be its average complexity, since every new sequence Si is hashed every time, which costs O(|Si|) in terms of complexity.
In order to speed up the whole process on average, an improvement according to an aspect of the present invention consists in performing filtering between an EST Si belonging to a sequence set SetS containing s ESTs S1, S2, S3, ..., Ss and a preprocessed set SetE containing e ESTs E1, E2, E3, ..., Ee. The set SetE is called a "chunk" in the following.
The underlying idea is to become less dependent on the real size of the ESTs and to depend more on the number of common q-grams, which is much smaller on average. Figure 4a diagrammatically illustrates the whole comparison workflow, where it can be seen that a first bank B1 of sequences E1, E2, ..., Ee is subdivided into a plurality of chunks SetE1, SetE2, etc., while each sequence S1, S2, ..., Ss in a second bank B2 is individually used as a query sequence against each chunk, as shown.
If banks B1 and B2 each comprise the same sequences E1, E2, ..., Ee, then the symmetric character of q-gram filtering as described above is such that a comparison triangle is sufficient, as diagrammatically shown in Figure 4b.
In order to achieve this aim, the multiplicities of q-grams of the sequences contained in a given chunk SetE are practically determined at the chunk level by building in a working memory a Huge Hash Table HHT. As will be illustrated in the following, such a hash table associates to each possible q-gram a list of the ESTs contained in SetE in which this q-gram appears, together with its number of occurrences (i.e. its multiplicity) inside each concerned sequence. The size of this table HHT is at most equal to e·σ^q, where e is the number of sequences in chunk SetE.
A sequence S is compared to the contents of the HHT table by using a one-dimension table D having a size e, where e is the number of sequences in chunk SetE, each cell being initialized to 0. Firstly, the sequence S is parsed and the hash table HTq(S) and index table Iq(S) as described in the foregoing are built. Then, for each index i in table Iq(S), which designates a given q-gram wi, the process goes through the list of couples of values (z, k(i, z)) contained in a portion HHT[i] of table HHT, where z is the order of the sequence in chunk SetE (i.e. it designates a particular sequence Ez among sequences E1, E2, ..., Ee) and k(i, z) is the number of times the considered q-gram wi appears in sequence Ez. For each value of z, it is determined which of the values stored in cell HTq(S)[i] and in cell HHT[i, z] is the smallest one, and this value is stored in the cell D[z] of table D which is allocated to sequence Ez. It should be noted here that the structures of tables HHT, HTq(S) and Iq(S) are such that all counts which have appeared to be zero (absence of the considered q-gram) do not need to be accessed and processed, which significantly contributes to efficiency.
The loading of cells D[z] in table D is incremental, i.e. as soon as a new value is to be loaded into a given cell D[z], it is added to the value previously stored in the same cell.
At the end of the whole process, each value D[z] in table D represents the number of common q-grams between sequences S and Ez, for all values of z comprised between 1 and e.
Then, all sequences Ez that show a value D[z] which is greater than a given threshold Nq are selected for subsequent actual comparison with sequence S (such as k-similarity determination as described in the foregoing).
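Under the same caveat as before (Python dictionaries instead of the fixed-size tables, illustrative function names), the chunk-level filtering against the HHT table may be sketched as:

```python
from collections import defaultdict

def build_hht(chunk, q):
    # Huge Hash Table: maps each q-gram to the list of couples (z, k(i, z)),
    # i.e. the index z of a sequence Ez of the chunk and the multiplicity
    # of the q-gram in that sequence (zero counts are never stored)
    hht = defaultdict(list)
    for z, e in enumerate(chunk):
        counts = {}
        for i in range(len(e) - q + 1):
            w = e[i:i + q]
            counts[w] = counts.get(w, 0) + 1
        for w, k in counts.items():
            hht[w].append((z, k))
    return hht

def filter_against_chunk(s, chunk, hht, q, threshold):
    # Table D of common q-gram scores of S against every chunk sequence
    d = [0] * len(chunk)
    counts = {}
    for i in range(len(s) - q + 1):
        w = s[i:i + q]
        counts[w] = counts.get(w, 0) + 1
    for w, c in counts.items():
        for z, k in hht.get(w, []):
            d[z] += min(c, k)  # incremental loading of D[z]
    # select the sequences whose score exceeds the threshold Nq
    return [z for z, score in enumerate(d) if score > threshold]
```

With the chunk of Figure 5 and the query "ACGTACACG", the scores are D = [8, 2, 2, 4] (0-based indices in this sketch), so a threshold of 3 retains only E1 and E4 for subsequent actual comparison.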
Figure 5 illustrates in a simplified manner the above process. A huge hash table HHT is built for a set of sequences SetE = {E1, E2, E3, E4} having the following contents:
E1 = "ACGTACACG"
E2 = "AAGTAAAGGT"
E3 = "CCGCCA"
E4 = "AAGTACTCG"
The value e is therefore equal to 4 and as in the previous example, q is set to 2.
The contents of table HHT for the chunk SetE appears in Figure 5. For each 2-gram wi, a memory cell HHT[i, z] is filled in table portion HHT[i] each time a sequence Ez contains at least one such 2-gram. For instance, the 2-gram "AA" is found three times in sequence E2 and once in sequence E4, so that two corresponding cells are filled under the "AA" 2-gram position. Also particularly highlighted in Figure 5 are:
- the fact that cell HHT[i, z] for i = 7 and z = 4 contains 4 as value of z and 1 as value of k(i, z);
- the three cells contained in portion HHT[i] for i = 12, with the last cell containing a value of z equal to 4 and a value of k(i, z) equal to 1.
Turning now to Figure 6, a set of sequences SetS = {S1, S2, S3, S4} is provided. In the present example, sequences S1-S4 are identical to sequences E1-E4, respectively, which reflects a "bank-against-itself" filtering.
After table HHT has been built as described above, sequence S1 is compared to table HHT in Step 1. For this purpose, hash table HT2(S1) and index table I2(S1) are built. The sequence S1 being the same as in the example shown in Figure 3, these tables are identical to those shown in Figure 3.
In addition, a table D having an initial contents of {0, 0, 0, 0} is created.
The process goes as follows: - as the first index i in table I2(S1) is "2", portion HHT[2] of table HHT is parsed, but sequence E1 is skipped (since E1 is the same as S1, and since a sequence-against-itself filtering has no interest);
- sequence E4 (z = 4) is identified there as having one (k(2, 4) = 1) occurrence of 2-gram w2 (being "AC"); - a determination is made of the minimum value between HT2(S1)[2] (value 3) and HHT[2, 4] (value 1), so that D[4] in table D is loaded with value 1;
- as the end of portion HHT[2] in table HHT has been reached, the next index (i = 5) in table I2(S1) is read; - portion HHT[5] (corresponding to 2-gram w5 = "CA") in table HHT is then parsed; sequence E1 being skipped, sequence E3 is identified there as having one occurrence of this 2-gram; the minimum value between HT2(S1)[5] (value 1) and HHT[5, 3] (value 1), i.e. 1, is loaded as D[3] in table D; - the process is repeated and the contents of cells D[2], D[3] and D[4] are progressively incremented with the minimum values determined as just explained.
At the end of the process, it is easy to understand that table D will have the following contents: {X, 2, 2, 4} (X means that the contents are not significant, as they would correspond to a sequence-against-itself filtering).
By the same routine, S2 = E2 is used as query sequence in Step 2 and will lead to a table D having the following contents: {X, X, 0, 4} (the first X means that the contents is not useful, as S1 (=E1) had already been considered against E2 (=S2) in the first run, and the second X means that S2 (=E2) does not have to be considered against itself).
S3 = E3 is then used as query sequence in Step 3 and will lead to a table D having the following contents: {X, X, X, 1}.
Finally, it is useless to use S4 = E4 as query sequence since it has already been considered against all other sequences.
Figure 6 also illustrates an application of the above-described filtering process to the clustering of sequences S1, ..., S4, here again considered as identical to sequences E1, ..., E4. It is supposed that each of these sequences originally belongs to a different original cluster, respectively CL1, CL2, CL3 and CL4, as illustrated in Figure 6.
First of all, it should be noted that the above filtering process is followed by a k-similarity determination process designated as "FWSB", having as parameters L=5 (window length) and k=1 (maximal number of substitutions) - cf. supra. A practical implementation of this process will be described at a later stage.
From the above-described Lemma 1, Nq, i.e. the threshold for the filtering process, can be computed:
N2 = (5 + 1 - (1 + 1)·2) = 2
As a consequence, every cell of every table D which will contain a number at least equal to 2 will reveal that the considered sequence couple passes the filter.
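Assuming Lemma 1 yields Nq = L + 1 - (k + 1)·q, which is consistent with the numerical example just given, the threshold computation can be sketched as follows (hypothetical function name):

```python
def filter_threshold(window_len, k, q):
    """Nq = L + 1 - (k + 1) * q: a window of length L with at most k
    substitutions is guaranteed to share at least Nq q-grams."""
    return window_len + 1 - (k + 1) * q
```

With L=5, k=1 and q=2, this gives N2 = 2 as above; with the typical initial-clustering values (L around 150, small k), the threshold becomes correspondingly larger.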
The table D as obtained above with sequence S1 indicates that S1=E1 passes the filter with all three sequences E2, E3 and E4, which implies that all couples (S1, E2), (S1, E3) and (S1, E4) will proceed through the k-similarity determination.
Figure 6 shows that only (S1, E2) and (S1, E4) pass the k-similarity determination, so that sequences S1=E1, E2 and E4 now belong to the same cluster CL124.
The table D obtained with query sequence S2 indicates that couple (S2, E4) passes the filter. It is now assumed that the k-similarity determination process on this couple determines that k-similarity is achieved. However, since S2=E2 and E4 already belonged to the same cluster (cf. supra), there is no cluster rearrangement required.
Finally, table D obtained with query sequence S3 reveals that sequence pair (S3, E4) does not pass the filter. Therefore, this pair is not directed to the k-similarity determination process, and the cluster arrangement is left as such.
g) k-similarity determination process
In a practical embodiment of the present invention, the k-similarity determination process is based on the well-known "BLAST" method, which is described for instance in S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool", Journal of Molecular Biology, 215:403-410, 1990.
More particularly, the present invention preferably uses a slightly modified implementation of BLAST method which is known per se: practically, the process comprises performing the BLAST method among the sequence pairs which have passed the q-gram filtering, so as to provide a fast search of the longest identity regions shared by each sequence pair. A mere checking of the BLAST output is then performed so as to isolate, in such regions, at least one region which satisfies the ICS determination with given parameters L and P (cf. supra).
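The final ICS check on a BLAST-reported identity region can be sketched as follows (a hypothetical helper, assuming the region is given as two gapless aligned strings of equal length; some window of length L must reach at least P percent identity):

```python
def passes_ics(a, b, window_len, p):
    """True if some window of length L over the aligned pair (a, b)
    contains at least p percent identical positions."""
    assert len(a) == len(b)
    if len(a) < window_len:
        return False
    # identity count in the first window, then slide one position at a time
    matches = sum(x == y for x, y in zip(a[:window_len], b[:window_len]))
    best = matches
    for i in range(window_len, len(a)):
        matches += (a[i] == b[i]) - (a[i - window_len] == b[i - window_len])
        best = max(best, matches)
    return best >= p * window_len / 100.0
```

The sliding update keeps the check linear in the region length, so this post-processing adds little to the cost of the BLAST search itself.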
h) Numerical values
To get an idea of the gain which can be expected with a filtering according to the present invention, Table 1 hereinbelow gives numerical values for sets of sequences SetS and SetE each having e = s = 100,000 sequences, when the size n of each sequence is respectively 200, 300, 500 and 1,000. The expected filtering time in a symmetric Bernoulli model (on nucleic sequences, i.e. with an alphabet size σ equal to 4) has been computed for:
(a) a single chunk, assuming that the memory space is large enough for the whole hash table HHT (column "BF" in Table 1);
(b) two chunks of 50,000 ESTs each (column "BF2");
(c) four chunks of 25,000 ESTs each (column "BF4").
Finally, the computation of PF*/BF in the rightmost column of Table 1 gives an estimation of the expected gain with a filtering according to the present invention.
i) Single link clustering
Figure 7 of the appended drawings illustrates the global mechanism for the comparison of two ESTs S and E belonging respectively to two
different sets of sequences SetS and SetE, where each set initially forms a respective cluster of sequences.
The sequences S and E are subjected to a clustering process in two main steps:
- first the sequences S and E are subjected to q-gram filtering by using chunks as described in the foregoing;
- when the two ESTs pass through the filter, i.e. the count for (E, S) is at least equal to Nq, they are compared with each other, e.g. with the above-described FWSB process, so as to determine whether they are k-similar. If so, the process considers that sequences S and E should belong to the same cluster, so that clusters SetS and SetE are merged together into a cluster SetX.
The links between the ESTs belonging to such a resulting cluster (SetE ∪ SetS) should now be considered. With the systematic approach as described in the foregoing, according to which every sequence is tested against each other one, all the ESTs inside said resulting cluster have been tested against the others for similarity, and certain couples have been declared similar while other couples have been declared not similar. Membership of all these ESTs in the same cluster results from the transitivity implied by cluster merging.
In the meantime, the following observations can be made: (a) a user is generally interested in some specific clusters, but not all clusters: for instance, a user will merge his own (proprietary) sequences with public ones and be interested in the clusters where at least one of his proprietary sequences appears,
(b) studying a cluster in depth requires re-computing the links between all ESTs in the set with a tighter or stricter similarity notion and finer overlap weights. In consequence, it is not necessary to compute all the similarity links between the ESTs of the same cluster.
According to an improvement of the present invention, a so-called "single link" clustering is performed, i.e. if two ESTs already belong to the same cluster by transitivity, the actual similarity test between both is not performed.
More precisely, a so-called "Union-Find algorithm", as disclosed e.g. in the paper "Efficiency of a good but not linear set union algorithm", R.E. Tarjan, Journal of the ACM, 22(2), pp. 215-225, 1975, is used for testing whether two sequences are already in the same cluster or not before performing the time- and memory-space-consuming comparison of the two sequences by the FWSB similarity process as described in the foregoing.
It can be demonstrated that the Union-Find algorithm is not linear, but in practice it is very close to being linear. From there, it can be shown that the tests for determining whether two sequences belong to the same cluster (i.e. whether they have the same cluster attribute) require at most an additional time of the order of log*(n).
For instance, and turning back to Figure 6, this test implies that the FWSB process on the pair (S2, E4) will not be performed, as S2 (=E2) and E4 have previously been grouped into the same cluster.
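The single-link test can be sketched with a textbook Union-Find structure using path compression and union by rank, as in Tarjan's paper (class and method names are illustrative, not the patent's own):

```python
class UnionFind:
    """Disjoint sets over n sequences: find() returns the cluster attribute
    of a sequence, union() merges two clusters."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False          # already in the same cluster by transitivity
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx       # union by rank: attach shorter tree
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True
```

Before running the FWSB process on a couple, the find() values are compared; in the Figure 6 example, the pair (S2, E4) would be found to lie in the same cluster already, and the costly comparison would be skipped.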
k) Mapping sequences on genomes
The present invention can easily be adapted to map ESTs on a genome G. The basic principle is to split the genome G into blocks having a given length d that overlap with o bases, and to consider all these blocks as the queries for the filtering process. Figure 8 shows such a split chromosome. In order to avoid the risk of undetected k-similarities, the overlap is chosen at least equal to the window length L selected by the operator at the beginning of the process and used in the FWSB k-similarity process. This will be further explained in the following.
At this stage, this approach is similar to the QUASAR prior art mentioned in the introductory part of the present application.
The process according to the present invention is as follows:
- as shown in Figure 8, the genome G is firstly split into blocks B1, ..., Bm of length d (except the last block Bm, which may have a different length), d being chosen greater than the length of the longest EST in the set of sequences SetS to which genome G is to be compared, with an overlap of L bases, L being the length of the similarity window;
- a huge hash table HHT is then built for the set of blocks B1, ..., Bm obtained by this process; - all the ESTs of SetS are individually compared to the blocks B1, ..., Bm by parsing table HHT as described above;
- however, an important feature of the inventive process is that, if a given EST Si of SetS passes through the filter with at least two contiguous blocks, e.g. with three blocks Bj, Bj+1 and Bj+2, then the k-similarity determination is performed between EST Si and a "superblock" SB which is the concatenation of said contiguous blocks, i.e. Bj, Bj+1 and Bj+2 in the present example, with elimination of the overlap(s);
- all the longest matching positions for which k-similarity between Si and the superblock SB is achieved are reported.
This is illustrated in Figure 9 of the drawings.
Turning back to the above-mentioned overlap of length L between two consecutive blocks, it has the purpose of avoiding the situation shown in Figure 10 of the drawings, where an overlap o smaller than L has been selected. In such a situation, the occurrence of a sequence lying at the overlap between two consecutive blocks Bk, Bk+1 and similar to a query sequence Sk could be missed, since neither of the blocks Bk nor Bk+1 would have enough q-grams in common with the query sequence to pass through the filter.
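The split-and-reconcatenate steps above can be sketched as follows (hypothetical helpers; blocks of length d overlap by at least L bases, and contiguous blocks that passed the filter are concatenated back into a superblock with the shared overlaps eliminated):

```python
def split_genome(genome, d, overlap):
    """Split genome G into blocks of length d, consecutive blocks sharing
    `overlap` bases (overlap >= L avoids misses at block boundaries)."""
    assert d > overlap
    step = d - overlap
    blocks = []
    i = 0
    while True:
        blocks.append(genome[i:i + d])   # last block may be shorter
        if i + d >= len(genome):
            break
        i += step
    return blocks

def superblock(blocks, j_first, j_last, overlap):
    """Concatenate contiguous blocks Bj_first..Bj_last that passed the
    filter, dropping each shared overlap, before the k-similarity test."""
    sb = blocks[j_first]
    for j in range(j_first + 1, j_last + 1):
        sb += blocks[j][overlap:]
    return sb
```

Since concatenating all the blocks reproduces the original genome text exactly, a match straddling a block boundary is recovered intact inside the superblock, which is what prevents the double (or missed) matchings discussed above.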
The flexibility of the above split-and-reconcatenate approach derives from the use of hash tables, which strongly simplifies the whole process as will be explained in the following:
- first of all, since there is no pre-processing required, a genome can be split into blocks "on the fly" with an overlap preferably equal to the window length L selected by the user as a parameter, which allows different, optimal genome splittings depending on the value of L;
- secondly, by concatenating again, as they initially were, the blocks which have passed the filtering process with a given query sequence, erroneous double matchings are avoided, as matching portions located at the overlap between two blocks will be merged again into a single matching portion.
Many variants and modifications can be brought by one skilled in the art to the present invention. In particular, the filtering and similarity processes can easily be parallelized in a multi-processor environment, whether in a shared or distributed memory architecture.
The following Table II indicates the speed-up achieved by parallel processing in the practical example of a set of 100,000 ESTs compared against itself, with each EST having a length greater than 800 and the average length being 927. The computer environment was an AlphaServer ES40/466 with four 666 MHz processors, as available from Compaq.
Table II