WO2001061557A2

WO2001061557A2 - Sequence matching

Info

Publication number: WO2001061557A2
Application number: PCT/GB2001/000631
Authority: WO
Inventors: Michael Turner; Simon Moss; Paul Zanelli
Original assignee: Pc Multimedia Limited
Priority date: 2000-02-16
Filing date: 2001-02-16
Publication date: 2001-08-23
Also published as: WO2001061557A3; AU2001233858A1

Abstract

A method of determining the degree of match between a plurality of strings of signifiers. The method comprises the steps of: (i) identifying a possible signifier match between two signifiers in different strings; (ii) determining an upper bound to the probability of any of the possible solutions containing the possible signifier match; (iii) comparing the upper probability bound for the possible solutions with a threshold probability; and (iv) eliminating from the set of possible match solutions those solutions including the possible signifier match if the upper probability bound for the possible solutions is less than the threshold probability.

Description

Sequence Matching

The present invention relates to matching sequences of signifiers, and in particular to a method and system for determining the degree of match between multiple sequences of signifiers.

The particular application of the invention in the field of DNA or protein sequence alignment will be discussed merely by way of an example of an application of the invention.

The simultaneous alignment of DNA or protein sequences is now an important part of molecular biology. Multiple alignments are used (1) to find diagnostic patterns to characterise protein families, (2) to detect or demonstrate homology between new sequences and existing families of sequences, (3) to help predict the secondary and tertiary structures of new sequences and (4) as an essential prelude to molecular evolutionary analysis.

The rate of appearance of new sequence data is increasing and the development of efficient and accurate automatic methods is, therefore, of major importance.

The task of multiple sequence alignment can be understood with reference to Figure 1, which shows three sequences of DNA data, each comprising 8 signifiers representing 8 bases. The goal is to find the best possible alignment between all three strings, given some model of the similarity between strings, which may include character insertions, character deletions and substitutions, and to provide some indication of the degree to which all three strings match each other. In order to align just two sequences, it is standard practice to use dynamic programming. This guarantees a mathematically optimal alignment, given a table of scores for matches and mismatches between all characters and penalties for insertions or deletions of different lengths.
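By way of illustration only, the pairwise dynamic programming referred to above can be sketched as follows. The function name and the score values (`match`, `mismatch`, `gap`) are assumptions chosen for the example, and a simple linear gap penalty is used in place of a full table of scores.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings x and y (linear gap penalty).

    The score values are illustrative assumptions, not taken from the text.
    """
    rows, cols = len(x) + 1, len(y) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align x[i-1] with y[j-1]
                score[i - 1][j] + gap,       # x[i-1] against a gap
                score[i][j - 1] + gap,       # y[j-1] against a gap
            )
    return score[-1][-1]
```

The table has one cell per pair of prefixes, so the cost grows with the product of the two string lengths. Extending the same table to K strings gives one cell per K-tuple of prefixes, which is the combinatorial growth discussed next.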

Attempts at generalising dynamic programming to multiple alignments have been limited to small numbers of short sequences. This is because of the combinatorial nature of the problem. For example, for much more than eight or so proteins of average length, the problem is incomputable given current computer power. Therefore, all current methods capable of handling larger problems in practical time scales make use of heuristics.

Currently, the most widely used approach for multiple sequence alignment is the progressive method of Feng and Doolittle. This exploits the fact that homologous sequences are related through evolution. It is therefore assumed possible to build up a multiple alignment progressively by a series of pairwise alignments (following the branching order in a phylogenetic tree).

First the most closely related sequences are aligned, gradually adding in the more distant ones.

In simple cases, the quality of the alignments may be good. In more difficult cases, the alignments may give starting points for further automatic or manual refinement.

The major reason for the limitations of this prior art approach is the limited pattern recognition it employs. Multiple sequence alignment is realised through a chain of pairwise comparisons. Only the best-guess alignments early on in the chain are passed on in processing.

That is, best-guess information is passed up the processing chain. The success of this approach depends critically on obtaining good initial alignments, but this is not possible in general, and this approach does not guarantee that the best global alignment will be obtained, as that requires all the sequence data to be considered. In this specification the term global is used to indicate that all possible eventualities are considered. Hence, for instance, the global solution is the best solution out of the set of all possible solutions.

The result of utilising a 'best-guess' approach is that errors (which correspond to misalignments of the sequences compared to the best alignment of the sequences) which are introduced early on necessarily pass on to subsequent stages, causing mistakes there and thereby leading to a non-optimal solution. Attempts to improve on the recovered alignment may be subsequently made, but in essence, these simply make minor refinements to the current best-guess solution (i.e. performing a gradient-based search around the local solution) and are incapable of recovering from non-trivial errors.

The present invention relates to a new approach to determining the degree of match between multiple sequences of signifiers, which is fast and gives good alignments of sequences under a wide range of realistic conditions.

According to a first aspect of the present invention, there is provided a method of determining the degree of match between a plurality of strings of signifiers, comprising the steps of:

(i) identifying a possible signifier match between two signifiers in different strings;

(ii) determining an upper bound on the probability of any global match solution containing the possible signifier match;

(iii) comparing the upper probability bound with a threshold probability; and

(iv) eliminating from the set of global match solutions those solutions including the possible signifier match if the upper probability bound for the possible match solution is less than the threshold probability.

Given the available resources, a suitable means of computing an upper bound probability for regions of the solution space is defined. Through an iterative process, regions with low upper bounds are eliminated by comparison with a threshold, and then effort is re-applied to those regions that remain. As more and more of the solution space is eliminated, so the size of the regions covering the remaining space can be reduced without compromising resources, and more accurate upper bounds can be evaluated. In this way, the optimal solution can be identified through a process of exclusion.

The method and system are particularly suitable for determining the degree of matching between multiple sequences of DNA or proteins. However, the invention is not limited to DNA or protein sequence matching, and can be applied in any field in which it is desired to determine the degree of match between multiple sequences of signifiers which represent either a physical entity (e.g. a base) or a non-physical entity (e.g. a word). The term signifier is considered to encompass all ways of representing an item in a sequence of items. For instance the invention can be used to determine the degree of match between scan lines in an image, by matching rows of pixels in one image to rows of pixels in a second image in order to determine correspondences between the two images. A suitable signifier in this case would be a measurement vector, giving the displacement of a pixel from an origin.

Strings of any type of signifiers can be matched because the sequence alignment invention is an order-preserving string matcher. So any suitable signifier can be used provided it can be used to determine the position of an element in the string of elements. Further, the invention can use any model for the similarity between two signifiers in different strings.

Preferably, the upper bound is determined according to Bayesian probability theory.

Preferably, the step of identifying a possible signifier match is repeated so as to identify all possible signifier matches between a first signifier in a first string and each signifier in each of the other plurality of strings.

Preferably, the step of identifying a possible signifier match is repeated for each signifier in each of the plurality of strings so as to identify all possible signifier matches. In this way, the method is applied to all possible sequence alignments simultaneously. The method ensures that all plausible alignments are examined, processing being effectively the task of eliminating implausible match schemes so as to home in on the best match scheme, i.e. the solution, through a process of exclusion.

Preferably, the method includes the step of repeating steps (iii) and (iv) for all the possible signifier matches identified.

Preferably, only an identical signifier is a possible signifier match.

Preferably, more than one signifier is a possible signifier match.

Preferably, the method includes the step of recalculating the threshold. In this way different regions of phase space can be investigated more accurately, if the difference between local solutions is not sufficient to determine the best solution.

Preferably, the method includes the step of determining whether an acceptable number of possible global solutions have been determined. Once a tractable number of possible solutions have been determined, the main body of the method can be terminated.

Preferably, the method includes the step of recalculating the threshold if an acceptable number of possible global solutions have not been determined. This helps to allow the best solution to be identified by enabling local solutions to be distinguished.

At least two of the plurality of strings can have the same number of signifiers. Each of the plurality of strings can have the same number of signifiers. Each of the plurality of strings can represent a sequence of DNA, or a protein or proteins. Each signifier can represent a base.

According to a further aspect of the invention, there is provided a computer system for determining the degree of match between a plurality of strings of signifiers, comprising processing means, and the processing means operating on data representing a plurality of strings of signifiers, to:

(i) identify a possible signifier match between two signifiers in different strings;

(ii) determine an upper bound to the probability of any global match solution containing the possible signifier match;

(iii) compare the upper probability bound for the possible match with a threshold probability; and

(iv) eliminate from the set of global match solutions those solutions including the potential match if the upper probability bound for the possible match solution is less than the threshold probability.

According to further aspects of the invention there are provided a computer program, computer program code and a computer readable medium bearing instructions, which when executed by a computer carry out the method of the invention or provide the system of the invention.

An embodiment of the invention will now be described in detail, by way of example only, and with reference to the accompanying drawings, in which:

Figure 1 shows a schematic diagram illustrating the matching method according to the invention being applied to three strings of signifiers representing DNA sequences;

Figure 2 shows a flow chart illustrating the method according to the present invention;

Figure 3 shows a sequence of diagrams illustrating the probability of all the possible solutions;

Figure 4 shows pairs of strings of signifiers illustrating aspects of a matching model used in the method; and

Figure 5 shows a schematic diagram of a sequence matching computer system according to a further aspect of the invention.

Although the invention is described with reference to its application in the field of DNA sequence matching, it will be appreciated that the invention is applicable to any situation in which it is required to determine the degree of the best match between a plurality of strings of signifiers representative of physical or non-physical entities. The same items in different Figures share common reference numerals unless indicated otherwise.

Firstly a general discussion of the method and system for determining the best match between a plurality of strings of DNA bases is provided, before a more detailed mathematical description.

Figure 1 shows representations of three DNA sequences a₁, a₂, a₃, each comprising a string 110, 120, 130 of eight bases represented by the signifiers, or characters, c, t, a and g. The problem addressed is to determine the actual degree of match, similarity or alignment, between the three DNA strings, from all the possible matching schemes that the rules of a model permit.

In this case, the model of how DNA behaves has the rules that identical bases match, i.e. c↔c, t↔t, a↔a and g↔g. These are all possible matches. The rules of the DNA model also permit substitutions of some bases by others, so that in this case, the match a↔g is also a possible match. The model also permits spacings in the sequences, so that a possible match base in a second string does not have to be in the same position as a base in a first string in order for there to be a degree of matching. However, the further out of position the possible match base, the lesser the degree of matching. The strength of matching as a function of the position of the base is a feature of the model used.

At the onset of processing all possible matches are available to the system as possibilities, bar those eliminated due to prior knowledge: e.g. if we know that two signifiers cannot match, they may be excluded from consideration.

With reference also to Figure 2, as a first step 210 in the matching method of the invention, all possible matching schemes that are allowed within the rules of the model are identified. With reference to Figure 1a, the left most signifier, or character, representing a c base in the first string of DNA a₁ is considered and all its possible matches with all the bases in the second string of DNA a₂ are identified. As the only possible match is to another c base, only matches to the first and second bases in the second string are identified. The match to the first c base in the second string is stronger than the match to the second c base, as their relative positions in the strings are the same.

Once all the possible signifier matches have been identified for the first base, the possible matches for the next base in the first string are identified. This is then repeated for each base in the string until all the possible signifier matches for all the signifiers in the first string with the second string have been identified. Note that owing to the allowable substitution a↔g, the penultimate base in the first string has possible matches to both a and g bases in the second string.

This procedure is then repeated for the second DNA string a₂ with respect to the third DNA string a₃ so as to identify all possible base matches between the second and third strings.

This procedure is then repeated for the third DNA string a₃ with respect to the first DNA string a₁ so as to identify all possible base matches between the third and first strings, which are not shown in Figure 1a for the sake of clarity.
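The identification step described above may be sketched as follows. This is a minimal sketch under stated assumptions: the strings are invented for the example (they are not the sequences of Figure 1), the names `ALLOWED`, `can_match` and `possible_matches` are hypothetical, and the rule set contains only the identity matches and the a↔g substitution mentioned in the text.

```python
from itertools import combinations

# Assumed rule set: identical bases match, plus the substitution a<->g.
ALLOWED = {("a", "g"), ("g", "a")}


def can_match(x, y):
    """True if the model permits signifier x to match signifier y."""
    return x == y or (x, y) in ALLOWED


def possible_matches(strings):
    """Return {(alpha, beta): [(i, j), ...]} for every pair of strings.

    (i, j) means character i of string alpha may match character j of
    string beta (0-based positions).
    """
    table = {}
    for alpha, beta in combinations(range(len(strings)), 2):
        table[(alpha, beta)] = [
            (i, j)
            for i, x in enumerate(strings[alpha])
            for j, y in enumerate(strings[beta])
            if can_match(x, y)
        ]
    return table


strings = ["ctag", "cctg", "tcga"]  # illustrative only, not Figure 1
matches = possible_matches(strings)
```

Each entry of `matches` is the starting list of candidate base matches for one pair of strings, from which implausible candidates are subsequently eliminated.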

All possible string matching schemes, i.e. combinations of one to one mappings between the bases of each string for all three strings, can be constructed from the possible base matches. The task is then to determine which of all of the possible matching schemes has the greatest degree of match between all three strings; i.e. to determine from the set of all possible solutions the solution having the greatest degree of matching, which therefore is the best match between all the strings within the matching rules of the model.

The next step 220 is to determine an upper bound on the probability of a solution including a particular one of the possible base matches. Figure 3a shows a diagram 310 illustrating steps of the method. The left ordinate axis represents a measure of probability, the right ordinate axis represents the degree of matching between the strings, and the abscissa represents the set of all possible matching schemes, i.e. the set of all possible solutions to the matching problem. The line 315 shows the degree of matching between the strings for a particular matching scheme. The best match, i.e. the solution to the matching problem, is given by the matching scheme 320.

Consider the possible match of the last t base in the first DNA string with the first t base in the second DNA string.

This possible match has associated with it a set of matching schemes or solutions 330. Although shown as occupying a connected, single part of the set of possible solutions, for the sake of simplicity, the set of solutions may well be distributed about the set of all possible solutions. An upper bound 335 for the probability of the set of matching schemes 330 that include that particular possible signifier match being the solution is calculated.

Next, a threshold probability 340 is calculated in line with Bayesian probability theory. The upper bound 335 is compared 230 with the threshold. As the upper bound is less than the threshold probability, all possible matching schemes, or solutions, including that possible match can be eliminated 345 from the set of possible solutions 240, as illustrated in Figure 3b. Processing is the task of eliminating implausible matches and seeing how this affects the possible solutions. This is an iterative process. For example, elimination of an unlikely possible signifier match in itself leads to the elimination of other possible matching schemes, since they were dependent upon that signifier match for their existence in the first place.

The procedure is then repeated 250 for each of the possible signifier matches identified. Consider the possible match of the second c in the first string and the first c in the second string. There is a set of possible matching schemes 350 including this particular match. An upper bound 352 for the probability of a solution containing this match is calculated, and compared with the threshold probability 340. However, as illustrated in Figure 3b, as the upper bound on the probability of the solution containing this possible match is greater than the threshold probability, this set of possible solutions 350 is retained.

The remaining possible signifier matches are processed, for instance the set of solutions 356 containing the identified possible match of the last g in the second string and the first g in the third string, has the upper bound on its probability 358 calculated, compared with the threshold, and the set of solutions including that possible match eliminated.

Eventually the position illustrated in Figure 3d will be reached, in which all possible solutions having an upper probability bound lower than the threshold have been eliminated. This can be schematically represented by Figure 1b. A greatly reduced subset of possible signifier matches remains out of the original set of all possible signifier matches. It is determined 250 whether sufficient possible solutions have been eliminated for the remaining solutions to be evaluated individually in an acceptable time by the processing power available. The number of possible signifier matches can be reduced further by repeating the method, with a recalculated threshold probability 360. Hence when the signifier match of the first c in the first string and the second c in the second string is considered again, the upper bound on the probability of that set of solutions is less than the recalculated threshold 360 and so that possible signifier match can be eliminated.

Eventually a computationally tractable set 362 of possible signifier match schemes, or solutions, is identified out of all the possible solutions. This set of solutions 360 is exhaustively searched 260 by any conventional technique in an acceptable amount of time by the processing power available so as to determine the matching scheme 362 having the greatest degree of match between all three strings. The results of the process are then saved 270, to provide an indication of the matching scheme, as illustrated in Figure 1c, having the greatest degree of matching, or alignment, between the strings and a measure of that degree of matching.
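The elimination loop described above, including recalculation of the threshold until a tractable number of candidates remains, might be sketched as follows. This is an illustrative sketch only: the function `prune`, its parameters and the fixed threshold increment `step` are assumptions, and the `upper_bound` callable stands in for whatever bound computation is used.

```python
def prune(candidates, upper_bound, threshold, step, max_survivors):
    """Eliminate candidate matches whose upper bound is below the threshold.

    candidates    : iterable of hashable match identifiers
    upper_bound   : function(match, survivors) -> upper probability bound
    threshold     : initial threshold probability
    step          : assumed increment used when the threshold is recalculated
    max_survivors : number of candidates regarded as tractable
    """
    if step <= 0:
        raise ValueError("step must be positive")
    survivors = set(candidates)
    while True:
        changed = True
        while changed:
            # Bounds are recomputed each pass, since an elimination may
            # lower the bounds of matches that depended on it.
            doomed = {m for m in survivors
                      if upper_bound(m, survivors) < threshold}
            survivors -= doomed
            changed = bool(doomed)
        if len(survivors) <= max_survivors or threshold >= 1.0:
            return survivors, threshold
        threshold = min(1.0, threshold + step)  # recalculated threshold


# Hypothetical fixed bounds for four candidate matches.
bounds = {"m1": 0.9, "m2": 0.5, "m3": 0.1, "m4": 0.05}
kept, final_threshold = prune(bounds, lambda m, s: bounds[m],
                              threshold=0.08, step=0.1, max_survivors=2)
```

In this example the first pass removes only `m4`; the threshold is then raised once, which removes `m3` and leaves two candidates for exhaustive search.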

Through this iterative process of eliminating implausible matches, good global alignments are identified by exclusion. This is in contrast to all existing methodologies that attempt to identify a global solution directly through the propagation of best-guess pairwise alignments.

The approach uses pattern recognition based upon three key conditions:

1. Calculations are underpinned by Bayesian probability theory.

2. The method requires that all solutions (i.e., all possible alignments) be assessed.

3. Processing is resource-driven such that the calculations that can be performed are constrained by the memory available and the speed of operations required, as defined by the operator.

Mathematical aspects of the method will now be described in greater detail, with particular reference to Figures 1 and 4.

Consider a set of K strings of characters $a = \{a_1, \ldots, a_K\}$ which may represent K sequences of DNA or protein sequence data. The goal is to derive the best global alignment for the strings given some model of similarity, which may include character insertions, character deletions and substitutions.

From conditions 2 and 3 an holistic, probability theory approach is utilised, requiring:

(1) $T = \arg\max_{T \in \Phi} P(T^* = T \mid a)$

where $T = \{T_{\alpha\beta ij},\ \forall\, \alpha, \beta \in K,\ \forall\, i \in L_\alpha,\ j \in L_\beta\}$ is the binary match matrix for the strings, $\Phi$ is the space of possible global solutions for $T$, and $L_\alpha$ is the length of string $\alpha$. The two characters indexed $i$ in string $\alpha$ and $j$ in string $\beta$ are matched (aligned) in the global solution if and only if $T_{\alpha\beta ij} = 1$.

This aim is not evaluated directly, i.e., by actively searching for and refining solutions within the global solution space, this being the approach of existing gradient-based techniques. Rather, the best solutions are determined indirectly, by eliminating bad solutions from Φ. In doing so all of the solution space is implicitly examined, as required by condition 2, as follows.

Solutions are grouped together, since examining each individual solution in isolation is computationally intractable in general and would thereby break condition 3.

Consider all solutions that contain the individual match $T_{\alpha\beta ij} = 1$, say. That is, the strings $a_\alpha$ and $a_\beta$ are aligned and fixed at $a_{\alpha i} \leftrightarrow a_{\beta j}$.

The maximum probability of any one of these solutions is

(2) $U(T_{\alpha\beta ij} = 1) = \max_{T' \in \Phi'} P(T_{\alpha\beta ij} = 1, T' \mid a)$

where $T'$ denotes the matches for all characters excluding the pair under consideration, and $\Phi'$ is the space of possible solutions for this set.

Now any group of solutions whose lowest upper bound probability is below some known lower bound value, $L^{(n)}$, cannot contain the optimum solution. Therefore, we can eliminate these groups from consideration. The rule for $T_{\alpha\beta ij}$ at some iteration time $n$ is:

eliminate any solution containing the match $T_{\alpha\beta ij} = 1$ if

(3) $U(T_{\alpha\beta ij} = 1) < L^{(n)}$

By eliminating this set of solutions the size of the space which needs to be considered at the next time step is effectively reduced. That is, the new search space at time $n+1$, $\Phi^{(n+1)}$, will not contain these solutions, which will affect future processing. In relation to the alignment, if the possibility $T_{\alpha\beta ij} = 1$ is excluded, then this will affect the upper bound on other matches at the next iteration.

The computation of the upper bound has not yet been defined, and in general may be computationally expensive, thereby breaking condition 3. The solution is to identify quantities of the form $Y^{(n)}$ such that $Y^{(n)} \ge U^{(n)}$ which can be computed in a given time and using a given amount of memory. The elimination rules then become:

eliminate any solution containing the match $T_{\alpha\beta ij} = 1$ if

(4) $Y^{(n)}(T_{\alpha\beta ij} = 1) < L^{(n)}$

$Y^{(n)}$ is evaluated by combining Bayesian probability theory with rules of inequality. Its form may change over the iterative cycles in order to accommodate condition 3. For example, at the onset of processing $Y^{(n)}$ may be coarsely and quickly evaluated, but provided it obeys $Y^{(n)} \ge U^{(n)}$ then only bad solutions will be eliminated. Towards the end of processing when only a few solutions remain, a more sophisticated and computationally intensive means of computing $Y$ may be employed, such that $Y^{(n)}$ approximates $U^{(n)}$ provided condition 3 is not violated.
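The statement that only bad solutions are eliminated can be made explicit in one line. Assuming, as stated above, that the coarse quantity never falls below the true upper bound, any match removed by the coarse rule would also have been removed by the exact rule:

```latex
Y^{(n)}(T_{\alpha\beta ij}=1) \ge U^{(n)}(T_{\alpha\beta ij}=1)
\quad\text{and}\quad
Y^{(n)}(T_{\alpha\beta ij}=1) < L^{(n)}
\;\Longrightarrow\;
U^{(n)}(T_{\alpha\beta ij}=1) \le Y^{(n)}(T_{\alpha\beta ij}=1) < L^{(n)} .
```

The converse does not hold: a coarse bound may retain matches that the exact bound would eliminate, which costs further processing but never discards the optimum.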

Processing will continue until no solutions fall below the relevant threshold. At any time processing may be re-started by heuristically increasing the threshold, or alternatively, the remaining solutions may be recorded and processed in some manner.

In summary, the global solution space Φ is iteratively reduced by identifying and eliminating implausible matches. Elimination is achieved by comparing an upper bound on the probability of any global solution containing a match against a threshold. Computational overheads are addressed by using a coarseness function Y that, whilst not necessarily delivering the lowest upper bound, is sufficient for identifying inappropriate regions of the solution space.

A detailed mathematical description will now be given of the application of this invention to the alignment of DNA or protein sequence data and, in particular, of the computation of upper bound quantities for match solutions. The development leads to relatively simple expressions for these upper bound quantities.

Consider the upper bound quantity

(5) $U(T_{\alpha\beta ij} = 1) = \max_{T' \in \Phi'} P(T_{\alpha\beta ij} = 1, T' \mid a)$

i.e., the maximum probability associated with any global solution containing the individual match $a_{\alpha i} \leftrightarrow a_{\beta j}$.

Exact development is intractable due to the complex interactions between pairs. However, obtaining an upper bound is straightforward. By noting that $\max_{X,Y} P(x \in X, y \in Y)$ is upper bounded by $\max_{X} P(x \in X)\, \max_{Y} P(y \in Y)$, it follows that

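(6) $U(T_{\alpha\beta ij} = 1, T' \mid a) \le \max_{T'_{\alpha\beta}} P(T_{\alpha\beta ij} = 1, T'_{\alpha\beta} \mid a) \; \prod_{\gamma \ne \alpha,\beta} \max_{T_{\gamma\beta}} P(T_{\gamma\beta}, T_{\alpha\beta ij} = 1 \mid a) \, \max_{T_{\alpha\gamma}} P(T_{\alpha\gamma}, T_{\alpha\beta ij} = 1 \mid a) \; \prod_{\gamma \ne \alpha} \prod_{\delta \ne \beta,\, \delta < \gamma} \max_{T_{\gamma\delta}} P(T_{\gamma\delta}, T_{\alpha\beta ij} = 1 \mid a)$

where $T_{\alpha\beta}$ is the set of matches between strings $\alpha$ and $\beta$. The expression in (6) can be expanded to make explicit the contribution from the characters under consideration:

(7) $U(T_{\alpha\beta ij} = 1, T' \mid a) \le \max_{T'_{\alpha\beta}} P(T_{\alpha\beta ij} = 1, T'_{\alpha\beta} \mid a) \; \prod_{\gamma \ne \alpha,\beta} \max_{k \in L_{\alpha i \gamma}} \max_{T'_{\alpha\gamma}} P(T_{\alpha\gamma ik} = 1, T'_{\alpha\gamma} \mid a) \, \max_{k \in L_{\beta j \gamma}} \max_{T'_{\gamma\beta}} P(T_{\gamma\beta kj} = 1, T'_{\gamma\beta} \mid a) \; \prod_{\gamma \ne \alpha} \prod_{\delta \ne \beta,\, \delta < \gamma} \max_{k \in L_{\alpha i \gamma},\, l \in L_{\beta j \delta}} \max_{T'_{\gamma\delta}} P(T_{\gamma\delta kl} = 1, T'_{\gamma\delta} \mid a)$

where $L_{\alpha i \beta}$ is the list of possible matches in sequence $\beta$ for character $i$ in string $\alpha$.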

According to (7) an upper bound probability for a global solution containing the individual match $T_{\alpha\beta ij} = 1$ is evaluated by considering the alignments for pairs of sequences. In the method, the assessments are made simultaneously over all possible pairs of sequences and, further, they are used to compute an upper bound on all plausible global solutions rather than to assess and refine a few sub-optimal solutions.

The quantity of interest in (7) is of the form $\max_{T'_{\alpha\beta}} P(T_{\alpha\beta ij} = 1, T'_{\alpha\beta} \mid a)$, that is, the maximum probability of a pairwise solution given an individual match.

In order to develop this quantity, consider all possible alignments of strings $\alpha$ and $\beta$ that contain the individual match $T_{\alpha\beta ij} = 1$. That is, the strings are aligned and fixed at $a_{\alpha i} \leftrightarrow a_{\beta j}$, and it is necessary to compute the upper bound on any pairwise solution that contains this match. We use the notation

(8) $u(T_{\alpha\beta ij} = 1) = \max_{T'_{\alpha\beta}} P(T_{\alpha\beta ij} = 1, T'_{\alpha\beta} \mid a)$

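By applying Bayes' rule and considering substrings on the left and right of the line defined by $T_{\alpha\beta ij} = 1$, this can be rewritten as:

(9) $u(T_{\alpha\beta ij} = 1) = L(T_{\alpha\beta ij} = 1)\, R(T_{\alpha\beta ij} = 1)\, P(T_{\alpha\beta ij} = 1 \mid a) / p(a)$

where $L(T_{\alpha\beta ij} = 1)$ is the contribution from characters to the left-hand side and is given by

(10) $L(T_{\alpha\beta ij} = 1) = \max_{T^{\leftarrow}_{\alpha\beta ij}} P(T_{\alpha\beta ij} = 1, T^{\leftarrow}_{\alpha\beta ij} \mid a)$

where $T^{\leftarrow}_{\alpha\beta ij}$ is shorthand for these assignments, and likewise $R(T_{\alpha\beta ij} = 1)$ is the contribution from characters to the right-hand side given by

(11) $R(T_{\alpha\beta ij} = 1) = \max_{T^{\rightarrow}_{\alpha\beta ij}} P(T_{\alpha\beta ij} = 1, T^{\rightarrow}_{\alpha\beta ij} \mid a)$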

The contribution made by the left hand side, $L(T_{\alpha\beta ij} = 1)$, is developed. Development for the right hand side is immediate by analogy. Consider when a match on moving leftwards from $a_{\alpha i} \leftrightarrow a_{\beta j}$ might next be encountered. In doing so it is necessary to take into account that gaps may be introduced in the strings.

With reference to Figures 4a, b, c and d there are four cases: (i) no gap to the match, i.e., $a_{\alpha, i-1} \leftrightarrow a_{\beta, j-1}$, (ii) a gap in $a_\alpha$ but not in $a_\beta$, (iii) a gap in $a_\beta$ but not in $a_\alpha$ or (iv) no further match.

These cases are exhaustive and mutually exclusive. It is therefore possible to consider each in turn and look for the maximum response in (8).

Case 1: No Gap

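In this case the hypothesis $T_{\alpha\beta, i-1, j-1} = 1$ is made, so the contribution from the left hand side is

(12) $L(T_{\alpha\beta ij} = 1) = \max_{T^{\leftarrow}_{\alpha\beta, i-1, j-1}} P(T_{\alpha\beta, i-1, j-1} = 1, T^{\leftarrow}_{\alpha\beta, i-1, j-1} \mid T_{\alpha\beta ij} = 1, a)$

which, by assuming that information about the match $a_{\alpha i} \leftrightarrow a_{\beta j}$ is redundant if a nearer match is to hand, i.e., $a_{\alpha, i-1} \leftrightarrow a_{\beta, j-1}$, gives

(13) $L(T_{\alpha\beta ij} = 1) = \max_{T^{\leftarrow}_{\alpha\beta, i-1, j-1}} P(T^{\leftarrow}_{\alpha\beta, i-1, j-1} \mid T_{\alpha\beta, i-1, j-1} = 1, a)\, P(T_{\alpha\beta, i-1, j-1} = 1 \mid T_{\alpha\beta ij} = 1, a)$

which using (8) leads to the recursive rule:

$L(T_{\alpha\beta ij} = 1) = L(T_{\alpha\beta, i-1, j-1} = 1)\, P(T_{\alpha\beta, i-1, j-1} = 1 \mid T_{\alpha\beta ij} = 1, a_{\alpha, i-1}, a_{\beta, j-1})$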

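Case 2: Gap in $a_\beta$

In this case it is necessary that the nearest match for $a_{\alpha, i-1}$ exists at some point $k < j-1$ in $a_\beta$. It is necessary to consider all possible non-zero gap lengths:

(14) $L(T_{\alpha\beta ij} = 1) = \max_{k < j-1} \max_{T^{\leftarrow}_{\alpha\beta, i-1, k}} P(T_{\alpha\beta, i-1, k} = 1, T^{\leftarrow}_{\alpha\beta, i-1, k} \mid T_{\alpha\beta ij} = 1, a)$

which becomes

(15) $L(T_{\alpha\beta ij} = 1) = \max_{k < j-1} \max_{T^{\leftarrow}_{\alpha\beta, i-1, k}} P(T^{\leftarrow}_{\alpha\beta, i-1, k} \mid T_{\alpha\beta, i-1, k} = 1, a)\, P(T_{\alpha\beta, i-1, k} = 1 \mid T_{\alpha\beta ij} = 1, a)$

leading to the recursive rule:

(16) $L(T_{\alpha\beta ij} = 1) = \max_{k < j-1} L(T_{\alpha\beta, i-1, k} = 1)\, P(T_{\alpha\beta, i-1, k} = 1 \mid T_{\alpha\beta ij} = 1, a_{\alpha, i-1}, a_{\beta k})$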

Case 3: Gap in $a_\alpha$

By analogy with case 2 the rule is

(17) $L(T_{\alpha\beta ij} = 1) = \max_{k < i-1} L(T_{\alpha\beta, k, j-1} = 1)\, P(T_{\alpha\beta, k, j-1} = 1 \mid T_{\alpha\beta ij} = 1, a_{\alpha k}, a_{\beta, j-1})$

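Case 4: No match

If there is no match remaining then standard models adopt the form

(18) $L(T_{\alpha\beta ij} = 1) = \max\{c^{\,i-1}, c^{\,j-1}\}$

where $c$ is a constant. In order to evaluate $L(T_{\alpha\beta ij} = 1)$ a model is needed for $P(T_{\alpha\beta kl} = 1 \mid T_{\alpha\beta ij} = 1, a_{\alpha k}, a_{\beta l})$, noting that $\{k, l\}$ indexes the next match to the left of $\{i, j\}$.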

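By applying Bayes' rule and re-organising:

(19) $P(T_{\alpha\beta kl} = 1 \mid T_{\alpha\beta ij} = 1, a_{\alpha k}, a_{\beta l}) = p(a_{\alpha k} \mid a_{\beta l}, T_{\alpha\beta kl} = 1, T_{\alpha\beta ij} = 1)\, P(T_{\alpha\beta kl} = 1 \mid a_{\beta l}, T_{\alpha\beta ij} = 1) / p(a_{\alpha k} \mid a_{\beta l}, T_{\alpha\beta ij} = 1)$

which becomes

(20) $P(T_{\alpha\beta kl} = 1 \mid T_{\alpha\beta ij} = 1, a_{\alpha k}, a_{\beta l}) = p(a_{\alpha k} \mid a_{\beta l}, T_{\alpha\beta kl} = 1)\, P(T_{\alpha\beta kl} = 1 \mid T_{\alpha\beta ij} = 1) / p(a_{\alpha k} \mid a_{\beta l}, T_{\alpha\beta ij} = 1)$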

Conventionally the measurement distribution, $p(a_{\alpha k} \mid a_{\beta l}, T_{\alpha\beta kl} = 1)$, is modelled via a PAM weight matrix, with entries of the form $\mathrm{sub}(a_{\alpha k}, a_{\beta l}) = \exp(s(a_{\alpha k}, a_{\beta l}))$, for example, penalising different substitutions, and the transition probability, $P(T_{\alpha\beta kl} = 1 \mid T_{\alpha\beta ij} = 1)$, between nearest matches is modelled by a gap penalty function, $\mathrm{gap}(\Delta l) = \exp(g(\Delta l))$, for example, dependent on gap length with a constant for gap opening. The denominator is assumed constant. With these models the left hand side contribution in (10) is the maximum over the four cases:

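(21) $L(T_{\alpha\beta ij} = 1) = \max\{\; L(T_{\alpha\beta, i-1, j-1} = 1)\, \mathrm{sub}(a_{\alpha, i-1}, a_{\beta, j-1}),\;\; \max_{k < j-1} L(T_{\alpha\beta, i-1, k} = 1)\, \mathrm{gap}(j-k-1),\;\; \max_{k < i-1} L(T_{\alpha\beta, k, j-1} = 1)\, \mathrm{gap}(i-k-1),\;\; \max\{c^{\,i-1}, c^{\,j-1}\} \;\}$

By analogy, the contribution from the right hand side is

(22) $R(T_{\alpha\beta ij} = 1) = \max\{\; R(T_{\alpha\beta, i+1, j+1} = 1)\, \mathrm{sub}(a_{\alpha, i+1}, a_{\beta, j+1}),\;\; \max_{k > j+1} R(T_{\alpha\beta, i+1, k} = 1)\, \mathrm{gap}(k-j-1),\;\; \max_{k > i+1} R(T_{\alpha\beta, k, j+1} = 1)\, \mathrm{gap}(k-i-1),\;\; \max\{c^{\,N-i-1}, c^{\,N-j-1}\} \;\}$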

Note that these are recursive formulae and each involves an accumulation over at most $L_\alpha + L_\beta$ entries. The evaluation over all possible matches is therefore $O(L_\alpha L_\beta (L_\alpha + L_\beta))$ and is realised by computing $L(T_{\alpha\beta ij} = 1)$ and $R(T_{\alpha\beta ij} = 1)$ for all $i \in L_\alpha$ and for all $j \in L_\beta$, and substituting into (9):

(23) $u(T_{\alpha\beta ij} = 1) = L(T_{\alpha\beta ij} = 1)\, R(T_{\alpha\beta ij} = 1)\, \mathrm{sub}(a_{\alpha i}, a_{\beta j})$

A further saving in complexity can be achieved if the gap penalty function is linear, since this allows further recursion and reduces the complexity of assessing all matches in a pair of strings to $O(L_\alpha L_\beta)$.
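To illustrate the saving, suppose (as an assumption for this sketch) that the gap weight has the form gap(Δ) = A·r^Δ, i.e. g(Δ) is linear in Δ. The inner maximum over k for one row can then be carried forward from column to column in constant time instead of being rescanned. The function names and the 1-based `row` layout below are hypothetical and chosen only for the illustration.

```python
import math

def gap_max_bruteforce(row, A, r):
    """For each j, max over k < j-1 of row[k] * A * r**(j-k-1); row[0] is unused."""
    m = len(row) - 1
    out = [0.0] * (m + 1)
    for j in range(1, m + 1):
        out[j] = max((row[k] * A * r ** (j - k - 1) for k in range(1, j - 1)),
                     default=0.0)
    return out

def gap_max_linear(row, A, r):
    """Same values in one pass, using a running maximum that decays by r per column."""
    m = len(row) - 1
    out = [0.0] * (m + 1)
    running = 0.0  # max over k < j-1 of row[k] * r**(j-k-1), without the factor A
    for j in range(1, m + 1):
        if j >= 3:
            # Older candidates gain one more factor of r; row[j-2] enters with gap length 1.
            running = max(running * r, row[j - 2] * r)
        out[j] = A * running
    return out

row = [0.0, 0.3, 0.9, 0.1, 0.5, 0.7]
slow = gap_max_bruteforce(row, A=0.4, r=0.6)
fast = gap_max_linear(row, A=0.4, r=0.6)
```

Replacing each inner scan in the recursion with such a running maximum removes the factor (L_α + L_β) from the cost, which is the reduction the text refers to.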

We now return to the upper bound in (7) on all global solutions given an individual match. Taking logarithms leads to the formula for scoring hypothesised matches:

(24) $S^{(n)}(T_{\alpha\beta ij}=1) \le s^{(n)}(T_{\alpha\beta ij}=1) + \sum_{\gamma\ne\alpha,\beta}\left[\max_{k\in L^{(n)}_{\alpha i\gamma}} s^{(n)}(T_{\alpha\gamma ik}=1) + \max_{k\in L^{(n)}_{\gamma j\beta}} s^{(n)}(T_{\gamma\beta kj}=1)\right] + \sum_{\gamma\ne\alpha}\ \sum_{\delta\ne\beta,\,\delta>\gamma}\ \max_{k\in L_\gamma,\ l\in L^{(n)}_{\gamma k\delta}} s^{(n)}(T_{\gamma\delta kl}=1)$

where S = log U and s = log u, respectively, and the suffix n has been introduced to indicate that we have an iterative process in which the lists of plausible matches, L, and in consequence the scores, s and S, change over time. For example, as one match is eliminated, this affects the left-hand and right-hand contributions to other matches. In this way, information about elimination propagates throughout the system.

In practice, at the outset of processing only the first term in the sum might be used, i.e., only the score $s(T_{\alpha\beta ij}=1)$ would be computed, whilst all other terms would be set at their maximum. While this would lead to an overestimated upper bound, it would allow very many obviously bad matches to be eliminated. Over time other terms would be added as resources allow.

With reference to Figure 5, there is shown a schematic diagram of a computer system 500 according to the invention. The system includes a main processor, or processors 510, in communication with fast access memory 520 including RAM 524 and ROM 522 parts. The system includes input and output devices 530 and can be in communication with other computers via a network interface 540. A mass storage device 550, such as a hard disk, is also provided for storing files including data to be processed and data that has been processed.

An aspect of the invention is a computer program implementing the method as described above. The details of a suitable computer program are considered to be within the ability of a man of ordinary skill in the art, in view of the foregoing description of the method, and so have not been described in any detail. A general outline of the significant procedural steps to be implemented by a suitable computer program is provided below.

    Read sequence data a_1, ..., a_K

    10 For each string α <= K {
         For each string β > α {
           For each character i < L_α {
             For each character j in L_αβi {
               Compute s(T_αβij = 1)
             }
           }
         }
       }

       For each string α <= K {
         For each string β > α {
           For each character i < L_α {
             For each character j in L_αβi {
               Compute S(T_αβij = 1)
               If S(T_αβij = 1) < threshold eliminate j from list L_αβi
             }
           }
         }
       }

       If change in lists go to 10

       StandardRefineAlignment

Data 552 representing the DNA strings to be matched is stored in a file 555 accessible by the processor. A file 557 is also provided for storing the final results of the processing, such as the matching solution identified and an indication of the degree of match of the strings. The software controlling operation of the processor can also be stored on the mass storage device, and the RAM is used for short-term storage of data during processing of the matching method.
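The elimination loop in the procedural outline (the block labelled 10, repeated while any list changes) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program of the invention: `prune_matches` is a hypothetical name, the candidate lists are held in a dictionary keyed by (α, β, i), the computation of s and S is folded into a caller-supplied `upper_bound` function, and `toy_bound` is a purely illustrative stand-in for the score S that treats a match (i, j) as supported only if (i+1, j+1) is still a candidate. The final refinement step is not shown.

```python
def prune_matches(candidates, upper_bound, threshold):
    """Repeatedly eliminate matches whose upper bound falls below the threshold.

    candidates: dict mapping (alpha, beta, i) -> set of plausible positions j.
    upper_bound: callable(alpha, beta, i, j, candidates) -> score S (assumed form).
    Returns the number of passes made over the lists.
    """
    passes = 0
    changed = True
    while changed:                      # "If change in lists go to 10"
        changed = False
        passes += 1
        for (alpha, beta, i), js in candidates.items():
            for j in sorted(js):        # sorted() copies, so discarding is safe
                if upper_bound(alpha, beta, i, j, candidates) < threshold:
                    js.discard(j)       # eliminate j from the list for (alpha, beta, i)
                    changed = True
    return passes

def toy_bound(alpha, beta, i, j, candidates):
    # Illustrative only: a match is "supported" if it is on the last character
    # or if its diagonal successor (i+1, j+1) is still a candidate.
    nxt = (alpha, beta, i + 1)
    supported = nxt not in candidates or (j + 1) in candidates[nxt]
    return 0.0 if supported else -10.0

cands = {(0, 1, 0): {0, 5}, (0, 1, 1): {1, 6}, (0, 1, 2): {2}}
passes = prune_matches(cands, toy_bound, threshold=-5.0)
```

In this toy run the candidate 6 for character 1 is eliminated on the first pass, which removes the support for candidate 5 of character 0 on the second pass; a third pass finds no change and the loop stops. This mirrors how information about an elimination propagates to other matches.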

The act of aligning two or more sequences together provides useful information on the sequences, provided that the sequences can be compared in a biologically meaningful manner.

Once constructed, a multiple sequence alignment, whether made of DNA or amino acids, can yield information simply not present in a single sequence. Such alignments can be used to compare a number of very similar sequences to see where they are similar and where they differ. Similarities may signify group characteristics responsible for similarities in protein structure and hence behaviour, whereas differences may signify mutations which lead to unwanted structure and behaviour, perhaps causing some genetic medical disorder, for instance. By identifying such mutations it may be possible to apply corrective measures to eliminate the mutations when present and hence the unwanted effects.

Multiple alignments can also be used as input to phylogenetic analysis programs, to study the evolutionary relationships between sequences, and between organisms. They can also pinpoint areas either particularly conserved or particularly divergent between related sequences. This in turn can yield information on the evolutionary processes undergone by those sequences.

Furthermore, such alignments at the protein level, when used as input to suitable protein modelling software, can help in understanding, and predicting, the structure of the protein in a way that individual sequences simply cannot do.

Although the method has been described with reference to matching strings of DNA bases, it will be appreciated that it can be applied to determining the best match between any sequences of signifiers which represent physical or non-physical entities. For instance, the method can be applied in the study of languages to determine the similarities between groups of words or characters in the same or different languages. The method can be applied to the analysis of powder diffraction patterns so as to determine the similarity of structures of materials by comparing the degree of match of representations of their diffraction patterns.

If only two strings of signifiers are used, then the method can determine how closely the two match. If an entity and its representative string are known, then the method can be used to identify an unknown entity by determining the match between its string and the string of the known entity. If the match is perfect, then the two entities will be identical. Provided the properties of an entity can be represented by a string of signifiers, the method can be used to compare two or more of the entities.

Claims

CLAIMS :

1. A method of determining the degree of match between a plurality of strings of signifiers, comprising the steps of:

(i) identifying a possible signifier match between two signifiers in different strings;

(ii) determining an upper bound to the probability of any of the possible solutions containing the possible signifier match;

(iii) comparing the upper probability bound for the possible solutions with a threshold probability; and

(iv) eliminating from the set of possible match solutions those solutions including the possible signifier match if the upper probability bound for the possible solutions is less than the threshold probability.

2. A method as claimed in claim 1, in which the upper bound is determined according to Bayesian probability theory.

3. A method as claimed in claim 1, in which the step of identifying a possible signifier match is repeated so as to identify all possible signifier matches between a first signifier in a first string and each signifier in each of the other plurality of strings.

4. A method as claimed in claim 3, in which the step of identifying a possible signifier match is repeated for each signifier in each of the plurality of strings so as to identify all possible signifier matches.

5. A method as claimed in claim 3 or claim 4, and including the step of repeating steps (iii) to (iv) for all the possible signifier matches identified.

6. A method as claimed in claim 1, in which only an identical signifier is a possible signifier match.

7. A method as claimed in claim 1, in which more than one signifier is a possible signifier match.

8. A method as claimed in claim 1, and including the step of recalculating the threshold.

9. A method as claimed in claim 1, and including the step of determining whether an acceptable number of possible solutions have been determined.

10. A method as claimed in claim 9, and including the step of recalculating the threshold if an acceptable number of possible solutions have not been determined.

11. A method as claimed in claim 1, in which at least two of the plurality of strings have the same number of signifiers.

12. A method as claimed in claim 11, in which each of the plurality of strings has the same number of signifiers.

13. A method as claimed in claim 1, in which each of the plurality of strings represents a sequence of DNA, or a sequence of amino acids that comprise a protein or proteins, or glycoconjugates.

14. A method as claimed in claim 1, in which each signifier represents a base.

15. A computer system for determining the degree of match between a plurality of strings of signifiers, comprising processing means, the processing means operating on data representing a plurality of strings of signifiers to:

(i) identify a possible signifier match between two signifiers in different strings;

(ii) determine an upper bound to the probability of any of the possible solutions containing the possible signifier match;

(iii) compare the upper probability bound for the possible solutions with a threshold probability; and

(iv) eliminate from the set of possible match solutions those solutions including the possible signifier match if the upper probability bound for the possible solutions is less than the threshold probability.

16. A computer program executable on a computer to carry out a method as claimed in claim 1.

17. A computer program executable on a computer to provide a system as claimed in claim 15.

18. A computer readable medium bearing instructions executable on a computer to carry out a method as claimed in claim 1.

19. Computer program code, including instructions to carry out a method as claimed in claim 1.