WO2001061557A2 - Sequence matching - Google Patents

Publication number
WO2001061557A2
WO2001061557A2 PCT/GB2001/000631 GB0100631W WO0161557A2 WO 2001061557 A2 WO2001061557 A2 WO 2001061557A2 GB 0100631 W GB0100631 W GB 0100631W WO 0161557 A2 WO0161557 A2 WO 0161557A2
Authority
WO
WIPO (PCT)
Prior art keywords
match
signifier
solutions
strings
probability
Prior art date
Application number
PCT/GB2001/000631
Other languages
French (fr)
Other versions
WO2001061557A3 (en)
Inventor
Michael Turner
Simon Moss
Paul Zanelli
Original Assignee
Pc Multimedia Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from PCT/GB2000/000492 (WO2000049527A1)
Priority claimed from GB0020743A (GB0020743D0)
Application filed by Pc Multimedia Limited
Priority to AU2001233858A (AU2001233858A1)
Publication of WO2001061557A2
Publication of WO2001061557A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/768: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns

Definitions

  • the present invention relates to matching sequences of signifiers, and in particular to a method and system for determining the degree of match between multiple sequences of signifiers.
  • the task of multiple sequence alignment can be understood with reference to figure 1, which shows three sequences of DNA data, each comprising 8 signifiers representing 8 bases.
  • the goal is to find the best possible alignment between all three strings, given some model of the similarity between strings, which may include character insertions, character deletions and substitutions, and to provide some indication of the degree to which all three strings match each other.
  • it is standard practice to use dynamic programming. This guarantees a mathematically optimal alignment, given a table of scores for matches and mismatches between all characters and penalties for insertions or deletions of different lengths.
  • the quality of the alignments may be good. In more difficult cases, the alignments may give starting points for further automatic or manual refinement.
  • the present invention relates to a new approach to determining the degree of match between multiple sequences of signifiers, which is fast and gives good alignments of sequences under a wide range of realistic conditions.
  • a method of determining the degree of match between a plurality of strings of signifiers comprising the steps of:
  • a suitable means of computing an upper bound probability for regions of the solution space is defined.
  • regions with low upper bounds are eliminated by comparison with a threshold, and then effort is re-applied to those regions that remain.
  • the size of the regions covering the remaining space can be reduced without compromising resources, and more accurate upper bounds can be evaluated. In this way, the optimal solution can be identified through a process of exclusion.
  • the method and system are particularly suitable for determining the degree of matching between multiple sequences of DNA or proteins.
  • the invention is not limited to DNA or protein sequence matching, and can be applied in any field in which it is desired to determine the degree of match between multiple sequences of signifiers which represent either a physical entity (e.g. a base) or a non-physical entity (e.g. a word).
  • the term signifier is considered to encompass all ways of representing an item in a sequence of items.
  • the invention can be used to determine the degree of match between scan lines in an image, by matching rows of pixels in one image to rows of pixels in a second image in order to determine correspondences between the two images.
  • a suitable signifier in this case would be a measurement vector, giving the displacement of a pixel from an origin.
  • Strings of any type of signifiers can be matched because the sequence alignment invention is an order-preserving string matcher. So any suitable signifier can be used provided it can be used to determine the position of an element in the string of elements. Further, the invention can use any model for the similarity between two signifiers in different strings.
  • the upper bound is determined according to Bayesian probability theory.
  • the step of identifying a possible signifier match is repeated so as to identify all possible signifier matches between a first signifier in a first string and each signifier in each of the other plurality of strings.
  • the step of identifying a possible signifier match is repeated for each signifier in each of the plurality of strings so as to identify all possible signifier matches.
  • the method is applied simultaneously to all possible sequence alignments. The method ensures that all plausible alignments are examined. Processing is effectively the task of eliminating implausible match schemes so as to home in on the best match scheme, ie the solution, through a process of exclusion.
  • the method includes the step of repeating steps (iii) and (iv) for all the possible signifier matches identified.
  • more than one signifier is a possible signifier match.
  • the method includes the step of recalculating the threshold. In this way different regions of phase space can be investigated more accurately, if the difference between local solutions is not sufficient to determine the best solution.
  • the method includes the step of determining whether an acceptable number of possible global solutions have been determined. Once a tractable number of possible solutions have been determined, the main body of the method can be terminated.
  • the method includes the step of recalculating the threshold if an acceptable number of possible global solutions have not been determined. This helps to allow the best solution to be identified by enabling local solutions to be distinguished.
  • At least two of the plurality of strings can have the same number of signifiers.
  • Each of the plurality of strings can have the same number of signifiers.
  • Each of the plurality of strings can represent a sequence of DNA, or a protein or proteins.
  • Each signifier can represent a base.
  • a computer system for determining the degree of match between a plurality of strings of signifiers comprising processing means, and the processing means operating on data representing a plurality of strings of signifiers, to:
  • Figure 1 shows a schematic diagram illustrating the matching method according to the invention being applied to three strings of signifiers representing DNA sequences
  • Figure 2 shows a flow chart illustrating the method according to the present invention
  • Figure 3 shows a sequence of diagrams illustrating the probability of all the possible solutions
  • Figure 4 shows pairs of strings of signifiers illustrating aspects of a matching model used in the method
  • Figure 5 shows a schematic diagram of a sequence matching computer system according to a further aspect of the invention.
  • Figure 1 shows representations of three DNA sequences a1, a2, a3 each comprising a string 110, 120, 130 of eight bases represented by the signifiers, or characters, c, t, a and g.
  • the problem addressed is to determine the actual degree of match, similarity or alignment, between the three DNA strings, from all the possible matching schemes that the rules of a model permit.
  • the rules of the DNA model also permit substitutions of some bases by others, so that in this case, the match a↔g is also a possible match.
  • the model also permits for spacings in the sequences so that a possible match base in a second string does not have to be in the same position as a base in a first string in order for there to be a degree of matching. However, the further out of position the possible match base, the lesser the degree of matching.
  • the strength of matching as a function of the position of the base is a feature of the model used.
  • a first step 210 in the matching method of the invention all possible matching schemes that are allowed within the rules of the model are identified.
  • the leftmost signifier, or character, representing a c base in the first string of DNA a1 is considered and all its possible matches with all the bases in the second string of DNA a2 are identified.
  • as the only possible match is to another c base, only matches to the first and second bases in the second string are identified.
  • the match to the first c base in the second string is stronger than the match to the second c base, as their relative positions in the strings are the same.
  • the possible matches for the next base in the first string are identified. This is then repeated for each base in the string until all the possible signifier matches for all the signifiers in the first string with the second string have been identified. Note that owing to the allowable substitution a↔g, the penultimate base in the first string has possible matches to both a and g bases in the second string.
  • This procedure is then repeated for the second DNA string a2 with respect to the third DNA string a3 so as to identify all possible base matches between the second and third strings.
  • All possible string matching schemes ie combinations of one to one mappings between the bases of each string for all three strings, can be constructed from the possible base matches.
  • the task is then to determine which of all of the possible matching schemes has the greatest degree of match between all three strings; ie to determine from the set of all possible solutions the solution having the greatest degree of matching, which therefore is the best match between all the strings within the matching rules of the model.
  • the next step 220 is to determine an upper bound on the probability of a solution including a particular one of the possible base matches.
  • Figure 3a shows a diagram 310 illustrating steps of the method.
  • the left ordinate axis represents a measure of probability
  • the right ordinate axis represents the degree of matching between the strings
  • the abscissa represents the set of all possible matching schemes, ie the set of all possible solutions to the matching problem.
  • the line 315 shows the degree of matching between the strings for a particular matching scheme. The best match, ie the solution to the matching problem, is given by the matching scheme 320.
  • This possible match has associated with it a set of matching schemes or solutions 330. Although shown as occupying a connected, single part of the set of possible solutions, for the sake of simplicity, the set of solutions may well be distributed about the set of all possible solutions. An upper bound 335 for the probability of the set of matching schemes 330 that include that particular possible signifier match being the solution is calculated.
  • a threshold probability 340 is calculated in line with Bayesian probability theory.
  • the upper bound 335 is compared 230 with the threshold. As the upper bound is less than the threshold probability, all possible matching schemes, or solutions, including that possible match can be eliminated 345 from the set of possible solutions 240, as illustrated in Figure 3b. Processing is the task of eliminating implausible matches and seeing how this affects the possible solutions. This is an iterative process. For example, elimination of an unlikely possible signifier match in itself leads to the elimination of other possible matching schemes, since they were dependent upon that signifier match for their existence in the first place.
  • the procedure is then repeated 250 for each of the possible signifier matches identified.
  • consider the possible match of the second c in the first string and the first c in the second string. There is a set of possible matching schemes 350 including this particular match.
  • An upper bound 352 for the probability of a solution containing this match is calculated, and compared with the threshold probability 340.
  • this set of possible solutions 350 is retained.
  • the remaining possible signifier matches are processed, for instance the set of solutions 356 containing the identified possible match of the last g in the second string and the first g in the third string, has the upper bound on its probability 358 calculated, compared with the threshold, and the set of solutions including that possible match eliminated.
  • Eventually the position illustrated in Figure 3d will be reached, in which all possible solutions having an upper probability bound lower than the threshold have been eliminated.
  • Figure 1b: A greatly reduced subset of possible signifier matches remains out of the original set of all possible signifier matches. It is determined 250 whether sufficient possible solutions have been eliminated for the remaining solutions to be evaluated individually in an acceptable time by the processing power available. The number of possible signifier matches can be reduced further by repeating the method, with a recalculated threshold probability 360. Hence when the signifier match of the first c in the first string and the second c in the second string is considered again, the upper bound on the probability of that set of solutions is less than the recalculated threshold 360 and so that possible signifier match can be eliminated.
  • a computationally tractable set 362 of possible signifier match schemes, or solutions is identified out of all the possible solutions.
  • This set of solutions 360 is exhaustively searched 260 by any conventional technique in an acceptable amount of time by the processing power available so as to determine the matching scheme 362 having the greatest degree of match between all three strings.
  • the results of the process are then saved 270, to provide an indication of the matching scheme, as illustrated in Figure 1c, having the greatest degree of matching, or alignment, between the strings and a measure of that degree of matching.
  • the approach uses pattern recognition based upon three key conditions:
  • Processing is resource-driven such that the calculations that can be performed are constrained by the memory available and the speed of operations required, as defined by the operator.
  • $T = \{T^{\alpha\beta}_{ij}\}$, for all $\alpha, \beta \in K$, for all $i \le L_\alpha$, $j \le L_\beta$, is the binary match matrix for the strings
  • is the space of possible global solutions for T
  • $L_\alpha$ is the length of string $\alpha$.
  • T' denotes the matches for all characters excluding the pair under consideration
  • ⁇ 1 is the space of possible solutions for this set.
  • Processing will continue until no solutions fall below the relevant threshold. At any time processing may be re-started by heuristically increasing the threshold, or alternatively, the remaining solutions may be recorded and processed in some manner.
  • the global solution space is iteratively reduced by identifying and eliminating implausible matches. Elimination is achieved by comparing an upper bound on the probability of any global solution containing a match against a threshold. Computational overheads are addressed by using a coarseness function Y that, whilst not necessarily delivering the lowest upper bound, is sufficient for identifying inappropriate regions of the solution space.
  • $T^{\alpha\beta}$ is the set of matches between strings $\alpha$ and $\beta$.
  • the expression in (7) can be expanded to make explicit contribution from the characters under consideration:
  • $L^{\alpha\beta}_{i}$ is the list of possible matches in sequence $\beta$ for character $i$ in string $\alpha$.
  • the assessments are made simultaneously over all possible pairs of sequences, and further, that they are used to compute an upper bound on all plausible global solutions rather than to assess and refine a few sub-optimal solutions.
  • the quantity of interest in (7) is of the form max T ' ⁇ P , that is, the maximum probability of a pairwise solution given an individual match.
  • a further saving in complexity can be achieved if the gap penalty function is linear, since this allows further recursion, reducing the complexity of assessing all matches in a pair of strings to $O(L_\alpha L_\beta)$.
  • in Figure 5 there is shown a schematic diagram of a computer system 500 according to the invention.
  • the system includes a main processor, or processors 510, in communication with fast access memory 520 including RAM 524 and ROM 522 parts.
  • the system includes input and output devices 530 and can be in communication with other computers via a network interface 540.
  • a mass storage device 550 such as a hard disk, is also provided for storing files including data to be processed and data that has been processed.
  • An aspect of the invention is a computer program implementing the method as described above.
  • the details of a suitable computer program are considered to be within the ability of a man of ordinary skill in the art, in view of the foregoing description of the method, and so have not been described in any detail.
  • a general outline of the significant procedural steps to be implemented by a suitable computer program is provided below.
  • Data 552 representing the DNA strings to be matched is stored in a file 555 accessible by the processor.
  • a file 557 is also provided for storing the final results of the processing such as the matching solution identified and an indication of the degree of match of the strings.
  • the software controlling operation of the processor can also be stored on the mass storage device and in RAM used for short term storage of data during processing of the matching method.
  • the act of aligning two or more sequences together provides useful information on the sequences, provided that the sequences can be compared in a biologically meaningful manner.
  • a multiple sequence alignment can yield information simply not present in a single sequence.
  • Such alignments can be used to compare a number of very similar sequences to see where they are similar and where they differ. Similarities may signify group characteristics responsible for similarities in protein structure and hence behaviour, whereas differences may signify mutations which lead to unwanted structure and behaviour, perhaps causing some genetic medical disorder, for instance. By identifying such mutations it may be possible to apply corrective measures to eliminate the mutations when present and hence the unwanted effects.
  • Multiple alignments can also be used as input to phylogenetic analysis programs, to study the evolutionary relationships between sequences, and between organisms. They can also pinpoint areas either particularly conserved or particularly divergent between related sequences. This in turn can yield information on the evolutionary processes undergone by those sequences.
  • although the method has been described with reference to matching strings of DNA bases, it will be appreciated that it can be applied to determining the best match between any sequences of signifiers which represent physical or non-physical entities.
  • the method can be applied in the study of languages to determine the similarities between groups of words or characters in the same or different languages.
  • the method can be applied to the analysis of powder diffraction patterns so as to determine the similarity of structures of materials by comparing the degree of match of representations of their diffraction patterns.
  • the method can determine how closely the two match. If an entity and its representative string are known, then the method can be used to identify an unknown entity by determining the match between its string and the string of the known entity. If the match is perfect, then the two entities will be identical. Provided the properties of an entity can be represented by a string of signifiers, the method can be used to compare two or more of the entities.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method of determining the degree of match between a plurality of strings of signifiers. The method comprises the steps of: (i) identifying a possible signifier match between two signifiers in different strings; (ii) determining an upper bound to the probability of any of the possible solutions containing the possible signifier match; (iii) comparing the upper probability bound for the possible solutions with a threshold probability; and (iv) eliminating from the set of possible match solutions those solutions including the possible signifier match if the upper probability bound for the possible solutions is less than the threshold probability.

Description

Sequence Matching
The present invention relates to matching sequences of signifiers, and in particular to a method and system for determining the degree of match between multiple sequences of signifiers.
The particular application of the invention in the field of DNA or protein sequence alignment will be discussed merely by way of an example of an application of the invention.
The simultaneous alignment of DNA or protein sequences is now an important part of molecular biology. Multiple alignments are used (1) to find diagnostic patterns to characterise protein families, (2) to detect or demonstrate homology between new sequences and existing families of sequences, (3) to help predict the secondary and tertiary structures of new sequences and (4) as an essential prelude to molecular evolutionary analysis.
The rate of appearance of new sequence data is increasing and the development of efficient and accurate automatic methods is, therefore, of major importance.
The task of multiple sequence alignment can be understood with reference to figure 1, which shows three sequences of DNA data, each comprising 8 signifiers representing 8 bases. The goal is to find the best possible alignment between all three strings, given some model of the similarity between strings, which may include character insertions, character deletions and substitutions, and to provide some indication of the degree to which all three strings match each other. In order to align just two sequences, it is standard practice to use dynamic programming. This guarantees a mathematically optimal alignment, given a table of scores for matches and mismatches between all characters and penalties for insertions or deletions of different lengths.
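By way of illustration only, the pairwise dynamic programming referred to above can be sketched as a Needleman-Wunsch style recurrence. The function name, the default scores and the use of a single linear gap penalty are assumptions made for this sketch; the text itself refers to a general table of scores and to length-dependent gap penalties.

```python
def pairwise_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score of s and t by dynamic programming.

    The scores are illustrative assumptions: one value for a match, one for
    a mismatch, and a linear penalty per inserted or deleted character.
    """
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            deletion = score[i - 1][j] + gap    # s[i-1] aligned to a gap
            insertion = score[i][j - 1] + gap   # t[j-1] aligned to a gap
            score[i][j] = max(diagonal, deletion, insertion)
    return score[-1][-1]
```

The table has one cell per pair of prefixes, so the work grows with the product of the two string lengths. Extending the same table to many strings adds one dimension per string, which is the combinatorial growth discussed in the following paragraph.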
Attempts at generalising dynamic programming to multiple alignments have been limited to small numbers of short sequences. This is because of the combinatorial nature of the problem. For example, for much more than eight or so proteins of average length, the problem is incomputable given current computer power. Therefore, all current methods capable of handling larger problems in practical time scales make use of heuristics.
Currently, the most widely used approach for multiple sequence alignment is the progressive method of Feng and Doolittle. This exploits the fact that homologous sequences are related through evolution. It is therefore assumed possible to build up a multiple alignment progressively by a series of pairwise alignments (following the branching order in a phylogenetic tree).
First the most closely related sequences are aligned, gradually adding in the more distant ones.
In simple cases, the quality of the alignments may be good. In more difficult cases, the alignments may give starting points for further automatic or manual refinement.
The major reason for the limitations of this prior art approach is the limited pattern recognition it employs. Multiple sequence alignment is realised through a chain of pairwise comparisons. Only the best-guess alignments early on in the chain are passed on in processing.
That is, best-guess information is passed up the processing chain. The success of this approach depends critically on obtaining good initial alignments, but this is not possible in general, and this approach does not guarantee that the best global alignment will be obtained, as that would require all the sequence data to be considered. In this specification the term global is used to indicate that all possible eventualities are considered. Hence, for instance, the global solution is the best solution out of the set of all possible solutions.
The result of utilising a 'best-guess' approach is that errors (which correspond to misalignments of the sequences compared to the best alignment of the sequences) which are introduced early on necessarily pass on to subsequent stages, causing mistakes there and thereby leading to a non-optimal solution. Attempts to improve on the recovered alignment may be subsequently made, but in essence, these simply make minor refinements to the current best-guess solution (i.e. performing a gradient-based search around the local solution) and are incapable of recovering from non-trivial errors.
The present invention relates to a new approach to determining the degree of match between multiple sequences of signifiers, which is fast and gives good alignments of sequences under a wide range of realistic conditions.
According to a first aspect of the present invention, there is provided a method of determining the degree of match between a plurality of strings of signifiers, comprising the steps of:
(i) identifying a possible signifier match between two signifiers in different strings;
(ii) determining an upper bound on the probability of any global match solution containing the possible signifier match;
(iii) comparing the upper probability bound with a threshold probability; and
(iv) eliminating from the set of global match solutions those solutions including the possible signifier match if the upper probability bound for the possible match solution is less than the threshold probability.
Given the available resources, a suitable means of computing an upper bound probability for regions of the solution space is defined. Through an iterative process, regions with low upper bounds are eliminated by comparison with a threshold, and then effort is re-applied to those regions that remain. As more and more of the solution space is eliminated, so the size of the regions covering the remaining space can be reduced without compromising resources, and more accurate upper bounds can be evaluated. In this way, the optimal solution can be identified through a process of exclusion.
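By way of illustration only, the elimination of regions by an upper bound can be sketched for the simplified case of two strings. Everything in the sketch is an assumption made for the example and is not taken from the specification: the names, the constants `SUBSTITUTION`, `GAP` and `DECAY`, and the model in which a solution's probability is a product of one factor per signifier of the first string. The bound used is deliberately coarse: it ignores the order-preserving constraint and lets every other signifier take its best-case factor, so it can never be lower than the probability of any real solution containing the match.

```python
from math import prod

# Hypothetical model parameters, chosen only for this sketch.
SUBSTITUTION = {("a", "g"): 0.3, ("g", "a"): 0.3}  # allowed a<->g substitution
GAP = 0.05    # factor for leaving a signifier unmatched
DECAY = 0.5   # factor applied per position of displacement

def match_factor(s, t, i, j):
    """Factor contributed by matching s[i] to t[j]; 0.0 if not allowed."""
    base = 1.0 if s[i] == t[j] else SUBSTITUTION.get((s[i], t[j]), 0.0)
    return base * DECAY ** abs(i - j)

def candidate_matches(s, t):
    """All signifier matches that the model permits."""
    return [(i, j) for i in range(len(s)) for j in range(len(t))
            if match_factor(s, t, i, j) > 0.0]

def best_case(s, t, i):
    """Largest factor that position i of s could contribute to any solution."""
    return max([GAP] + [match_factor(s, t, i, j) for j in range(len(t))])

def upper_bound(s, t, match):
    """Coarse upper bound on the probability of any solution containing match."""
    i, j = match
    others = prod(best_case(s, t, k) for k in range(len(s)) if k != i)
    return match_factor(s, t, i, j) * others

def eliminate(s, t, threshold):
    """Keep only the matches whose upper bound reaches the threshold."""
    return [m for m in candidate_matches(s, t) if upper_bound(s, t, m) >= threshold]
```

With these assumed parameters, `eliminate("cct", "cct", 0.6)` returns `[(0, 0), (1, 1), (2, 2)]`: the two out-of-position c matches have an upper bound of 0.5 and are excluded, together with every solution that would have contained them.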
The method and system are particularly suitable for determining the degree of matching between multiple sequences of DNA or proteins. However, the invention is not limited to DNA or protein sequence matching, and can be applied in any field in which it is desired to determine the degree of match between multiple sequences of signifiers which represent either a physical entity (e.g. a base) or a non-physical entity (e.g. a word). The term signifier is considered to encompass all ways of representing an item in a sequence of items. For instance the invention can be used to determine the degree of match between scan lines in an image, by matching rows of pixels in one image to rows of pixels in a second image in order to determine correspondences between the two images. A suitable signifier in this case would be a measurement vector, giving the displacement of a pixel from an origin.
Strings of any type of signifiers can be matched because the sequence alignment invention is an order-preserving string matcher. So any suitable signifier can be used provided it can be used to determine the position of an element in the string of elements. Further, the invention can use any model for the similarity between two signifiers in different strings.
Preferably, the upper bound is determined according to Bayesian probability theory.
Preferably, the step of identifying a possible signifier match is repeated so as to identify all possible signifier matches between a first signifier in a first string and each signifier in each of the other plurality of strings.
Preferably, the step of identifying a possible signifier match is repeated for each signifier in each of the plurality of strings so as to identify all possible signifier matches. In this way, the method is applied simultaneously to all possible sequence alignments. The method ensures that all plausible alignments are examined. Processing is effectively the task of eliminating implausible match schemes so as to home in on the best match scheme, ie the solution, through a process of exclusion.
Preferably, the method includes the step of repeating steps (iii) and (iv) for all the possible signifier matches identified.
Preferably, only an identical signifier is a possible signifier match.
Preferably, more than one signifier is a possible signifier match.
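By way of illustration only, the identification of all possible signifier matches can be sketched as follows. The function name, the representation of allowed substitutions as a set of unordered pairs, and the choice to enumerate every pair of strings are assumptions made for this sketch. With an empty substitution set only identical signifiers match; supplying pairs allows more than one signifier to be a possible match.

```python
from itertools import combinations

def possible_matches(strings, substitutions=frozenset()):
    """Map each pair of string indices to its allowed (i, j) signifier matches.

    substitutions is a set of frozensets, each holding two signifiers that
    the (hypothetical) model allows to be substituted for one another.
    """
    allowed = {}
    for (a, s), (b, t) in combinations(enumerate(strings), 2):
        allowed[(a, b)] = [
            (i, j)
            for i, x in enumerate(s)
            for j, y in enumerate(t)
            if x == y or frozenset((x, y)) in substitutions
        ]
    return allowed
```

For example, `possible_matches(["ca", "cg"])` finds only the c to c match `(0, 0)`, whereas passing `substitutions={frozenset("ag")}` also admits the match `(1, 1)` between a and g.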
Preferably, the method includes the step of recalculating the threshold. In this way different regions of phase space can be investigated more accurately, if the difference between local solutions is not sufficient to determine the best solution.
Preferably, the method includes the step of determining whether an acceptable number of possible global solutions have been determined. Once a tractable number of possible solutions have been determined, the main body of the method can be terminated.
Preferably, the method includes the step of recalculating the threshold if an acceptable number of possible global solutions have not been determined. This helps to allow the best solution to be identified by enabling local solutions to be distinguished.
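By way of illustration only, the recalculation of the threshold until a tractable number of candidates remains can be sketched as a simple loop. The function name, the multiplicative `step` used to raise the threshold and the dictionary of precomputed upper bounds are assumptions made for this sketch. In particular, the bounds are not recomputed between passes here, although a fuller implementation might re-evaluate them as the solution space shrinks.

```python
def prune_until_tractable(bounds, threshold, max_remaining, step=1.5):
    """Raise the threshold until at most max_remaining candidates survive.

    bounds maps each candidate match to its upper-bound probability.
    Returns the surviving candidates and the final threshold used.
    """
    if threshold <= 0 or step <= 1:
        raise ValueError("threshold must be positive and step greater than 1")
    remaining = dict(bounds)
    while True:
        # Eliminate every candidate whose upper bound is below the threshold.
        remaining = {m: b for m, b in remaining.items() if b >= threshold}
        if len(remaining) <= max_remaining:
            return remaining, threshold
        # Too many candidates remain: recalculate (here, simply raise) the threshold.
        threshold *= step
```

The loop always terminates, because the bounds are finite and the threshold grows on every pass; the remaining candidates can then be searched exhaustively.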
At least two of the plurality of strings can have the same number of signifiers. Each of the plurality of strings can have the same number of signifiers. Each of the plurality of strings can represent a sequence of DNA, or a protein or proteins. Each signifier can represent a base.
According to a further aspect of the invention, there is provided a computer system for determining the degree of match between a plurality of strings of signifiers, comprising processing means, and the processing means operating on data representing a plurality of strings of signifiers, to:
(i) identify a possible signifier match between two signifiers in different strings;
(ii) determine an upper bound to the probability of any global match solution containing the possible signifier match;
(iii) compare the upper probability bound for the possible match with a threshold probability; and
(iv) eliminate from the set of global match solutions those solutions including the potential match if the upper probability bound for the possible match solution is less than the threshold probability.
According to further aspects of the invention there are provided a computer program, computer program code and a computer readable medium bearing instructions, which when executed by a computer carry out the method of the invention or provide the system of the invention.
An embodiment of the invention will now be described in detail, by way of example only, and with reference to the accompanying drawings, in which:
Figure 1 shows a schematic diagram illustrating the matching method according to the invention being applied to three strings of signifiers representing DNA sequences;
Figure 2 shows a flow chart illustrating the method according to the present invention;
Figure 3 shows a sequence of diagrams illustrating the probability of all the possible solutions;
Figure 4 shows pairs of strings of signifiers illustrating aspects of a matching model used in the method; and
Figure 5 shows a schematic diagram of a sequence matching computer system according to a further aspect of the invention.
Although the invention is described with reference to its application in the field of DNA sequence matching, it will be appreciated that the invention is applicable to any situation in which it is required to determine the degree of the best match between a plurality of strings of signifiers representative of physical or non-physical entities. The same items in different Figures share common reference numerals unless indicated otherwise.
Firstly a general discussion of the method and system for determining the best match between a plurality of strings of DNA bases is provided, before a more detailed mathematical description.
Figure 1 shows representations of three DNA sequences a1, a2, a3, each comprising a string 110, 120, 130 of eight bases represented by the signifiers, or characters, c, t, a and g. The problem addressed is to determine the actual degree of match, similarity or alignment, between the three DNA strings, from all the possible matching schemes that the rules of a model permit.
In this case, the model of how DNA behaves has the rules that identical bases match, i.e. c↔c, t↔t, a↔a and g↔g. These are all possible matches. The rules of the DNA model also permit substitutions of some bases by others, so that in this case the match a↔g is also a possible match. The model also permits spacings in the sequences, so that a possible match base in a second string does not have to be in the same position as a base in a first string in order for there to be a degree of matching. However, the further out of position the possible match base, the lesser the degree of matching. The strength of matching as a function of the position of the base is a feature of the model used.
At the onset of processing all possible matches are available to the system as possibilities, bar those eliminated due to prior knowledge: e.g. if we know that two signifiers cannot match, they may be excluded from consideration.
With reference also to Figure 2, as a first step 210 in the matching method of the invention, all possible matching schemes that are allowed within the rules of the model are identified. With reference to Figure 1a, the leftmost signifier, or character, representing a c base in the first string of DNA a1 is considered and all its possible matches with all the bases in the second string of DNA a2 are identified. As the only possible match is to another c base, only matches to the first and second bases in the second string are identified. The match to the first c base in the second string is stronger than the match to the second c base, as their relative positions in the strings are the same.
Once all the possible signifier matches have been identified for the first base, the possible matches for the next base in the first string are identified. This is then repeated for each base in the string until all the possible signifier matches for all the signifiers in the first string with the second string have been identified. Note that owing to the allowable substitution a↔g, the penultimate base in the first string has possible matches to both a and g bases in the second string.
This procedure is then repeated for the second DNA string a2 with respect to the third DNA string a3 so as to identify all possible base matches between the second and third strings.
This procedure is then repeated for the third DNA string a3 with respect to the first DNA string a1 so as to identify all possible base matches between the third and first strings, which are not shown in Figure 1a for the sake of clarity.
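The identification step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of the invention: the function names `can_match` and `candidate_matches`, the dictionary layout and the toy strings are all hypothetical, and only the substitution rule (identical bases plus a↔g) is taken from the example model.

```python
from itertools import combinations

# Assumed substitution rule from the example model: identical bases match,
# and a may additionally match g (in either direction).
ALLOWED_SUBSTITUTIONS = {("a", "g"), ("g", "a")}


def can_match(x, y):
    """Return True if the model permits signifier x to match signifier y."""
    return x == y or (x, y) in ALLOWED_SUBSTITUTIONS


def candidate_matches(strings):
    """Map (alpha, beta, i) to the list of positions j in string beta that
    character i of string alpha may match, for every pair alpha < beta."""
    lists = {}
    for alpha, beta in combinations(range(len(strings)), 2):
        for i, x in enumerate(strings[alpha]):
            lists[(alpha, beta, i)] = [
                j for j, y in enumerate(strings[beta]) if can_match(x, y)
            ]
    return lists


# Hypothetical toy data (positions are 0-based); not the strings of Figure 1.
strings = ["ccta", "cgtc", "actg"]
lists = candidate_matches(strings)
print(lists[(0, 1, 0)])  # candidate positions in string 1 for the first c of string 0
```

Each list plays the role of the set of possible signifier matches from which matching schemes are later built and pruned.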
All possible string matching schemes, ie combinations of one to one mappings between the bases of each string for all three strings, can be constructed from the possible base matches. The task is then to determine which of all of the possible matching schemes has the greatest degree of match between all three strings; ie to determine from the set of all possible solutions the solution having the greatest degree of matching, which therefore is the best match between all the strings within the matching rules of the model.
The next step 220 is to determine an upper bound on the probability of a solution including a particular one of the possible base matches. Figure 3a shows a diagram 310 illustrating steps of the method. The left ordinate axis represents a measure of probability, the right ordinate axis represents the degree of matching between the strings, and the abscissa represents the set of all possible matching schemes, ie the set of all possible solutions to the matching problem. The line 315 shows the degree of matching between the strings for a particular matching scheme. The best match, ie the solution to the matching problem, is given by the matching scheme 320.
Consider the possible match of the last t base in the first DNA string with the first t base in the second DNA string.
This possible match has associated with it a set of matching schemes or solutions 330. Although shown as occupying a connected, single part of the set of possible solutions, for the sake of simplicity, the set of solutions may well be distributed about the set of all possible solutions. An upper bound 335 for the probability of the set of matching schemes 330 that include that particular possible signifier match being the solution is calculated.
Next, a threshold probability 340 is calculated in line with Bayesian probability theory. The upper bound 335 is compared 230 with the threshold. As the upper bound is less than the threshold probability, all possible matching schemes, or solutions, including that possible match can be eliminated 345 from the set of possible solutions 240, as illustrated in Figure 3b. Processing is the task of eliminating implausible matches and seeing how this affects the possible solutions. This is an iterative process. For example, elimination of an unlikely possible signifier match in itself leads to the elimination of other possible matching schemes, since they were dependent upon that signifier match for their existence in the first place.
The procedure is then repeated 250 for each of the possible signifier matches identified. Consider the possible match of the second c in the first string and the first c in the second string. There is a set of possible matching schemes 350 including this particular match. An upper bound 352 for the probability of a solution containing this match is calculated, and compared with the threshold probability 340. However, as illustrated in Figure 3b, as the upper bound on the probability of the solution containing this possible match is greater than the threshold probability, this set of possible solutions 350 is retained.
The remaining possible signifier matches are processed in the same way. For instance, the set of solutions 356 containing the identified possible match of the last g in the second string and the first g in the third string has the upper bound on its probability 358 calculated and compared with the threshold, and the set of solutions including that possible match is eliminated.
Eventually the position illustrated in Figure 3d will be reached, in which all possible solutions having an upper probability bound lower than the threshold have been eliminated. This can be schematically represented by Figure 1b. A greatly reduced subset of possible signifier matches remains out of the original set of all possible signifier matches. It is determined 250 whether sufficient possible solutions have been eliminated for the remaining solutions to be evaluated individually in an acceptable time by the processing power available. The number of possible signifier matches can be reduced further by repeating the method, with a recalculated threshold probability 360. Hence when the signifier match of the first c in the first string and the second c in the second string is considered again, the upper bound on the probability of that set of solutions is less than the recalculated threshold 360 and so that possible signifier match can be eliminated.
Eventually a computationally tractable set 362 of possible signifier match schemes, or solutions, is identified out of all the possible solutions. This set of solutions 360 is exhaustively searched 260 by any conventional technique in an acceptable amount of time by the processing power available so as to determine the matching scheme 362 having the greatest degree of match between all three strings. The results of the process are then saved 270, to provide an indication of the matching scheme, as illustrated in Figure 1c, having the greatest degree of matching, or alignment, between the strings and a measure of that degree of matching.
Through this iterative process of eliminating implausible matches, good global alignments are identified by exclusion. This is in contrast to all existing methodologies that attempt to identify a global solution directly through the propagation of best-guess pairwise alignments.
The approach uses pattern recognition based upon three key conditions:
1. Calculations are underpinned by Bayesian probability theory.
2. The method requires that all solutions (i.e., all possible alignments) be assessed.
3. Processing is resource-driven such that the calculations that can be performed are constrained by the memory available and the speed of operations required, as defined by the operator.
Mathematical aspects of the method will now be described in greater detail, with particular reference to Figures 1 and 4.
Consider a set of K strings of characters $a = \{a_1, \ldots, a_K\}$ which may represent K sequences of DNA or protein sequence data. The goal is to derive the best global alignment for the strings given some model of similarity, which may include character insertions, character deletions and substitutions.
From conditions 2 and 3 an holistic, probability theory approach is utilised, requiring:
(1) $T = \arg\max_{T \in \Phi} P(T^{*} = T \mid a)$
where $T = \{T_{\alpha\beta ij}$, for all $\alpha, \beta \in K$, for all $i \in L_\alpha$, $j \in L_\beta\}$ is the binary match matrix for the strings, $\Phi$ is the space of possible global solutions for $T$, and $L_\alpha$ is the length of string $\alpha$. The two characters indexed $i$ in string $\alpha$ and $j$ in string $\beta$ are matched (aligned) in the global solution if and only if $T_{\alpha\beta ij} = 1$.
This aim is not evaluated directly, i.e., by actively searching for and refining solutions within the global solution space, this being the approach of existing gradient-based techniques. Rather, the best solutions are determined indirectly, by eliminating bad solutions from $\Phi$. In doing so all of the solution space is implicitly examined, as required by condition 2, as follows.
Solutions are grouped together, since examining each individual solution in isolation is computationally intractable in general and would thereby break condition 3.
Consider all solutions that contain the individual match $T_{\alpha\beta ij} = 1$, say. That is, the strings $a_\alpha$ and $a_\beta$ are aligned and fixed at $a_{\alpha i} \leftrightarrow a_{\beta j}$.
The maximum probability of any one of these solutions is
(2) $U(T_{\alpha\beta ij} = 1) = \max_{T' \in \Phi'} P(T_{\alpha\beta ij} = 1, T' \mid a)$
where $T'$ denotes the matches for all characters excluding the pair under consideration, and $\Phi'$ is the space of possible solutions for this set.
Now any group of solutions whose lowest upper bound probability is below some known lower bound value, $L^{(n)}$, cannot contain the optimum solution. Therefore, we can eliminate these groups from consideration. The rule for $T_{\alpha\beta ij}$ at some iteration time $n$ is:
eliminate any solution containing the match $T_{\alpha\beta ij} = 1$ if
(3) $U(T_{\alpha\beta ij} = 1) < L^{(n)}$
By eliminating this set of solutions the size of the space which needs to be considered at the next time step is effectively reduced. That is, the new search space at time $n+1$, $\Phi^{(n+1)}$, will not contain these solutions, which will affect future processing. In relation to the alignment, if the possibility $T_{\alpha\beta ij} = 1$ is excluded, then this will affect the upper bound on other matches at the next iteration.
The computation of the upper bound has not yet been defined, and in general may be computationally expensive, thereby breaking condition 3. The solution is to identify quantities of the form $Y^{(n)}$ such that $Y^{(n)} \ge U^{(n)}$ which can be computed in a given time and using a given amount of memory. The elimination rules then become:
eliminate any solution containing the match $T_{\alpha\beta ij} = 1$ if
(4) $Y^{(n)}(T_{\alpha\beta ij} = 1) < L^{(n)}$
$Y^{(n)}$ is evaluated by combining Bayesian probability theory with rules of inequality. Its form may change over the iterative cycles in order to accommodate condition 3. For example, at the onset of processing $Y^{(n)}$ may be coarsely and quickly evaluated, but provided it obeys $Y^{(n)} \ge U^{(n)}$ then only bad solutions will be eliminated. Towards the end of processing when only a few solutions remain, a more sophisticated and computationally intensive means of computing $Y$ may be employed, such that $Y^{(n)}$ approximates $U^{(n)}$ provided condition 3 is not violated.
Processing will continue until no solutions fall below the relevant threshold. At any time processing may be re-started by heuristically increasing the threshold, or alternatively, the remaining solutions may be recorded and processed in some manner .
In summary, the global solution space Φ is iteratively reduced by identifying and eliminating implausible matches. Elimination is achieved by comparing an upper bound on the probability of any global solution containing a match against a threshold. Computational overheads are addressed by using a coarseness function Y that, whilst not necessarily delivering the lowest upper bound, is sufficient for identifying inappropriate regions of the solution space.
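The iterative reduction summarised above can be illustrated with a short Python sketch. It is a toy under stated assumptions, not the method as claimed: the function `prune`, the callback `upper_bound`, the candidate names and all numbers are hypothetical, and the only property relied upon is that the bound never underestimates the best solution containing a given match.

```python
def prune(candidates, upper_bound, threshold):
    """Repeatedly remove candidate matches whose upper bound is below threshold.

    upper_bound(match, alive) is assumed never to underestimate the best
    probability of any solution containing `match`, given the surviving set.
    """
    alive = set(candidates)
    changed = True
    while changed:  # iterate until a full pass eliminates nothing
        changed = False
        for match in sorted(alive):
            if upper_bound(match, alive) < threshold:
                alive.remove(match)
                changed = True
    return alive


# Hypothetical toy model: each match has its own probability and needs the
# support of at least one surviving partner match.
OWN = {"A": 0.9, "B": 0.3, "C": 0.8, "D": 0.9}
PARTNERS = {"A": ["D"], "B": ["A"], "C": ["B"], "D": ["A"]}


def toy_bound(match, alive):
    """Own probability times the best surviving partner probability."""
    support = max((OWN[p] for p in PARTNERS[match] if p in alive), default=0.0)
    return OWN[match] * support


survivors = prune(OWN, toy_bound, threshold=0.5)
print(sorted(survivors))  # ['A', 'D']
```

In this toy, match C looks strong in isolation but is eliminated once its only supporter B has gone, which mirrors the way the elimination of one signifier match lowers the bounds of matches that depended on it.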
A detailed mathematical description is now given of the application of this invention to the alignment of DNA or protein sequence data, and in particular to the computation of upper bound quantities for match solutions. The development leads to relatively simple expressions for these upper bound quantities.
Consider the upper bound quantity
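(5) $U(T_{\alpha\beta ij} = 1) = \max_{T' \in \Phi'} P(T_{\alpha\beta ij} = 1, T' \mid a)$
i.e., the maximum probability associated with any global solution containing the individual match $a_{\alpha i} \leftrightarrow a_{\beta j}$.
Exact development is intractable due to the complex interactions between pairs. However, obtaining an upper bound is straightforward. By noting that $\max_{X,Y} P(x \in X, y \in Y)$ is upper bounded by $\max_{X} P(x \in X) \max_{Y} P(y \in Y)$, it follows that
(6) $U(T_{\alpha\beta ij} = 1, T' \mid a) \le \max_{T'_{\alpha\beta}} P(T_{\alpha\beta ij} = 1, T'_{\alpha\beta} \mid a) \prod_{\gamma \ne \alpha,\beta} \max_{T_{\gamma\beta}} P(T_{\gamma\beta}, T_{\alpha\beta ij} = 1 \mid a) \max_{T_{\alpha\gamma}} P(T_{\alpha\gamma}, T_{\alpha\beta ij} = 1 \mid a) \prod_{\gamma \ne \alpha} \prod_{\delta \ne \beta, \delta < \gamma} \max_{T_{\gamma\delta}} P(T_{\gamma\delta}, T_{\alpha\beta ij} = 1 \mid a)$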
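where $T_{\alpha\beta}$ is the set of matches between strings $\alpha$ and $\beta$. The expression in (6) can be expanded to make explicit the contribution from the characters under consideration:
(7) $U(T_{\alpha\beta ij} = 1, T' \mid a) \le \max_{T'_{\alpha\beta}} P(T_{\alpha\beta ij} = 1, T'_{\alpha\beta} \mid a) \prod_{\gamma \ne \alpha,\beta} \max_{k \in L_{\alpha\gamma i}} \max_{T'_{\alpha\gamma}} P(T_{\alpha\gamma ik} = 1, T'_{\alpha\gamma} \mid a) \max_{k \in L_{\beta\gamma j}} \max_{T'_{\gamma\beta}} P(T_{\gamma\beta kj} = 1, T'_{\gamma\beta} \mid a) \prod_{\gamma \ne \alpha} \prod_{\delta \ne \beta, \delta < \gamma} \max_{k \in L_{\alpha\gamma i},\, l \in L_{\beta\delta j}} \max_{T'_{\gamma\delta}} P(T_{\gamma\delta kl} = 1, T'_{\gamma\delta} \mid a)$
where $L_{\alpha\beta i}$ is the list of possible matches in sequence $\beta$ for character $i$ in string $\alpha$.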
According to (7), an upper bound probability for a global solution containing the individual match $T_{\alpha\beta ij} = 1$ is evaluated by considering the alignments for pairs of sequences. In the method, the assessments are made simultaneously over all possible pairs of sequences and, further, they are used to compute an upper bound on all plausible global solutions rather than to assess and refine a few sub-optimal solutions.
The quantity of interest in (7) is of the form $\max_{T'_{\alpha\beta}} P(T_{\alpha\beta ij} = 1, T'_{\alpha\beta} \mid a)$, that is, the maximum probability of a pairwise solution given an individual match.
In order to develop this quantity, consider all possible alignments of strings $\alpha$ and $\beta$ that contain the individual match $T_{\alpha\beta ij} = 1$. That is, the strings are aligned and fixed at $a_{\alpha i} \leftrightarrow a_{\beta j}$, and it is necessary to compute the upper bound on any pairwise solution that contains this match. The following notation is used:
(8) $u(T_{\alpha\beta ij} = 1) = \max_{T'_{\alpha\beta}} P(T_{\alpha\beta ij} = 1, T'_{\alpha\beta} \mid a)$
By applying Bayes' rule and considering substrings on the left and right of the line defined by $T_{\alpha\beta ij} = 1$, this can be rewritten as:
(9) $u(T_{\alpha\beta ij} = 1) = L(T_{\alpha\beta ij} = 1)\, R(T_{\alpha\beta ij} = 1)\, P(T_{\alpha\beta ij} = 1 \mid a) / p(a)$
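where $L(T_{\alpha\beta ij} = 1)$ is the contribution from characters to the left-hand side and is given by
(10) $L(T_{\alpha\beta ij} = 1) = \max_{T^{\leftarrow}_{\alpha\beta ij}} P(T_{\alpha\beta ij} = 1, T^{\leftarrow}_{\alpha\beta ij} \mid a)$
where $T^{\leftarrow}_{\alpha\beta ij}$ is shorthand for these assignments, and likewise $R(T_{\alpha\beta ij} = 1)$ is the contribution from characters to the right-hand side given by
(11) $R(T_{\alpha\beta ij} = 1) = \max_{T^{\rightarrow}_{\alpha\beta ij}} P(T_{\alpha\beta ij} = 1, T^{\rightarrow}_{\alpha\beta ij} \mid a)$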
The contribution made by the left hand side, $L(T_{\alpha\beta ij} = 1)$, is now developed. Development for the right hand side is immediate by analogy. Consider when a match might next be encountered on moving leftwards from $a_{\alpha i} \leftrightarrow a_{\beta j}$. In doing so it is necessary to take into account that gaps may be introduced in the strings.
With reference to Figures 4a, b, c and d there are four cases: (i) no gap to the match, i.e., $a_{\alpha,i-1} \leftrightarrow a_{\beta,j-1}$; (ii) a gap in $a_\alpha$ but not in $a_\beta$; (iii) a gap in $a_\beta$ but not in $a_\alpha$; or (iv) no further match.
These cases are exhaustive and mutually exclusive. It is therefore possible to consider each in turn and look for the maximum response in (8) .
Case 1: No Gap
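In this case the hypothesis $T_{\alpha\beta,i-1,j-1} = 1$ is made, so the contribution from the left hand side is
(12) $L(T_{\alpha\beta ij} = 1) = \max_{T^{\leftarrow}_{\alpha\beta,i-1,j-1}} P(T_{\alpha\beta,i-1,j-1} = 1, T^{\leftarrow}_{\alpha\beta,i-1,j-1} \mid T_{\alpha\beta ij} = 1, a)$
which, by assuming that information about the match $a_{\alpha i} \leftrightarrow a_{\beta j}$ is redundant if a nearer match is to hand, i.e., $a_{\alpha,i-1} \leftrightarrow a_{\beta,j-1}$, gives
(13) $L(T_{\alpha\beta ij} = 1) = \max_{T^{\leftarrow}_{\alpha\beta,i-1,j-1}} P(T^{\leftarrow}_{\alpha\beta,i-1,j-1} \mid T_{\alpha\beta,i-1,j-1} = 1, a)\, P(T_{\alpha\beta,i-1,j-1} = 1 \mid T_{\alpha\beta ij} = 1, a)$
which using (8) leads to the recursive rule:
$L(T_{\alpha\beta ij} = 1) = L(T_{\alpha\beta,i-1,j-1} = 1)\, P(T_{\alpha\beta,i-1,j-1} = 1 \mid T_{\alpha\beta ij} = 1, a_{\alpha,i-1}, a_{\beta,j-1})$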
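Case 2: Gap in $a_\beta$
In this case it is necessary that the nearest match for $a_{\alpha,i-1}$ exists at some point $k < j-1$ in $a_\beta$. It is necessary to consider all possible non-zero gap lengths:
(14) $L(T_{\alpha\beta ij} = 1) = \max_{k < j-1} \max_{T^{\leftarrow}_{\alpha\beta,i-1,k}} P(T_{\alpha\beta,i-1,k} = 1, T^{\leftarrow}_{\alpha\beta,i-1,k} \mid T_{\alpha\beta ij} = 1, a)$
which becomes
(15) $L(T_{\alpha\beta ij} = 1) = \max_{k < j-1} \max_{T^{\leftarrow}_{\alpha\beta,i-1,k}} P(T^{\leftarrow}_{\alpha\beta,i-1,k} \mid T_{\alpha\beta,i-1,k} = 1, a)\, P(T_{\alpha\beta,i-1,k} = 1 \mid T_{\alpha\beta ij} = 1, a)$
leading to the recursive rule:
(16) $L(T_{\alpha\beta ij} = 1) = \max_{k < j-1} L(T_{\alpha\beta,i-1,k} = 1)\, P(T_{\alpha\beta,i-1,k} = 1 \mid T_{\alpha\beta ij} = 1, a_{\alpha,i-1}, a_{\beta k})$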
Case 3: Gap in $a_\alpha$
By analogy with case 2 the rule is
(17) $L(T_{\alpha\beta ij} = 1) = \max_{k < i-1} L(T_{\alpha\beta,k,j-1} = 1)\, P(T_{\alpha\beta,k,j-1} = 1 \mid T_{\alpha\beta ij} = 1, a_{\alpha k}, a_{\beta,j-1})$
Case 4: No match
If there is no match remaining then standard models adopt the form
(18) $L(T_{\alpha\beta ij} = 1) = \max\{c^{(i-1)}, c^{(j-1)}\}$
where $c$ is a constant. In order to evaluate $L(T_{\alpha\beta ij} = 1)$ a model is needed for $P(T_{\alpha\beta kl} = 1 \mid T_{\alpha\beta ij} = 1, a_{\alpha k}, a_{\beta l})$, noting that $\{k, l\}$ indexes the next match to the left of $\{i, j\}$.
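By applying Bayes' rule and re-organising:
(19) $P(T_{\alpha\beta kl} = 1 \mid T_{\alpha\beta ij} = 1, a_{\alpha k}, a_{\beta l}) = p(a_{\alpha k} \mid a_{\beta l}, T_{\alpha\beta kl} = 1, T_{\alpha\beta ij} = 1)\, P(T_{\alpha\beta kl} = 1 \mid a_{\beta l}, T_{\alpha\beta ij} = 1) / p(a_{\alpha k} \mid a_{\beta l}, T_{\alpha\beta ij} = 1)$
which becomes
(20) $P(T_{\alpha\beta kl} = 1 \mid T_{\alpha\beta ij} = 1, a_{\alpha k}, a_{\beta l}) = p(a_{\alpha k} \mid a_{\beta l}, T_{\alpha\beta kl} = 1)\, P(T_{\alpha\beta kl} = 1 \mid T_{\alpha\beta ij} = 1) / p(a_{\alpha k} \mid a_{\beta l}, T_{\alpha\beta ij} = 1)$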
Conventionally the measurement distribution, $p(a_{\alpha k} \mid a_{\beta l}, T_{\alpha\beta kl} = 1)$, is modelled via a PAM weight matrix, with entries of the form $\mathrm{sub}(a_{\alpha k}, a_{\beta l}) = \exp(s(a_{\alpha k}, a_{\beta l}))$, for example, penalising different substitutions, and the transition probability, $P(T_{\alpha\beta kl} = 1 \mid T_{\alpha\beta ij} = 1)$, between nearest matches by a gap penalty function, $\mathrm{gap}(\Delta l) = \exp(g(\Delta l))$, for example, dependent on gap length with a constant for gap opening. The denominator is assumed constant. With these models the left hand side contribution in (10) is the maximum over the four cases:
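(21) $L(T_{\alpha\beta ij} = 1) = \max\{\, L(T_{\alpha\beta,i-1,j-1} = 1)\, \mathrm{sub}(a_{\alpha,i-1}, a_{\beta,j-1}),\ \max_{k < j-1} L(T_{\alpha\beta,i-1,k} = 1)\, \mathrm{gap}(j-k-1),\ \max_{k < i-1} L(T_{\alpha\beta,k,j-1} = 1)\, \mathrm{gap}(i-k-1),\ \max\{c^{(i-1)}, c^{(j-1)}\} \,\}$
By analogy, the contribution from the right hand side is
(22) $R(T_{\alpha\beta ij} = 1) = \max\{\, R(T_{\alpha\beta,i+1,j+1} = 1)\, \mathrm{sub}(a_{\alpha,i+1}, a_{\beta,j+1}),\ \max_{k > j+1} R(T_{\alpha\beta,i+1,k} = 1)\, \mathrm{gap}(k-j-1),\ \max_{k > i+1} R(T_{\alpha\beta,k,j+1} = 1)\, \mathrm{gap}(k-i-1),\ \max\{c^{(M-i-1)}, c^{(N-j-1)}\} \,\}$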
Note that these are recursive formulae and each involves an accumulation over at most $L_\alpha + L_\beta$ entries. The evaluation over all possible matches is therefore $O(L_\alpha L_\beta (L_\alpha + L_\beta))$ and is realised by computing $L(T_{\alpha\beta ij} = 1)$ and $R(T_{\alpha\beta ij} = 1)$ for all $i \in L_\alpha$ and for all $j \in L_\beta$, and substituting into (9):
(23) $u(T_{\alpha\beta ij} = 1) = L(T_{\alpha\beta ij} = 1)\, R(T_{\alpha\beta ij} = 1)\, \mathrm{sub}(a_{\alpha i}, a_{\beta j})$
A further saving in complexity can be achieved if the gap penalty function is linear, since this allows for further recursion, reducing the complexity of assessing all matches in a pair of strings to $O(L_\alpha L_\beta)$.
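The four-case recursion can be sketched as a dynamic programme in log space. The Python below is only an illustration under stated assumptions: the substitution weights, the gap penalty and the constant `LOG_C` are invented for the example; positions are 0-based, so the no-match term uses the number of characters to the left; and the substitution weight is applied to every predecessor match, including those reached across a gap, which is one reading of rules (16), (17) and (20) rather than a statement of the method itself.

```python
NEG = float("-inf")
LOG_C = -1.0  # hypothetical log of the constant c used when no match remains


def log_sub(x, y):
    """Hypothetical substitution log-weights: identity 0, a<->g -1, else forbidden."""
    if x == y:
        return 0.0
    if {x, y} == {"a", "g"}:
        return -1.0
    return NEG


def log_gap(length):
    """Hypothetical gap log-penalty: an opening constant plus a length-dependent cost."""
    return -2.0 - 0.5 * length


def left_scores(x, y):
    """left[i][j]: best log-score of everything to the left of the match (i, j)."""
    m, n = len(x), len(y)
    left = [[NEG] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Case 4: no further match; i and j characters lie to the left.
            best = max(LOG_C * i, LOG_C * j)
            if i > 0 and j > 0:
                # Case 1: the previous match is (i-1, j-1), with no gap.
                best = max(best, left[i - 1][j - 1] + log_sub(x[i - 1], y[j - 1]))
                # Case 2: the previous match is (i-1, k) with k < j-1.
                for k in range(j - 1):
                    step = log_sub(x[i - 1], y[k]) + log_gap(j - k - 1)
                    best = max(best, left[i - 1][k] + step)
                # Case 3: the previous match is (k, j-1) with k < i-1.
                for k in range(i - 1):
                    step = log_sub(x[k], y[j - 1]) + log_gap(i - k - 1)
                    best = max(best, left[k][j - 1] + step)
            left[i][j] = best
    return left


def pair_scores(x, y):
    """log u for every pair (i, j): left part + right part + substitution weight."""
    m, n = len(x), len(y)
    left = left_scores(x, y)
    # The right-hand contribution equals the left-hand contribution of the
    # reversed strings, because log_sub and log_gap are symmetric here.
    rev = left_scores(x[::-1], y[::-1])
    return {
        (i, j): left[i][j] + rev[m - 1 - i][n - 1 - j] + log_sub(x[i], y[j])
        for i in range(m)
        for j in range(n)
    }


u = pair_scores("ct", "ct")
print(u[(0, 0)], u[(0, 1)])
```

Each table entry scans at most one row segment and one column segment, so the cost for a pair of strings of lengths m and n is of order mn(m+n), in line with the complexity stated above.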
Returning to the upper bound on all global solutions given an individual match in (7), taking logarithms leads to the formula for scoring hypothesised matches:
(24) $S^{(n)}(T_{\alpha\beta ij} = 1) \le s^{(n)}(T_{\alpha\beta ij} = 1) + \sum_{\gamma \ne \alpha,\beta} \left[ \max_{k \in L^{(n)}_{\alpha\gamma i}} s^{(n)}(T_{\alpha\gamma ik} = 1) + \max_{k \in L^{(n)}_{\beta\gamma j}} s^{(n)}(T_{\gamma\beta kj} = 1) \right] + \sum_{\gamma \ne \alpha} \sum_{\delta \ne \beta, \delta < \gamma} \max_{k \in L^{(n)}_{\alpha\gamma i},\, l \in L^{(n)}_{\beta\delta j}} s^{(n)}(T_{\gamma\delta kl} = 1)$
where $S = \log U$ and $s = \log u$, respectively, and the suffix $n$ has been introduced to indicate that we have an iterative process where the lists of plausible matches, $L$, and in consequence the scores, $s$ and $S$, change over time. For example, as one match is eliminated, so this affects the left-hand and right-hand contributions to other matches. In this way, information about elimination propagates throughout the system.
In practice at the onset of processing only the first term in the sum might be used, i.e., only the score $s(T_{\alpha\beta ij} = 1)$ would be computed, whilst all other terms would be set at their maximum. While this would lead to an overestimated upper bound, it would allow very many obviously bad matches to be eliminated. Over time other terms would be added as resources allow.
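A coarse version of this scoring and elimination can be sketched in Python as below. It is an illustration under explicit assumptions: pairwise log-scores are supplied in a dictionary with invented values, only the pairwise term and the first summation are evaluated (the remaining terms are left at 0, taken here to be the maximum of a log-probability), and `NO_MATCH`, `key`, `best_link`, `global_bound` and `eliminate` are hypothetical names.

```python
NO_MATCH = -4.0  # hypothetical log-score for leaving a character unmatched


def key(a, i, b, j):
    """Order-independent key for matching character i of string a with j of b."""
    return tuple(sorted([(a, i), (b, j)]))


def best_link(s, a, i, g):
    """Best surviving pairwise log-score linking character i of string a to string g."""
    vals = [v for (p, q), v in s.items()
            if (p == (a, i) and q[0] == g) or (q == (a, i) and p[0] == g)]
    return max([NO_MATCH] + vals)


def global_bound(s, n_strings, a, i, b, j):
    """Pairwise score plus the best links of both characters to every other string."""
    total = s[key(a, i, b, j)]
    for g in range(n_strings):
        if g not in (a, b):
            total += best_link(s, a, i, g) + best_link(s, b, j, g)
    return total


def eliminate(s, n_strings, threshold):
    """Drop matches whose bound is below threshold until the lists stop changing."""
    s = dict(s)  # work on a copy so the caller's scores are untouched
    changed = True
    while changed:
        changed = False
        for (p, q) in list(s):
            if global_bound(s, n_strings, p[0], p[1], q[0], q[1]) < threshold:
                del s[(p, q)]
                changed = True
    return s


# Hypothetical pairwise log-scores for three short strings.
scores = {
    key(0, 0, 1, 0): -0.5,
    key(0, 0, 2, 0): -0.25,
    key(1, 0, 2, 0): -0.25,
    key(0, 0, 1, 1): -2.0,
    key(1, 1, 2, 0): -3.0,
}
kept = eliminate(scores, 3, threshold=-3.0)
print(sorted(kept))
```

Because omitted terms are held at their maximum, the bound can only be too generous, so a match removed by this sketch would also be removed by the fuller formula under the same assumptions.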
With reference to Figure 5, there is shown a schematic diagram of a computer system 500 according to the invention. The system includes a main processor, or processors 510, in communication with fast access memory 520 including RAM 524 and ROM 522 parts. The system includes input and output devices 530 and can be in communication with other computers via a network interface 540. A mass storage device 550, such as a hard disk, is also provided for storing files including data to be processed and data that has been processed.
An aspect of the invention is a computer program implementing the method as described above. The details of a suitable computer program are considered to be within the ability of a man of ordinary skill in the art, in view of the foregoing description of the method, and so have not been described in any detail. A general outline of the significant procedural steps to be implemented by a suitable computer program is provided below.
Read sequence data a_1, ..., a_K
10 For each string α <= K {
     For each string β > α {
       For each character i <= L_α {
         For each character j in L_αβi {
           Compute s(T_αβij = 1)
   } } } }
For each string α <= K {
  For each string β > α {
    For each character i <= L_α {
      For each character j in L_αβi {
        Compute S(T_αβij = 1)
        If S(T_αβij = 1) < threshold eliminate j from list L_αβi
} } } }
If change in lists go to 10
StandardRefineAlignment
Data 552 representing the DNA strings to be matched is stored in a file 555 accessible by the processor. A file 557 is also provided for storing the final results of the processing such as the matching solution identified and an indication of the degree of match of the strings. The software controlling operation of the processor can also be stored on the mass storage device and in RAM used for short term storage of data during processing of the matching method.
The act of aligning two or more sequences together provides useful information on the sequences, provided that the sequences can be compared in a biologically meaningful manner .
Once constructed, a multiple sequence alignment, whether made of DNA or amino acids, can yield information simply not present in a single sequence. Such alignments can be used to compare a number of very similar sequences to see where they are similar and where they differ. Similarities may signify group characteristics responsible for similarities in protein structure and hence behaviour, whereas differences may signify mutations which lead to unwanted structure and behaviour, perhaps causing some genetic medical disorder, for instance. By identifying such mutations it may be possible to apply corrective measures to eliminate the mutations when present and hence the unwanted effects.
Multiple alignments can be also used as input to phylogenetic analysis programs, to study the evolutionary relationships between sequences, and between organisms. They can also pinpoint areas either particularly conserved or particularly divergent between related sequences . This in turn can yield information on the evolutionary processes undergone by those sequences .
Furthermore, such alignments at the protein level, when used as input to suitable protein modelling software, can help in understanding, and predicting, the structure of the protein in a way that individual sequences simply cannot do.
Although the method has been described with reference to matching strings of DNA bases, it will be appreciated that it can be applied to determining the best match between any sequences of signifiers which represent physical or non-physical entities. For instance, the method can be applied in the study of languages to determine the similarities between groups of words or characters in the same or different languages. The method can be applied to the analysis of powder diffraction patterns so as to determine the similarity of structures of materials by comparing the degree of match of representations of their diffraction patterns.
If only two strings of signifiers are used, then the method can determine how closely the two match. If an entity and its representative string are known, then the method can be used to identify an unknown entity by determining the match between its string and the string of the known entity. If the match is perfect, then the two entities will be identical. Provided the properties of an entity can be represented by a string of signifiers, the method can be used to compare two or more of the entities.

Claims

1. A method of determining the degree of match between a plurality of strings of signifiers, comprising the steps of:
(i) identifying a possible signifier match between two signifiers in different strings;
(ii) determining an upper bound to the probability of any of the possible solutions containing the possible signifier match;
(iii) comparing the upper probability bound for the possible solutions with a threshold probability; and
(iv) eliminating from the set of possible match solutions those solutions including the possible signifier match if the upper probability bound for the possible solutions is less than the threshold probability.
2. A method as claimed in claim 1, in which the upper bound is determined according to Bayesian probability theory.
3. A method as claimed in claim 1, in which the step of identifying a possible signifier match is repeated so as to identify all possible signifier matches between a first signifier in a first string and each signifier in each of the other plurality of strings.
4. A method as claimed in claim 3, in which the step of identifying a possible signifier match is repeated for each signifier in each of the plurality of strings so as to identify all possible signifier matches.
5. A method as claimed in claim 3 or claim 4, and including the step of repeating steps (iii) to (iv) for all the possible signifier matches identified.
6. A method as claimed in claim 1, in which only an identical signifier is a possible signifier match.
7. A method as claimed in claim 1, in which more than one signifier is a possible signifier match.
8. A method as claimed in claim 1, and including the step of recalculating the threshold.
9. A method as claimed in claim 1, and including the step of determining whether an acceptable number of possible solutions have been determined.
10. A method as claimed in claim 9, and including the step of recalculating the threshold if an acceptable number of possible solutions have not been determined.
11. A method as claimed in claim 1, in which at least two of the plurality of strings have the same number of signifiers.
12. A method as claimed in claim 11, in which each of the plurality of strings has the same number of signifiers.
13. A method as claimed in claim 1, in which each of the plurality of strings represents a sequence of DNA, or a sequence of amino acids that comprise a protein or proteins, or glycoconjugates .
14. A method as claimed in claim 1, in which each signifier represents a base.
15. A computer system for determining the degree of match between a plurality of strings of signifiers, comprising processing means, the processing means operating on data representing a plurality of strings of signifiers to:
(i) identify a possible signifier match between two signifiers in different strings;
(ii) determine an upper bound to the probability of any of the possible solutions containing the possible signifier match;
(iii) compare the upper probability bound for the possible solutions with a threshold probability; and
(iv) eliminate from the set of possible match solutions those solutions including the possible signifier match if the upper probability bound for the possible solutions is less than the threshold probability.
16. A computer program executable on a computer to carry out a method as claimed in claim 1.
17. A computer program executable on a computer to provide a system as claimed in claim 15.
18. A computer readable medium bearing instructions executable on a computer to carry out a method as claimed in claim 1.
19. Computer program code, including instructions to carry out a method as claimed in claim 1.
PCT/GB2001/000631 2000-02-16 2001-02-16 Sequence matching WO2001061557A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001233858A AU2001233858A1 (en) 2000-02-16 2001-02-16 Sequence matching

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
PCT/GB2000/000492 WO2000049527A1 (en) 1999-02-19 2000-02-16 Matching engine
GBPCT/GB00/00492 2000-02-16
GB0020743A GB0020743D0 (en) 2000-08-23 2000-08-23 Sequence matching
GB0020743.1 2000-08-23

Publications (2)

Publication Number Publication Date
WO2001061557A2 true WO2001061557A2 (en) 2001-08-23
WO2001061557A3 WO2001061557A3 (en) 2003-12-04

Family

ID=26243371

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2001/000631 WO2001061557A2 (en) 2000-02-16 2001-02-16 Sequence matching

Country Status (2)

Country Link
AU (1) AU2001233858A1 (en)
WO (1) WO2001061557A2 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5701256A (en) * 1995-05-31 1997-12-23 Cold Spring Harbor Laboratory Method and apparatus for biological sequence comparison
US5802525A (en) * 1996-11-26 1998-09-01 International Business Machines Corporation Two-dimensional affine-invariant hashing defined over any two-dimensional convex domain and producing uniformly-distributed hash keys

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RIGOUTSOS I ET AL: "DISTRIBUTED BAYESIAN OBJECT RECOGNITION" PROCEEDINGS OF THE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION. NEW YORK, JUNE 15 - 18, 1993, LOS ALAMITOS, IEEE COMP. SOC. PRESS, US, 15 June 1993 (1993-06-15), pages 180-186, XP000416313 *

Also Published As

Publication number Publication date
WO2001061557A3 (en) 2003-12-04
AU2001233858A1 (en) 2001-08-27

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 EP: PCT application non-entry in European phase
NENP Non-entry into the national phase

Ref country code: JP