WO2001026044A1 - A method of comparing the closeness of a target tree to other trees using noisy subsequence tree processing - Google Patents
- Publication number
- WO2001026044A1 (PCT/CA2000/001107)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- tree
- const
- trees
- noisy
- edit
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/196—Recognition using electronic means using sequential comparisons of the image signals with a plurality of references
- G06V30/1983—Syntactic or structural pattern recognition, e.g. symbolic string recognition
- G06V30/1988—Graph matching
Definitions
- the present invention relates to methods for pattern recognition, wherein the identity of the parent can be determined from a "noisy" fragment thereof. Subject matter that is identifiable by this method is such that it can be represented by a tree notation.
- Trees are a fundamental data structure in computer science.
- a tree is, in general, a structure which stores data and it consists of atomic components called nodes and branches.
- the nodes have values which relate to data from the real world, and the branches connect the nodes so as to denote the relationship between the pieces of data resident in the nodes.
- no edges of a tree constitute a closed path or cycle. Every tree has a unique node called a "root".
- the branch from a node toward the root points to the "parent" of the said node.
- The branch of the node away from the root points to the "child" of the said node.
- the tree is said to be ordered if there is a left-to-right ordering for the children of every node.
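The terminology above (nodes, branches, root, parent, child, left-to-right ordering of children) can be illustrated with a small data structure. This is only a sketch for the reader; the class name `Node` and its fields are assumptions introduced here and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a labeled ordered tree (hypothetical helper, not from the specification)."""
    label: str
    # Children are kept in a list, so their left-to-right order is preserved.
    children: List["Node"] = field(default_factory=list)
    # The branch toward the root; None marks the unique root.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add_child(self, child: "Node") -> "Node":
        """Attach `child` as the rightmost child and record its parent."""
        child.parent = self
        self.children.append(child)
        return child

    def root(self) -> "Node":
        """Follow parent branches until the root is reached."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node


# Example: root "a" with ordered children "b" and "c"; "b" has one child "d".
a = Node("a")
b = a.add_child(Node("b"))
c = a.add_child(Node("c"))
d = b.add_child(Node("d"))
```

Because every node has at most one parent and the parent links end at a single root, no closed path can arise in a structure built this way.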
- Trees have numerous applications in various fields of computer science including artificial intelligence, data modeling, pattern recognition, and expert systems. In all of these fields, the tree structures are processed by using operations such as deleting their nodes, inserting nodes, substituting node values, pruning sub-trees from the trees, and traversing the nodes in the trees. When more than one tree is involved, operations that are generally utilized involve the merging of trees and the splitting of trees into multiple subtrees. In many of the applications which deal with multiple trees, the fundamental problem involves that of comparing them.
- Trees, graphs, and webs are typically considered as a multidimensional generalization of strings. Among these different structures, trees are considered to be the most important
- the tree-editing problem concerns the determination of the distance between two trees as measured by the minimum cost sequence of edit operations.
- the edit sequence considered includes the substitution, insertion, and deletion of nodes needed to transform one tree into the other.
- The measure ⁇ was used to define various numeric quantities between T1 and T2 including (i) the edit distance between two trees, (ii) the size of their largest common sub-tree, (iii) Prob(T2
- this invention provides a method of comparing the closeness of a target tree to other trees located in a database of trees, said method comprising the steps of: (a) calculating a constraint in respect of each tree in the database based on an estimated number of edit operations and a characteristic of the target tree; (b) calculating a constrained tree edit distance between the target tree and each tree in the database using the constraint obtained in step (a); and (c) comparing the calculated constrained tree edit distances.
- this invention provides a method of matching a target tree representable structure to its closest tree representable structure, said method comprising the steps: (a) generating one or more target trees for a target structure; (b) calculating a constraint in respect of each tree in the database based on an estimated number of edit operations and a characteristic of the target tree; (c) calculating a constrained tree edit distance between the target tree and each tree in the library using the constraint obtained in step (b) and the intersymbol edit distance; (d) comparing the calculated constrained tree edit distances; and (e) reporting the tree in the database that has the smallest constrained tree distance.
- the method of this invention comprises a series of nested algorithms.
- A schematic representation of the overall algorithm is presented in Figure 8. This algorithm invokes further algorithms, schematic representations of which are presented in Figures 9 to 18.
- Figure 1 presents an example of a tree X*, U, one of its Subsequence Trees, and Y which is a noisy version of U.
- the noisy Subsequence Tree (NSuT) Recognition problem involves recognizing X* from Y.
- Figure 2 presents an example of the insertion of a node in a tree.
- Figure 3 presents an example of the deletion of a node in a tree.
- Figure 4 presents an example of the substitution of a node by another in a tree.
- Figure 5 presents an example of a mapping between two labeled ordered trees.
- Figure 6 demonstrates a tree from the finite dictionary H. Its associated list representation is as follows:
- Figure 7 presents the left-to-right postorder tree representation of a list obtained from a string.
- Figure 8 presents a schematic diagram showing the Process RecognizeSubsequenceTrees used to solve the noisy Subsequence Tree Recognition Problem.
- the input comprises (1) the finite dictionary, H, (2) Y, a noisy version of a subsequence tree of an unknown X* in H, and (3) L, the expected number of substitutions in Y.
- The output comprises the estimate X+ of X*. If L is not a feasible value, L_p is the closest feasible integer.
- The set of elementary edit distances {d(.,.)} is assumed global.
- Figure 9 is a schematic diagram showing the Process Constrained Tree Distance.
- The input comprises the array Const_T_Wt[.,.,.] computed using Process T_Weights, and the constraint τ, given as a set of the number of substitutions used in the constrained editing process.
- The output comprises the constrained distance D_τ(T1, T2).
- Figure 10 is a schematic diagram showing the Process T_Weights.
- The input comprises Trees T1 and T2 and the Set of Elementary Edit Distances.
- the output comprises
- Figure 11 is a schematic diagram showing the Process Preprocess For TWeights.
- The input comprises Trees T1 and T2.
- The outputs are the δ[] and Essential_Nodes[] for both trees.
- Figure 12 is a schematic diagram showing the Process Compute_Const_T_Wt.
- The input comprises the indices i, j and the quantities assumed global in T_Weights.
- The output comprises the array Const_T_Wt[i1, j1, s], δ1(i) ≤ i1 ≤ i, δ2(j) ≤ j1 ≤ j, 0 ≤ s ≤ Min{Size(i), Size(j)}.
- Figure 13 is a schematic diagram showing the steps of the Process Compute_Const_T_Wt subsequent to those shown in Figure 12.
- Figure 14 is a schematic diagram showing the steps of the Process Compute_Const_T_Wt subsequent to those shown in Figure 13.
- Figure 15 is a schematic diagram showing the steps of the Process Compute_Const_T_Wt subsequent to those shown in Figure 14.
- Figure 16 is a schematic diagram showing the steps of the Process Compute_Const_T_Wt subsequent to those shown in Figure 15.
- Figure 17 is a schematic diagram showing the steps of the Process Compute_Const_T_Wt subsequent to those shown in Figure 16.
- Figure 18 is a schematic diagram showing the steps of the Process Compute_Const_T_Wt subsequent to those shown in Figure 17.
- Figure 19 is a schematic diagram showing how the invention can be used in the recognition of Ribonucleic Acid (RNA) molecules from their noisy fragments. Since an RNA molecule can be directly represented as a tree structure, the recognition of the RNA molecule from its fragment is a straightforward application of the solution to the NSuT problem.
- RNA Ribonucleic Acids
- Figure 20 is a schematic diagram showing how the invention can be used in the recognition of chemical compounds, represented in terms of their molecules, from their noisy fragments. Since chemical compounds are drawn as graphs, each compound is first mapped into a set of representative tree structures. Similarly, the noisy fragment of the compound is also mapped into a set of representative tree structures. The compound recognition is achieved by invoking the solution to the NSuT problem between the various tree representations of each compound and the tree representations of the noisy fragment.
- Figure 21 is a schematic diagram showing how the invention can be used in the recognition of chemical compounds, represented in terms of their atomic structure, from their noisy fragments. Since chemical compounds are drawn as graphs, each compound is first mapped into a set of representative tree structures, where the nodes are the atoms. Similarly, the noisy fragment of the compound is also mapped into a set of representative tree structures. The compound recognition is achieved by invoking the solution to the NSuT problem between the various tree representations of each compound and the tree representations of the noisy fragment.
- Figure 22 is a schematic diagram showing how the invention can be used in the recognition of fingerprints.
- The fingerprints are characterized by their minutiae.
- The recognition is achieved from a noisy portion of the fingerprint sought for. Since numerous minutiae representations of each fingerprint are possible, each fingerprint is first mapped into a set of representative tree structures. Similarly, the noisy fragment of the fingerprint is also mapped into a set of representative tree structures.
- The fingerprint recognition is achieved by invoking the solution to the NSuT problem between the various tree representations of each fingerprint and the tree representations of the noisy fragment.
- Figure 23 is a schematic diagram showing how the invention can be used in the recognition of maps.
- the recognition is achieved from a noisy portion of the map sought for. Since numerous tree representations of each map are possible, each map is first mapped into a set of representative tree structures. Similarly, the noisy fragment of the map sought for is also mapped into a set of representative tree structures.
- the map recognition is achieved by invoking the solution to the NSuT problem between the various tree representations of each map and the tree representations of the noisy fragment.
- Figure 24 is a schematic diagram showing how the invention can be used in the recognition of electronic circuitry.
- the recognition is achieved from a noisy portion of an electronic circuit sought for.
- the nodes in this case are the various electronic components such as resistors, diodes, transistors, capacitors etc. Since numerous tree representations of each electronic circuit are possible, each electronic circuit is first mapped into a set of representative tree structures. Similarly, the noisy fragment of the electronic circuit sought for is also mapped into a set of representative tree structures.
- the electronic circuitry recognition is achieved by invoking the solution to the NSuT problem between the various tree representations of each electronic circuit and the tree representations of the noisy fragment.
- Figure 25 is a schematic diagram showing how the invention can be used in the recognition of flow charts.
- the recognition is achieved from a noisy portion of a flow chart sought for.
- the nodes in this case are the various symbols used in flow charting such as assignments, loops, comparisons, control structures etc. Since numerous tree representations of each flow chart are possible, each flow chart is first mapped into a set of representative tree structures. Similarly, the noisy fragment of the flow chart sought for is also mapped into a set of representative tree structures.
- the flow chart recognition is achieved by invoking the solution to the NSuT problem between the various tree representations of each flow chart and the tree representations of the noisy fragment.
- Figure 26 presents the "confusion matrix" (Table I) with the probabilities of substituting a character with another character.
- the figures in the table are to be multiplied by a factor of 10
- Figure 27 presents Table II displaying examples of the original trees, the associated subsequence trees and their noisy versions.
- Figure 28 presents Table III, describing a subset of the trees used for Data Set A and their noisy subsequence trees.
- The trees and subsequence trees are represented as parenthesized lists.
- Figure 29 presents Table VI, describing a subset of the trees used for Data Set B and their noisy subsequence trees.
- the trees and subsequence trees are represented as parenthesized lists.
- The original unparenthesized strings are the same as those used in Oommen, IEEE Trans. Pattern Anal. and Mach. Intell., Vol. PAMI 9, No. 5:676-685, (1987) and were obtained from Hall and Dowling, Comput. Surv., Vol 12:381-402 (Dec 1980).
- Figure 30 presents a typical example of a bacterial phylogenetic tree displaying the differences between Bacteria and Archaea.
- This invention provides a method of comparing the closeness of a target tree to other trees, wherein the target tree can optionally be a noisy sub-fragment of the other trees.
- the tree is provided by a user and the trees to be compared are located in a database.
- The invention applies the process of constrained tree editing to tree structures derived from the target tree and to at least one tree representation of every structure stored in the database.
- the method can also be applied to strings, wherein a string is considered a tree in which each parent node has exactly one child.
- the method of this invention is based on the assumption that there is some connection between the target tree and one or more trees located in the database.
- The target could be unrelated but similar, it could be a subfragment of a parent tree located in the database, or it could be a noisy subfragment of a parent located in the database.
- Since a string can be considered as a tree in which each parent node has exactly one child, the method can also be applied to string problems by representing the string as a tree.
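The observation that a string is a tree in which each parent has exactly one child can be made concrete as follows. This is a minimal sketch: the `(label, children)` tuple representation and the function names are assumptions made for illustration, as is the convention that the last symbol of the string becomes the root, chosen so that a postorder traversal of the chain reads the string back.

```python
def string_to_tree(s):
    """Build a unary chain whose postorder traversal spells `s`.

    Trees are (label, children) tuples; None stands for the null tree.
    Both conventions are assumptions of this sketch.
    """
    tree = None
    for symbol in s:  # the first symbol ends up as the deepest leaf
        tree = (symbol, [] if tree is None else [tree])
    return tree


def postorder(tree):
    """Return the node labels in left-to-right postorder."""
    if tree is None:
        return []
    label, children = tree
    labels = []
    for child in children:
        labels.extend(postorder(child))
    labels.append(label)
    return labels


chain = string_to_tree("abc")
recovered = "".join(postorder(chain))
```

Any procedure written for trees in this representation can then be applied to strings by converting them with `string_to_tree` first.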
- the versatility of the method of this invention derives from the fact that the Noisy Subsequence Tree Recognition problem is applied in each of these circumstances to compare the closeness of a target tree to other trees located in a database of trees.
- each item is first mapped into a set of representative tree structures. These tree representations are stored in a database means.
- the noisy fragment of the item sought for is also mapped into a set of representative tree structures.
- The overall pattern recognition is achieved by invoking the solution to the NSuT problem between the various tree representations of each item and the tree representations of the noisy fragment for which the parent identity is being sought. Since the graph-to-tree manipulations are straightforward, the key of the invention involves the solution of the Noisy Subsequence Tree Recognition Problem described below, which involves recognizing a tree, X*, which is an element of a dictionary or database of trees. The latter recognition is achieved by processing the information contained in a tree, Y, which, in turn, is a noisy (garbled or inexact) version of U, one of the subsequence trees of X*.
- Oommen (IEEE Trans. Pattern Anal. and Mach. Intell., PAMI 9: pp. 676-685 (1987)) devised an algorithm for the recognition of noisy subsequences from strings, which was achieved by evaluating the inter-string constrained edit distance.
- The results reported for solving the NSuT problem are not mere extensions of the corresponding string editing and recognition problem. This is because, unlike in the case of strings, the topological structure of the underlying graph prohibits the two-dimensional generalizations of the corresponding computations. Indeed, inter-tree computations require the simultaneous maintenance of meta-tree considerations represented as the parent and sibling properties of the respective trees, which are completely ignored in the case of linear structures such as strings. This further justifies the intuition that not all "string properties" generalize naturally to their corresponding "tree properties".
- The present invention has broad application to problems which involve strings, substrings and subsequences.
- the current invention addresses the problem of recognizing trees by processing the information resident in their (noisy) subsequence trees. But if it is observed that a string is itself a tree in which each parent node has exactly one child, the current invention can be directly applied to the corresponding problems involving strings.
- Although the mappings between the problems from the tree-domain to the string-domain are straightforward, the following examples (in post-order notation) are catalogued so as to clarify their instantiations.
- the present invention essentially represents a single generic solution for all (noisy) string, substring and subsequence recognition algorithms, while it simultaneously can be used as a generic solution to all (noisy) tree, subtree and subsequence tree recognition problems.
- the invention pertains to the recognition of subject matter which can be described as a planar or non-planar graph in two dimensions using nodes and edges. Items constituting such subject matter are called Tree Representable Structures.
- a "tree representable structure", as referred to herein, is any structure which can be represented using nodes and edges in a tree structure.
- Each item of this subject matter can be represented in a tree structure by extracting from the graph an underlying spanning tree as explained in Aho et al. (The Design and Analysis of Computer Algorithms, Addison Wesley, Reading : MA, (1974)) and by Cormen et al. (Introduction to Algorithms, The MIT Press, Cambridge : MA, (1989)).
- The items need not be two-dimensional. Rather, they must be representable as two-dimensional graphs which may be planar or non-planar.
- the parent of any "noisy" fragment of any of these tree structures can be identified using the method of this invention.
- Examples of items that can be described in two dimensions are map-like structures such as RNA molecules or parts thereof, plans, designs, chemical compounds described in their molecular structures, chemical compounds described in their atomic structures, drawings, electronic circuits, fingerprints, and flowcharts.
- The recognition of all of these items in their particular application domain utilizes the solution of the Noisy Subsequence Tree Recognition (NSuT) problem described presently, which indeed constitutes a central kernel of the invention.
- NSuT Noisy Subsequence Tree Recognition
- a two-dimensional tree representation is readily generated from a general pattern, or structure.
- a preprocessing system would typically extract the characterizing features which would be used in the recognition.
- the features in this application domain are referred to as the minutiae.
- the relationship between the minutiae can be represented using edges.
- the resulting structure is a tree in which the nodes are the minutiae themselves and the edges represent the proximity between them.
- A pattern would be string-representable if the pattern can be described as a sequence of individual symbols in a left-to-right manner or a right-to-left manner. If, apart from such a linear description, the sequence of symbols also possesses a parent-child relationship, the pattern is tree-representable. It should be observed that whereas a string-representable pattern obeys only a sibling property between the symbols, a tree-representable pattern possesses both a sibling property and a parent-child property.
- a graph is a two-dimensional structure consisting of nodes (also called vertices) and edges.
- the edges can be represented as lines between the nodes.
- Each node possesses a node content which consists of its identity and a value that it contains.
- The graph can be stored in terms of its adjacency matrix, which is a two-dimensional matrix which has an entry in its <i, j> position to indicate that there is an edge between nodes i and j.
- The graph can alternatively be stored in terms of adjacency lists, with a one-dimensional list stored for every node.
- the list for node i has an entry j to indicate that there is an edge between nodes i and j in the graph.
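The two storage schemes just described can be compared on a small example. The sketch below assumes nodes numbered from 0 and an undirected graph (so the matrix is symmetric); the function name `matrix_to_adjacency_lists` and the sample graph are illustrative assumptions, not part of the specification.

```python
def matrix_to_adjacency_lists(matrix):
    """Convert an adjacency matrix into one adjacency list per node.

    matrix[i][j] is truthy when there is an edge between nodes i and j.
    """
    n = len(matrix)
    return {i: [j for j in range(n) if matrix[i][j]] for i in range(n)}


# Hypothetical graph with edges 0-1, 0-2, 1-2 and 2-3.
adjacency_matrix = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
adjacency_lists = matrix_to_adjacency_lists(adjacency_matrix)
```

The matrix answers "is there an edge between i and j?" in constant time, while the lists make it cheap to enumerate the neighbours of a node, which is the operation a traversal needs.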
- A preferred method for deriving a spanning tree from a graph entails starting from an arbitrary node (for example, i) and marking it as a "visited" node. The method would then recursively visit all the adjacent nodes of the most recently visited node which have not yet been marked "visited", and retain the edge between them. The tree would consist of the entire set of nodes and the retained edges.
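The recursive visiting procedure described above is a depth-first traversal, and can be sketched as follows. The function name `spanning_tree`, the dictionary-of-lists input, and the sample graph are assumptions made for the example; a connected graph is assumed, and for very large graphs the recursion would need to be replaced by an explicit stack.

```python
def spanning_tree(adjacency, start):
    """Derive a spanning tree by recursive visiting from `start`.

    adjacency maps each node to the list of its adjacent nodes.
    Returns the set of visited nodes and the list of retained edges,
    each edge written as (parent, child).
    """
    visited = {start}          # mark the starting node as visited
    retained_edges = []

    def visit(node):
        for neighbour in adjacency[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                retained_edges.append((node, neighbour))
                visit(neighbour)

    visit(start)
    return visited, retained_edges


# Hypothetical graph with edges 0-1, 0-2, 1-2 and 2-3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
nodes, edges = spanning_tree(graph, 0)
```

Starting from a different node, or listing the neighbours in a different order, generally yields a different spanning tree of the same graph, which is why one item can give rise to many tree representations.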
- Trees can be generated for different perspectives of subject matter that is a three-dimensional structure. Obviously this factor could generate an almost infinite number of permutations for comparing a target tree to the database of trees.
- This problem can be addressed by considering that the problem of determining an appropriate tree representation for a tree-representable structure is problem-dependent. It is by no means trivial. Since each item can be mapped into numerous spanning trees, and each target can, in turn, be mapped into numerous spanning trees, the optimal recognition based on the criterion of minimizing the constrained edit distance would involve comparing a tree representation of the target item to all the tree representations of the items. Since the number of tree representations of the item is prohibitively large, it is therefore expedient to use just a representative set of so-called perspective trees for each pattern.
- The subset of trees used to represent an item can be chosen using any criterion.
- One criterion could be to use the representations which are most "stringy", in which each node has the minimum number of children.
- Other criteria could involve the representation that is the maximum/minimum spanning tree of the graph, where the edge-weights could be the functions of the node values themselves.
- An alternate method for achieving the pattern recognition would be that of comparing a single tree representation of the target with a small subset of tree representations of the items. If the associated constrained distance between a tree representation of an item and the representation of the target is greater than a user-specified threshold value, the computation would request a new tree representation of the target.
- Generating a dictionary from a set of tree structures is a straightforward task which is well known to one skilled in the art, and which involves storing the trees in their left-to-right post-order parenthesized representations as explained in Aho et al. (The Design and Analysis of Computer Algorithms, Addison Wesley, Reading : MA, (1974)) and by Cormen et al. (Introduction to Algorithms, The MIT Press, Cambridge : MA, (1989)). A tree and its corresponding left-to-right postorder tree representation are given in Figure 6.
- the stored dictionary would contain the parenthesized representations of all the trees.
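One way a dictionary of parenthesized postorder representations could be built is sketched below. The exact list notation of Figure 6 is not reproduced in the text, so the format used here (each subtree wrapped in parentheses, children written first from left to right, followed by the node label) is an assumption, as are the `(label, children)` tuple representation and the function names.

```python
def to_parenthesized(tree):
    """Serialize a (label, children) tree: children first, then the label."""
    label, children = tree
    return "(" + "".join(to_parenthesized(c) for c in children) + label + ")"


def from_parenthesized(text):
    """Inverse of to_parenthesized; labels must not contain parentheses."""
    pos = 0

    def parse():
        nonlocal pos
        if text[pos] != "(":
            raise ValueError("expected '(' at position %d" % pos)
        pos += 1
        children = []
        while text[pos] == "(":      # each nested list is one child subtree
            children.append(parse())
        start = pos
        while text[pos] != ")":      # the label follows the children
            pos += 1
        label = text[start:pos]
        pos += 1                     # consume the closing parenthesis
        return (label, children)

    tree = parse()
    if pos != len(text):
        raise ValueError("trailing characters after the tree")
    return tree


# Hypothetical tree: root "a" with children "b" and "c"; "b" has a child "d".
example = ("a", [("b", [("d", [])]), ("c", [])])
stored_dictionary = [to_parenthesized(example)]
```

Storing each tree as a single string keeps the dictionary compact, and the trees can be rebuilt on demand with `from_parenthesized`.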
- the invention also considers other straightforward data structure representations of trees.
- The invention also considers the trivial extension where the right-to-left postorder ordering of the nodes of the tree is used instead of its left-to-right postorder ordering.
The Solution to NSuT Recognition
- The methodology involves sequentially comparing Y with every element X of H, the basis of comparison being the constrained edit distance between two trees as defined by Oommen and Lee (Information Sciences, Vol. 77, pp. 253-273 (1994)).
- the actual constraint used in evaluating the constrained distance can be any arbitrary edit constraint involving the number and type of edit operations to be performed.
- a specific constraint which implicitly captures the properties of the corrupting mechanism ("channel") which noisily garbles U into Y is used.
- The algorithm which incorporates this constraint has been used to test the pattern recognition system, yielding high accuracy.
- Experimental results for the NSuT recognition problem which involve manually constructed trees of sizes between 25 and 35 nodes and which contain an average of 21.8 errors per tree demonstrate that the scheme has about 92.8% accuracy. Similar experiments for randomly generated trees yielded an accuracy of 86.4%.
- Let N be an alphabet and N* be the set of trees whose nodes are elements of N.
- Let μ be the null tree, which is distinct from λ, the null label not in N.
- A tree T ∈ N* is said to be of size M.
- The tree will be represented in terms of the left-to-right postorder numbering of its nodes. The advantages of this ordering are catalogued by Zhang and Shasha (SIAM J. Comput. (1989)).
- the invention also considers the trivial extension where the right-to-left postorder numbering of the nodes of the tree is used instead of the left-to-right postorder numbering of the nodes.
- Let T[i] be the i-th node in the tree according to the left-to-right postorder numbering, and let δ(i) represent the postorder number of the leftmost leaf descendant of the subtree rooted at T[i].
- When T[i] is a leaf, δ(i) = i.
- T[i..j] represents the postorder forest induced by nodes T[i] to T[j] inclusive, of tree T.
- T[δ(i)..i] will be referred to as Tree(i).
- Size(i) is the number of nodes in Tree(i).
- the father of i is denoted as f(i).
- f^0(i) = i
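The quantities defined above (the postorder numbering T[i], the leftmost leaf descendant δ(i), Size(i) and the father f(i)) can all be obtained in one traversal. The following is a sketch under stated assumptions: trees are `(label, children)` tuples, index 0 of each array is unused so that numbering starts at 1 as in the text, and the father of the root is reported as 0; none of these conventions is prescribed by the specification.

```python
def postorder_index(tree):
    """Number the nodes in left-to-right postorder.

    Returns (labels, delta, father), all 1-indexed:
      labels[i] is the label of T[i],
      delta[i]  is the postorder number of the leftmost leaf descendant of T[i],
      father[i] is the postorder number of the father of T[i] (0 for the root).
    """
    labels, delta, father = [None], [0], [0]

    def visit(node):
        label, children = node
        child_ids = [visit(child) for child in children]
        labels.append(label)
        delta.append(0)
        father.append(0)
        i = len(labels) - 1
        # A leaf is its own leftmost leaf; otherwise inherit from the first child.
        delta[i] = delta[child_ids[0]] if child_ids else i
        for c in child_ids:
            father[c] = i
        return i

    visit(tree)
    return labels, delta, father


def size(delta, i):
    """Size(i): Tree(i) is T[delta(i)..i], so it holds i - delta(i) + 1 nodes."""
    return i - delta[i] + 1


# Hypothetical tree: root "a" with children "b" and "c"; "b" has a child "d".
labels, delta, father = postorder_index(("a", [("b", [("d", [])]), ("c", [])]))
sizes = [size(delta, i) for i in range(1, len(labels))]
```

For this example the postorder numbering is d = 1, b = 2, c = 3, a = 4, so Tree(2) is the forest T[1..2] and Tree(4) is the whole tree.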
- An edit operation on a tree is either an insertion, a deletion or a substitution of one node by another.
- Node x will be inserted as a son of some node u of T. It may either be inserted with no sons or take as sons any subsequence of the sons of u. If u has sons u_1, u_2, ..., u_k, then for some 0 ≤ i ≤ j ≤ k, node u in the resulting tree will have sons u_1, ..., u_i, x, u_j, ..., u_k, and node x will have no sons if j = i+1, or else have sons u_{i+1}, ..., u_{j-1}.
- This edit operation is shown in Figure 2.
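The insertion operation can be sketched directly from this description. The `(label, children)` representation with a mutable children list and the function name `insert_node` are assumptions of the example. The bounds check `0 <= i < j <= k + 1` is this sketch's own reading, chosen so that x always lands between u_i and u_j and can also become the rightmost son; it is not taken verbatim from the text.

```python
def insert_node(u, x_label, i, j):
    """Insert a node labeled x_label as a son of u.

    Sons of u are numbered 1..k as in the text. The new node adopts the
    sons u_{i+1} .. u_{j-1}; it has no sons when j == i + 1.
    Returns the newly created node.
    """
    _, children = u
    k = len(children)
    if not 0 <= i < j <= k + 1:
        raise ValueError("need 0 <= i < j <= k + 1")
    adopted = children[i:j - 1]     # u_{i+1} .. u_{j-1} in 0-based slicing
    x = (x_label, adopted)
    children[i:j - 1] = [x]         # u now has u_1..u_i, x, u_j..u_k
    return x


# Hypothetical node u with four sons a, b, c, d.
u = ("u", [("a", []), ("b", []), ("c", []), ("d", [])])
x = insert_node(u, "x", 1, 4)       # x adopts b and c

# Inserting with j == i + 1 adds a leaf without moving any existing son.
v = ("v", [("p", []), ("q", [])])
leaf = insert_node(v, "y", 1, 2)
```

Deletion is the inverse operation: removing x would hand its sons back to u in the position x occupied.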
- d(x, λ) and d(λ, y) will represent the cost of deletion and insertion of node x and y respectively.
- the distances d(.,.) obey :
- $D(T_1, T_2) = \min \{\, W(S) \mid S \text{ is an } S\text{-derivation transforming } T_1 \text{ to } T_2 \,\}$.
- A mapping between trees is a description of how a sequence of edit operations transforms T1 into T2.
- A pictorial representation of a mapping is given in Figure 5. Informally, in a mapping the following holds :
- A mapping is a triple (M, T1, T2), where M is any set of pairs of integers (i, j) satisfying :
- T1[i1] is an ancestor of T1[i2] if and only if T2[j1] is an ancestor of T2[j2] (the Ancestor Property).
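The Ancestor Property can be checked mechanically once the father of every node is known. The sketch below checks only this one property; the other conditions a mapping must satisfy are not reproduced in the text above and are not checked here. The father arrays are assumed to be 1-indexed by postorder number with 0 as the father of the root, and "ancestor" is taken to mean proper ancestor; the function names are illustrative.

```python
def is_ancestor(father, a, b):
    """True when node a is a proper ancestor of node b."""
    b = father[b]
    while b != 0:
        if b == a:
            return True
        b = father[b]
    return False


def satisfies_ancestor_property(M, father1, father2):
    """Check the Ancestor Property for every two pairs of the mapping M."""
    pairs = list(M)
    return all(
        is_ancestor(father1, i1, i2) == is_ancestor(father2, j1, j2)
        for (i1, j1) in pairs
        for (i2, j2) in pairs
    )


# T1: root a(4) with sons b(2) and c(3); b has the son d(1).
father1 = [0, 2, 4, 4, 0]
# T2: root x(2) with the single son y(1).
father2 = [0, 2, 0]

ok = satisfies_ancestor_property({(2, 1), (4, 2)}, father1, father2)
bad = satisfies_ancestor_property({(2, 1), (3, 2)}, father1, father2)
```

The first mapping sends b to y and a to x, preserving the ancestor relation; the second sends the siblings b and c to a father-son pair, which violates it.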
- Since mappings can be composed to yield new mappings (see Tai (J. ACM, Vol 26, pp 422-433 (1979)), and Zhang and Shasha (SIAM J. Comput. Vol. 18, No. 6 : pp. 1245-1262 (1989))), the relationship between a mapping and a sequence of edit operations can now be specified.
- Figure 8 commences in block 100 with the input being presented.
- The input consists, first of all, of the Dictionary, H. It also includes L, the expected number of feasible substitutions caused in the garbling process for the particular problem domain. Finally, it includes the noisy subsequence tree, Y, which is used to determine the parent from which its ungarbled version was obtained.
- A decision is first made in block 110 determining if there are any more trees in H. If the answer to this query is "no", the estimate X+ of X* is printed in block 180. If there are more trees in H, control is given to block 120, where an assignment to X is made of the next tree in H.
- A decision is then made in block 130 as to whether L is a feasible value. If it is not, then the closest feasible integer to L is assigned to L_p. This occurs in block 150. If the decision from 130 is "yes", then L_p is assigned the value L at block 140. Another assignment is made in block 160 to τ, which is assigned to be a small set of integers around L_p. In the absence of any other information, the best estimate of the number of substitutions that could have taken place is indeed its expected value, L, which is usually close to the size of the NSuT, Y. In the examples shown in this submission, this is set to be
- PROCESS RecognizeSubsequenceTrees Input 1.
- L the expected number of substitutions that took place in the transmission. In the examples shown in this submission, this is set to be
- EndIf {τ is a superset of {L_p}, marginally larger than {L_p}}
- End. X+ is the tree minimizing D_τ(X, Y)
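The control flow of Process RecognizeSubsequenceTrees, as described for Figure 8, can be sketched as the loop below. Several points are assumptions of this sketch rather than statements of the specification: a value of L is treated as feasible when it lies between 0 and Min{Size(X), Size(Y)} (inferred from the bound on s given for Figure 12), τ is taken to be the feasible integers within `radius` of L_p, and the constrained distance itself is not implemented but supplied by the caller through the hypothetical parameter `constrained_distance`.

```python
def recognize_subsequence_tree(dictionary, y, expected_subs, size,
                               constrained_distance, radius=1):
    """Return (X+, distance): the dictionary tree closest to the noisy tree y.

    dictionary            iterable of trees (the finite dictionary H)
    y                     the noisy subsequence tree Y
    expected_subs         L, the expected number of substitutions
    size                  callable giving the number of nodes of a tree
    constrained_distance  callable (x, y, tau) -> D_tau(x, y); a stand-in for
                          Process Constrained Tree Distance (assumed, not
                          implemented here)
    radius                how far tau extends around L_p (assumption)
    """
    best_tree, best_distance = None, float("inf")
    for x in dictionary:
        upper = min(size(x), size(y))
        # Closest feasible integer to L under the feasibility assumption above.
        l_p = min(max(expected_subs, 0), upper)
        tau = {s for s in range(l_p - radius, l_p + radius + 1)
               if 0 <= s <= upper}
        distance = constrained_distance(x, y, tau)
        if distance < best_distance:
            best_tree, best_distance = x, distance
    return best_tree, best_distance
```

Keeping the distance computation behind a callable makes the sketch independent of how Const_T_Wt is computed; any implementation of the constrained tree edit distance with this signature could be plugged in.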
- the Process RecognizeSubsequenceTrees invokes the Process Constrained Tree Distance shown in Figure 9.
- Figure 9 starts off in block 200 by reading in the array
Process T_Weights
- The Process Constrained Tree Distance first invokes the Process T_Weights shown in Figure 10.
- Figure 10 begins with an Input/Output block (block 300) where the Trees T1 and T2 and the set of elementary edit distances are read into the system. It then invokes the
- Input Trees T1 and T2 and the set of elementary edit distances.
- Output Const_T_Wt[i, j, s], 1 ≤ i ≤
Process Preprocess
- The Process T_Weights first invokes the Process Preprocess shown in Figure 11.
- Figure 11 starts off with a sequence of Input/Output operations in block 400 where both the trees T1 and T2 are read in. Subsequently, in block 410 the δ[] and Essential_Nodes[] for both trees are calculated. Finally, these two variables are stored back into the system in block 420, before returning in 430.
- The Process Preprocess is straightforward, and so its formal algorithmic description is omitted.
- Figure 12 commences in block 500 with an Input/Output operation where the indices i and j and the quantities assumed global in the Process T_Weights are read in.
- A decision in block 520 is invoked to determine if x1 is less than or equal to N; this is the beginning of a loop. If it is not, control is passed to block 560, where y1 is initialized to be 1. Control is then passed to block 590, which leads to the next phase of this method. If the decision in block 520 is "yes", the assignments :
- Figure 13 continues where Figure 12 left off, starting with a decision in block 600, where a test is invoked to see if y1 is less than or equal to M. This initiates another loop. If it is not, control is passed on to block 640 explained presently.
- Const_F_Wt[0][y1][0] = Const_F_Wt[0][y1 - 1][0] + d(λ, T2[y1 + b2]), and
- Const_T_Wt[0][y1 + b2][0] = Const_F_Wt[0][y1][0] in blocks 610 and 620 respectively are done.
- The process carries out an assignment of s to 1 in block 640. This is followed by the decision to test if s is less than or equal to R in block 650, which constitutes the entry of another loop. If the answer to block 650 is "no", control is passed to block 690, which is the next phase of this process. If the answer is "yes", the following assignments are made:
- Const_F_Wt[0][0][s] ← ∞
- Const_T_Wt[0][0][s] ← Const_F_Wt[0][0][s].
- Figure 14 further develops the Process Compute_Const_T_Wt with an assignment of x1 to 1 in block 700. Thereafter the Process makes a decision in block 710 to test if x1 is less than or equal to N, which initiates a loop. If the decision in block 710 is "no", control is passed to block 780 (which is an assignment of x1 to 1) which then passes control to the next phase of the method, block 790. If, on the other hand, the decision to block 710 is
- The outer loop concludes each iteration by incrementing y1 in block 760 before control is passed back to block 730.
- Figure 14 continues to the subsequent operations of the Process in Figure 15.
- The Process initiates another looping decision block at block 830, which queries if s is less than or equal to R. If the answer to the query in block 830 is "no", control is passed to block 870, where x1 is increased by one and the control flows back to block 800. If the answer to the query in block 830 is "yes", control is passed to blocks 840 and 850 with the following assignments :
- Const_F_Wt[x1][0][s] ← ∞
- Const_T_Wt[x1 + b1][0][s] ← Const_F_Wt[x1][0][s]. This is followed by incrementing s before control is passed back to block 830. The Process continues in Figure 16.
- Figure 16 continues the process where Figure 15 left off.
- Figures 17 and 18 describe parallel sections of the same process, and so they are described together.
- The process first executes block 1000, which evaluates the question "is x1 less than or equal to N?". This block initiates a loop. If the answer to this query is "no", the Process traverses link 13 and proceeds to block 1140, which is the Input/Output block that stores the values of Const_T_Wt[i1, j1, s] for i1 between δ1(i) and i, j1 between δ2(j) and j, and s between 0 and the minimum of [Size(i), Size(j)].
- The control passes to the final return block of the method, which is block 1150.
- y1 is initialized to be 1 in block 1010 before entering another loop.
- The question in the decision block 1020 tests if y1 is less than or equal to M. If the answer to block 1020 is negative, the Process traverses link 11 and proceeds to block 1130 of Figure 18, which increments x1 before backtracking along link 12 to block 1000. If the answer to block 1020's question is in the affirmative, the Process proceeds to block 1030, where s is set to 1 and where the Process resolves another decision at block 1040.
- The question asked at block 1040 is : "Is s less than or equal to R?".
- Const_F_Wt[x1][y1][s] ← Min Const_F_Wt[x1-1][y1-1][s-1]
- Const_T_Wt[x1+b1][y1+b2][s] ← Const_F_Wt[x1][y1][s].
- Const_T_Wt[x1 + b1][y1 + b2][s] is assigned to take the value of Const_F_Wt[x1][y1][s].
- Control passes to block 1110, described earlier, through link 6. This completes the description of this figure, and the entire Process.
- Input: Indices i, j and the quantities assumed global in T_Weights.
- N ← i - δ1(i) + 1; /* size of subtree rooted at T1[i] */
- Const_F_Wt[0][0][0] ← 0; /* Initialize Const_F_Wt */
- Const_F_Wt[x1][y1][0] ← Min { Const_F_Wt[x1-1][y1][0] + d(T1[x1+b1], λ), Const_F_Wt[x1][y1-1][0] + d(λ, T2[y1+b2]) };
- Const_T_Wt[x1+b1][y1+b2][s] ← Const_F_Wt[x1][y1][s]; Else
- The Process RecognizeSubsequenceTrees assumes that the constrained distance subject to a specified constraint set, τ, can be computed. Since τ is fully defined in terms of the number of substitutions required in the comparison, all the information required for the comparison is available in the array Const_T_Wt[.,.,.] computed using the Process T_Weights. Thus, after the array Const_T_Wt[.,.,.] is computed, the distance D_τ(T1, T2) between T1 and T2 subject to the constraint τ can be directly evaluated using the Process Constrained Tree Distance, which essentially minimizes Const_T_Wt over all the values of 's' found in the constraint set.
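The final minimization step can be sketched as follows. This is a minimal sketch, assuming that the entries of Const_T_Wt for the two roots have already been computed and are passed in as a Python list indexed by s; the function name and the use of `math.inf` for infeasible entries are illustrative assumptions and are not part of the original Process.

```python
import math

def constrained_tree_distance(root_row, tau):
    """Minimize Const_T_Wt[|T1|][|T2|][s] over the constraint set tau.

    root_row[s] is assumed to hold the cost of editing T1 to T2 with
    exactly s substitutions (math.inf when no such edit exists).
    """
    feasible = [root_row[s] for s in tau if 0 <= s < len(root_row)]
    # An empty or out-of-range constraint set yields an infinite distance.
    return min(feasible, default=math.inf)
```

Values of s in the constraint set that exceed the size of the smaller tree are simply skipped, since no edit can use that many substitutions.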
- The user has to specify the inter-symbol distances. These distances are typically symbol dependent. The distance associated with an operation decreases with the likelihood of the corresponding error occurring. Thus if it is very likely that the symbol 'a' in the alphabet can be misrepresented by a 'b', the distance d(a, b) is correspondingly small. These probabilities (or likelihoods) are called confusion probabilities, and the inter-symbol distances are usually specified in terms of the negative log-likelihood of one symbol being transformed into (misrepresented by) another. In the absence of such likelihood information, traditional 0/1 distances for equal/non-equal symbols can be utilized.
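The negative log-likelihood rule can be illustrated with a short sketch. The nested-dictionary layout of the confusion probabilities, the function name, and the example probabilities below are assumptions made for illustration; they are not taken from the tables of this document.

```python
import math

def distances_from_confusion(confusion):
    """Map confusion probabilities P(b | a) to distances d(a, b) = -ln P(b | a)."""
    return {
        (a, b): -math.log(p) if p > 0 else math.inf
        for a, row in confusion.items()
        for b, p in row.items()
    }

# Hypothetical two-symbol confusion matrix.
confusion = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}}
d = distances_from_confusion(confusion)
```

A more likely confusion receives a smaller distance, so here d[("b", "a")] is smaller than d[("a", "b")].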
- The distances can be learnt by using a training set of noisy samples whose identities are known. This process is called "training" and is necessary in all pattern recognition problems.
- One possible method of training is explained as follows. The distances associated with deletion and insertion are first set to unity. The distance associated with an equal substitution is then set to zero. Finally, the distance associated with a non-equal substitution is set to a value 'r'. The value of 'r' is chosen so as to maximize the probability of recognition of the samples in the training set. This is easily done in the case of strings as explained by Oommen and Loke (Proceedings of the 1997 IEEE International Conference on Systems, Man and Cybernetics (1997)). In the case of trees, this requires a search over the values of 'r' in the range [0, 2]; a straightforward search of this space at a desired resolution yields a suitable value for 'r'.
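The search over 'r' can be sketched as a simple grid search. In this sketch, `recognition_rate` is a hypothetical callable supplied by the user: it is assumed to run the recognizer on the training set with unit insertion and deletion distances, zero distance for equal substitutions and distance r for non-equal substitutions, and to return the fraction of samples recognized correctly. The default resolution is also an assumption.

```python
def train_r(recognition_rate, resolution=0.1, low=0.0, high=2.0):
    """Return the value of r in [low, high] that maximizes recognition_rate(r)."""
    steps = int(round((high - low) / resolution))
    candidates = [low + k * resolution for k in range(steps + 1)]
    # max() keeps the first candidate in case of ties.
    return max(candidates, key=recognition_rate)
```

A finer resolution gives a better estimate of 'r' at the cost of one full pass over the training set per candidate value.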
- H e ⁇ j
- H s ⁇ j
- H_i, H_e, and H_s are called the sets of permissible values of i, e, and s.
- Theorem I specifies the feasible triples for editing T1[1..r] to T2[1..q].
- An edit constraint is specified in terms of the number and type of edit operations that are required in the process of transforming T1 to T2. It is expressed by formulating the number and type of edit operations in terms of three sets Q_i, Q_e, and Q_s which are subsets of the sets H_i, H_e, and H_s defined above.
- Every edit constraint specified for the process of editing T1 to T2 is a unique subset of H_s.
- The number of substitutions permitted is Q_s ∩ Q_e* ∩ Q_i* ⊆ H_s.
- The optimal transformation must contain exactly 5 substitutions.
- Const_T_Wt(i, j, s) = Const_F_Wt(T1[δ(i)..i], T2[δ(j)..j], s).
- Const_F_Wt(T1[δ(i1)..i], μ, 0) = Const_F_Wt(T1[δ(i1)..i-1], μ, 0) + d(T1[i], λ).
- Const_F_Wt(μ, T2[δ(j1)..j], 0) = Const_F_Wt(μ, T2[δ(j1)..j-1], 0) + d(λ, T2[j]).
- Const_F_Wt(T1[δ(i1)..i], μ, s) = ∞ if s > 0.
- Const_F_Wt(μ, T2[δ(j1)..j], s) = ∞ if s > 0.
- Lemma II essentially states the properties of the constrained distance when either s is zero or when either of the trees is null. These are "basis" cases that can be used in any recursive computation. For the non-basis cases, the scenarios when the trees are non-empty and when the constraining parameter, s, is strictly positive are considered. Theorem III gives the recursive property of Const_F_Wt in such a case.
- Const_F_Wt(T1[δ(i1)..i], T2[δ(j1)..j], s) = Const_F_Wt(T1[δ(i1)..i-1], T2[δ(j1)..j], s) + d(T1[i], λ). (ii) If T2[j] is not touched by any line in M, then T2[j] is to be inserted. Again, since the number of substitutions in Const_F_Wt(.,.,.) remains unchanged, the following is true:
- Const_F_Wt(T1[δ(i1)..i], T2[δ(j1)..j], s) = Const_F_Wt(T1[δ(i1)..i], T2[δ(j1)..j-1], s) + d(λ, T2[j]). (iii)
- Let (i,k) and (h,j) be the respective lines, i.e. (i,k) and (h,j) ∈ M. If δ(i1) ≤ h ≤ δ(i) - 1, then i is to the right of h and so k must be to the right of j by virtue of the sibling property of M. But this is impossible in T2[δ(j1)..j] since j is the rightmost sibling in T2[δ(j1)..j]. Similarly, if i is a proper ancestor of h, then k must be a proper ancestor of j by virtue of the ancestor property of M. This is again impossible since k ≤ j. So h has to be equal to i.
- s2 can take any value from 1 to Min{Size(i), Size(j), s}.
- Theorem III naturally leads to a recursive method, except that its time and space complexities will be prohibitively large.
- The main drawback with using Theorem III is that when substitutions are involved, the quantity Const_F_Wt(T1[δ(i1)..i], T2[δ(j1)..j], s) between the forests T1[δ(i1)..i] and T2[δ(j1)..j] is computed using the Const_F_Wts of the forests T1[δ(i1)..δ(i)-1] and T2[δ(j1)..δ(j)-1] and the Const_F_Wts of the remaining forests T1[δ(i)..i-1] and T2[δ(j)..j-1].
- Const_F_Wt(T1[δ(i1)..i], T2[δ(j1)..j], s) can be considered as a combination of Const_F_Wt(T1[δ(i1)..δ(i)-1], T2[δ(j1)..δ(j)-1], s-s2) and the tree weight between the trees rooted at i and j respectively, which is Const_T_Wt(i, j, s2). This is proved below.
- Const_T_Wt(i, j, s2) ≤ Const_F_Wt(T1[δ(i)..i-1], T2[δ(j)..j-1], s2-1) + d(T1[i], T2[j]).
- Const_T_Wt(i,j,s2) for the corresponding Const_F_Wt expressions, and the result follows.
- The details of the proof are found in Oommen and Lee (Information Sciences, Vol. 77, pp. 253-273 (1994)).
- Theorem IV suggests that a dynamic programming flavored method can be used to solve the constrained tree editing problem.
- The second part of Theorem IV suggests that to compute Const_T_Wt(i1, j1, s), the quantities Const_T_Wt(i, j, s2) must be available for all i and j and for all feasible values of 0 ≤ s2 ≤ s, where the nodes i and j are all the descendants of i1 and j1 except the nodes on the path from i1 to δ(i1) and the nodes on the path from j1 to δ(j1).
- The theorem also asserts that the distances associated with the nodes which are on the path from i1 to δ(i1) get computed as a by-product in the process of computing the Const_F_Wt between the trees rooted at i1 and j1. These distances are obtained as a by-product because, if the forests are trees, Const_F_Wt is retained as a Const_T_Wt.
- the set of nodes for which the computation of Const_T_Wt must be done independently before the Const_T_Wt associated with their ancestors can be computed is called the set of Essential_Nodes, and these are merely those nodes for which the computation would involve the second case of Theorem IV as opposed to the first.
- Essential_Nodes(T) = {k | there exists no k' > k such that δ(k) = δ(k')}.
- Const_T_Wt can be computed for the entire tree if Const_T_Wt of the Essential_Nodes are computed.
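The preprocessing that yields δ[] and Essential_Nodes[] can be sketched as below. This is a sketch under stated assumptions: the tree is given as a nested `(label, [children])` tuple, nodes are numbered in left-to-right post-order starting from 1, and δ(k) is taken to be the post-order number of the leftmost leaf below node k. The function name and the data layout are illustrative, not the formal Process Preprocess.

```python
def preprocess(tree):
    """Return (labels, delta, essential) for a tree given as (label, [children]).

    labels and delta are 1-indexed by post-order number (index 0 is unused).
    delta[k] is the post-order number of the leftmost leaf below node k.
    """
    labels, delta = [None], [0]

    def visit(node):
        label, children = node
        leftmost = [visit(child) for child in children]
        labels.append(label)
        k = len(labels) - 1
        delta.append(leftmost[0] if leftmost else k)
        return delta[k]

    visit(tree)
    # k is essential when no later node k' > k shares its leftmost leaf.
    last = {}
    for k in range(1, len(delta)):
        last[delta[k]] = k
    return labels, delta, sorted(last.values())

# Hypothetical example: a(b, c(d, e)), numbered b=1, d=2, e=3, c=4, a=5.
labels, delta, essential = preprocess(
    ("a", [("b", []), ("c", [("d", []), ("e", [])])])
)
```

In this example the root is always essential, and a node that is not the leftmost child of its parent is essential as well, which is why e, c and a are selected.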
- Const_F_Wt[x1, y1, s'] = Const_F_Wt(T1[δ(i)..δ(i)+x1-1], T2[δ(j)..δ(j)+y1-1], s'). Consequently, it must be noted that for every x1, y1, and s' in any intermediate step in the method, the quantity Const_T_Wt() that has to be stored in the permanent array can be obtained by incorporating these base values again, and has the form Const_T_Wt[x1+b1, y1+b2, s']. This is the rationale for the Process Compute_Const_T_Wt formally described above.
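The dynamic programming scheme described above can be sketched compactly. This is an illustrative reading of the basis cases and of the two recursive cases (both forests are trees, or not), not the formal Process itself. It assumes that the labels and δ values are supplied as 1-indexed post-order lists, that `d(a, b)` is a user-supplied elementary distance with `None` standing for the null symbol, and that infeasible entries are represented by `math.inf`. All names are hypothetical.

```python
import math

def compute_const_t_wt(labels1, delta1, labels2, delta2, d):
    """Return T with T[i][j][s] = Const_T_Wt(i, j, s)."""
    inf = math.inf
    n1, n2 = len(labels1) - 1, len(labels2) - 1
    T = [[[inf] * (min(n1, n2) + 1) for _ in range(n2 + 1)]
         for _ in range(n1 + 1)]

    def essential(delta):
        # Keep, for each leftmost leaf, the highest-numbered node above it.
        return sorted({delta[k]: k for k in range(1, len(delta))}.values())

    for i in essential(delta1):
        for j in essential(delta2):
            N, M = i - delta1[i] + 1, j - delta2[j] + 1
            R, b1, b2 = min(N, M), delta1[i] - 1, delta2[j] - 1
            F = [[[inf] * (R + 1) for _ in range(M + 1)] for _ in range(N + 1)]
            F[0][0][0] = 0.0
            for x in range(1, N + 1):      # delete every node, no substitutions
                F[x][0][0] = F[x - 1][0][0] + d(labels1[x + b1], None)
            for y in range(1, M + 1):      # insert every node, no substitutions
                F[0][y][0] = F[0][y - 1][0] + d(None, labels2[y + b2])
            for x in range(1, N + 1):
                for y in range(1, M + 1):
                    p, q = x + b1, y + b2
                    trees = delta1[p] == delta1[i] and delta2[q] == delta2[j]
                    a, b = delta1[p] - 1 - b1, delta2[q] - 1 - b2
                    for s in range(R + 1):
                        best = min(F[x - 1][y][s] + d(labels1[p], None),
                                   F[x][y - 1][s] + d(None, labels2[q]))
                        if trees and s >= 1:
                            best = min(best, F[x - 1][y - 1][s - 1]
                                       + d(labels1[p], labels2[q]))
                        elif not trees:
                            top = min(p - delta1[p] + 1, q - delta2[q] + 1, s)
                            for s2 in range(1, top + 1):
                                best = min(best, F[a][b][s - s2] + T[p][q][s2])
                        F[x][y][s] = best
                        if trees:          # forests are trees: retain the value
                            T[p][q][s] = best
    return T

def d(a, b):
    """Hypothetical distances: unit insertion/deletion, r = 1.5 for unequal labels."""
    if a is None or b is None:
        return 1.0
    return 0.0 if a == b else 1.5

# T1 = a(b, c) and T2 = a(b, d), both numbered in post-order.
T = compute_const_t_wt([None, "b", "c", "a"], [0, 1, 2, 1],
                       [None, "b", "d", "a"], [0, 1, 2, 1], d)
```

For this pair of three-node trees, `T[3][3]` evaluates to `[6.0, 4.0, 2.0, 1.5]`: with no substitutions every node is deleted and re-inserted, while with exactly three substitutions only the c-to-d substitution is paid for.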
- Since U can be an arbitrary subsequence tree of X*, it is obviously meaningless to compare Y with every X ∈ H using any known unconstrained tree editing algorithm. Before Y can be compared to the individual trees in H, the additional information obtainable from the noisy channel has to be used. Also, since the specific number of substitutions (or insertions/deletions) introduced in any specific transmission is unknown, it is reasonable to compare any X ∈ H and Y subject to the constraint that the number of substitutions that actually took place is its best estimate. Of course, in the absence of any other information, the best estimate of the number of substitutions that could have taken place is indeed its expected value, L.
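The resulting recognition rule can be sketched as a minimization over the dictionary. In this sketch, `constrained_distance` is a hypothetical callable that returns the constrained distance between a dictionary tree X and the noisy tree Y for a given constraint set, and `substitution_window` builds a constraint set as a small window around the expected number of substitutions L; the window width `slack` is an assumption made for illustration.

```python
def substitution_window(expected, slack=1):
    """Constraint set: the integers within `slack` of the expected count L."""
    return list(range(max(0, expected - slack), expected + slack + 1))

def recognize(dictionary, y, constrained_distance, tau):
    """Return the X in the dictionary H that minimizes the constrained distance to Y."""
    return min(dictionary, key=lambda x: constrained_distance(x, y, tau))
```

Keeping the distance computation behind a callable lets the same loop be used with any implementation of the constrained tree distance.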
- RNA ribonucleic acid
- e et al. Computers and Biomedical Research, 22, 461-473 (1989)
- Shapiro and Zhang Comp. Appl. Biosci. (1990)
- Shapiro Comput. Appl. Biosci., 387-393 (1988)
- Takahashi et al. A molecule of RNA is made up of a long sequence of subunits (the Ribonucleotides (RN)) which are linked together.
- RN Ribonucleotides
- Each Ribonucleotide contains one of the four possible bases, abbreviated by A, C, G, and U.
- This base sequence is called the primary structure of the RNA molecule.
- One example of an item that can be represented by a tree structure is the secondary structure of Ribonucleic Acids (RNA).
- RNA Ribonucleic Acids
- RNA sequence twists and bends and the bases form bonds with one another to yield complicated patterns. The latter bonding pattern is called its secondary structure.
- Research in this field has shown that similar structures have similar functionality and the use of sequence comparison by itself is inadequate for determining the structural homology as described by Shapiro and Zhang (Comp. Appl. Biosci. (1990)).
- RNA sequence may be represented as a tree, as explained by Shapiro and Zhang (Comp. Appl. Biosci. (1990)) and Shapiro
- RNA secondary structure trees can also help identify conserved structural motifs in an RNA folding process and construct taxonomy trees as explained by Shapiro and Zhang (Comp. Appl. Biosci. (1990)).
- The method proposed here can be used to recognize (classify) RNA secondary structure trees by merely processing noisy (garbled) versions of their subsequence trees. This could assist the biologist in tracing proteins when only their fragments are available for examination.
- FIG. 19 is a schematic diagram showing how the method described by the invention can be used in the recognition of RNA molecules from their noisy fragments. Since RNA secondary structures can be directly represented as tree structures, the recognition of RNA secondary structures from their fragments is a straightforward application of the solution to the NSuT problem.
- The inter-symbol distances in this case can be specified in terms of the likelihood of one base (or base pair) being misrepresented by another. This is traditionally achieved using the negative log-likelihood function. In the absence of such information, traditional 0/1 distances for equal/non-equal bases or base pairs can be utilized. They can also be learnt using the training methodology explained earlier.
- Use in Taxonomy
- taxonomy refers to the science of classifying organisms; the process of classification provides a framework for the organization of items.
- the notion of taxonomy is extended well beyond the classification of organisms to items such as DNA gene sequences, for example.
- The value of classification comes from its serving as an index to stored information and from its heuristic value, which allows for prediction and interpolation, permits the making of generalizations, and serves as a basis for explanation.
- The three main schools or philosophical approaches to taxonomy are 1) phenetic taxonomy or numerical taxonomy, which classifies on the basis of overall morphological or genetic similarity; 2) cladistic taxonomy or phylogenetic taxonomy, which classifies strictly on branching points; and 3) evolutionary taxonomy, traditional taxonomy, or gradistic taxonomy, which classifies on a combination of branching and divergence.
- The second is the representation of each element (e.g. a gene sequence) in an 'element specific' or 'signature' tree structure form, as dictated by the kinds of different features and the relationship of such features which each element may or may not have and in a manner similar to Figures 19A-D, 20A-D and 21A-D.
- the method of the invention uses the matching of this second type of tree structures to then identify the closest known element in a relational tree of elements and thereby obtain information regarding, for example, related gene sequences.
- the method of this invention can be utilized in tree and string taxonomy in a straightforward manner, when tree taxonomy or string taxonomy is applicable to determining the relationship between two or more elements.
- the tree taxonomy problem involves determining the similarity/dissimilarity and relationship between the various trees in a set of trees. These trees can be, for example, the tree representations of various viruses/bacteria or the genetic tree representations of various biological species, or compounds. Generally, pairs of trees having shorter inter-tree distances are more likely to be inter-related than those with longer inter-tree distances, permitting a relationship between the various trees to be determined.
- an enhanced similarity/dissimilarity measure, i.e. the inter-tree constrained edit distance
- sets of trees having shorter inter-tree distances measured using the method of this invention may be clustered according to their similarity, into sub-dictionaries, each sub-dictionary containing a cluster of similar trees.
- a hierarchical classification can be achieved. This clustering process can be repeated recursively to further refine the hierarchical classification.
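One simple way to form such sub-dictionaries can be sketched as follows. The text does not prescribe a particular clustering rule, so this sketch assumes a single-linkage rule with a user-chosen `threshold`: two trees fall into the same cluster whenever they are connected by a chain of pairs whose inter-tree distance does not exceed the threshold. The precomputed distance matrix and all names are hypothetical.

```python
def cluster(distance, threshold):
    """Group tree indices linked by pairwise distances at or below threshold."""
    n = len(distance)
    parent = list(range(n))

    def find(a):
        # Follow parent links to the cluster representative.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a in range(n):
        for b in range(a + 1, n):
            if distance[a][b] <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for a in range(n):
        groups.setdefault(find(a), []).append(a)
    return sorted(groups.values())
```

Applying the same function again inside each cluster with a smaller threshold refines the classification one level further.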
- strings can be considered as a tree in which each parent node has exactly one child
- the current invention can be directly applied to the corresponding problems involving strings - including the string taxonomy problem which involves determining the mutating relationships between the elements of a set of strings, which strings can be, for example, the representations of various viruses/bacteria, or the genetic string representations of various biological species, or compounds.
- the method of this invention can be used to recognize chemical compounds that are described in terms of molecules. They are recognized from their noisy fragments, also described in terms of their component molecules. Since chemical compounds are graphs, each compound is first mapped into a set of representative tree structures. Similarly, the noisy fragment of the compound is also mapped into a set of representative tree structures. The compound recognition is achieved by invoking the solution to the NSuT problem between the various tree representations of each compound and the tree representations of the noisy fragment.
- Figure 20 is a schematic diagram showing how the invention can be used for this purpose, and the implementation of the invention is straightforward by specifying the inter-symbol distances between the molecules.
- the method of this invention can be used to recognize chemical compounds that are described in terms of atomic structures. They are recognized from their noisy fragments, also described in terms of their component atomic structures. Since chemical compounds are graphs, each compound is first mapped into a set of representative tree structures, where the nodes are the atoms. Similarly, the noisy fragment of the compound is also mapped into a set of representative tree structures. The compound recognition is achieved by invoking the solution to the NSuT problem between the various tree representations of each compound and the tree representations of the noisy fragment.
- Figure 20 is a schematic diagram showing how the invention can be used for this purpose, and the implementation of the invention is straightforward by specifying the inter-symbol distances between the respective atoms.
- These distances can be specified in terms of the likelihood of one atom being transformed into (misrepresented by) another, and are related to the positions of the atoms in the periodic table. This can be achieved using the negative log-likelihood function of the confusion probabilities. In the absence of such information, traditional 0/1 distances for equal/non-equal symbols can be utilized. The inter-symbol distances can also be learnt using the training methodology explained earlier.
- The method of this invention can be used to recognize fingerprints.
- The fingerprints are first preprocessed as described by Johannesen et al. (Proc. of SSPR'96, (1996)) and described in terms of their minutiae. This is a straightforward, necessary step required in any fingerprint recognition system, because the fingerprint image has to be represented in terms of its features, and the best features in this problem domain are the minutiae. They are recognized from their noisy sub-portions, which may or may not be contiguous. These noisy sub-portions are also described in terms of their component minutiae after the same preprocessing.
- Each fingerprint is first mapped into a set of representative tree structures.
- The noisy fragment of the fingerprint is also mapped into a set of representative tree structures.
- The fingerprint recognition is achieved by invoking the solution to the NSuT problem between the various tree representations of each fingerprint and the tree representations of the noisy sub-portion.
- Figure 22 is a schematic diagram showing how the invention can be used for this purpose, and the implementation of the invention is straightforward by specifying the inter-symbol distances between the respective types of minutiae.
- These distances can be specified in terms of the likelihood of one minutia being transformed into (misrepresented by) another, and are related to the characteristics of the image processing environment which distinguishes the minutiae themselves from the "raw" image. In the absence of such information, traditional 0/1 distances for equal/non-equal minutiae can be utilized. The inter-symbol distances between the minutiae can also be learnt using the training methodology explained earlier.
- the method of this invention can be used to recognize maps.
- The maps are first preprocessed using standard image processing operations (see Haralick and Shapiro (Computer and Robot Vision (1992))) and described in terms of their distinguishing features (landmarks) such as stop signs, yields, stop lights, bridges, railroad crossings etc. This is a straightforward, necessary step, and such a description is usually available in most geographical information systems.
- the maps are recognized from their noisy sub-portions which may or may not be contiguous. These noisy sub-portions are also described in terms of their component distinguishing features after the same preprocessing. Since numerous tree representations of each map are possible, each map is first mapped into a set of representative tree structures.
- FIG. 23 is a schematic diagram showing how the invention can be used in the recognition of maps.
- The implementation of the invention in this problem domain is straightforward by specifying the inter-symbol distances between the respective types of distinguishing features. These distances can be specified in terms of the likelihood of one distinguishing feature being transformed into (misrepresented by) another, and are related to the characteristics of the image processing environment of the GIS system which recognizes the distinguishing features themselves from the "raw" image. Again, in the absence of such information, traditional 0/1 distances for equal/non-equal distinguishing landmarks can be utilized.
- the inter-symbol distances can also be learnt using the training methodology explained earlier.
- the method of this invention can be used to recognize electronic circuitry.
- the circuits are first preprocessed and described in terms of their components and wiring diagrams which form the nodes and edges of the underlying graph.
- The nodes in this case are the various electronic components such as resistors, diodes, transistors, capacitors etc. Obtaining this representation is straightforward, since most circuits are designed on paper (or in a computer) before they are implemented in hardware.
- the circuits are recognized from their noisy sub-portions which may or may not be contiguous. Thus the portion of the circuit available may come from different portions of the circuit to be recognized. Since numerous tree representations of each electronic circuit are possible, each electronic circuit is first mapped into a set of representative tree structures.
- FIG. 24 is a schematic diagram showing how the invention can be used in this application domain.
- The implementation of the invention in this problem domain is straightforward by specifying the inter-symbol distances between the respective types of components. These distances can be specified in terms of the likelihood of one component (resistor, diode etc.) being transformed into (misrepresented by) another, and are related to the characteristics of the hardware set-up which recognizes the components themselves from the actual circuit or printed circuit board. Again, in the absence of such information, traditional 0/1 distances for equal/non-equal components can be utilized. As before, the inter-symbol distances can also be learnt using the training methodology explained earlier.
- the method of this invention can be used to recognize flow charts.
- the flow charts are first preprocessed and described in terms of their graphical features (the symbolic icons) which form the nodes of the underlying graph.
- The nodes in this case are the various symbols used in flow charting such as assignments, loops, comparisons, control structures etc. Obtaining this representation is straightforward, since most flow charts are drawn on paper (or in a computer) before they are implemented in software.
- the flow charts are recognized from their noisy sub-portions which may or may not be contiguous. Since numerous tree representations of each flow chart are possible, each flow chart is first mapped into a set of representative tree structures. Similarly, the noisy fragment of the flow chart sought for is also mapped into a set of representative tree structures.
- FIG. 25 is a schematic diagram showing how the invention can be used in the recognition of flow charts.
- the implementation of the invention to this problem domain is straightforward by specifying the inter-symbol distances between the respective types of flow-charting iconic symbols. These distances can be specified in terms of the likelihood of one symbol being transformed into (misrepresented by) another. As usual, in the absence of such information, they can be learnt using the training process explained earlier or traditional 0/1 distances for equal/non-equal iconic symbols can be utilized.
- the method of the present invention can be applied to the fundamental problem of data mining in areas where current day technology is not applicable.
- the data to be mined is represented symbolically.
- Current day syntactic data mining tools seek patterns in which the relationship between the symbols in the data is governed by a left-to-right or right-to-left ordering.
- the method of this invention would be capable of mining the data where the relationship between the symbols in the data is governed by both a left-to-right (or right-to-left) ordering and a latent parent-child relationship.
- the method could be used to discover patterns which are actually governed by a tree relationship, but which relationship is occluded by the string representation of the data to be mined.
- the method of the invention can search for the pattern where the pattern sought for is distributed over a larger supersequence as "4abcbfsjd2iejf6iejfif6". Furthermore, this supersequence could also be noisy, for example, "4abcbfsjd2iejf6iejfif3".
- the method of the present invention can be used in musical applications.
- a user is searching for a musical piece in a music library.
- The user intends to discover a musical piece, but the input to the search mechanism would be a poorly played (for example, by playing on a keyboard) version of only a segment of one "part" (as in soprano, alto, tenor and bass) of the score.
- neither these segments nor the individual notes need be contiguous.
- the method of this invention can be used to search for and present the user with the best score in the library that contains the poorly played segment as a sub-score or as a sequence of incorrectly played notes.
- the notes of the score could be the symbols in the alphabet, and each "part" could be treated as a separate sequence of notes which collectively describe the concerned score.
- the method of the invention would work with the string (i.e., the uni-dimensional left-to-right) representation since the tree representation is superfluous.
- the string representation can be mapped to a tree representation by each node having only a single child.
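This mapping can be sketched in a few lines. The sketch assumes the nested `(label, [children])` tuple layout for trees and, as a further assumption, places the last symbol of the string at the root so that a post-order traversal of the resulting chain reproduces the original left-to-right order; the function name is hypothetical.

```python
def string_to_tree(s):
    """Return a chain-shaped tree in which every parent has exactly one child.

    The first symbol becomes the single leaf and the last symbol the root.
    An empty string yields None.
    """
    tree = None
    for ch in s:
        tree = (ch, [] if tree is None else [tree])
    return tree
```

Choosing the opposite orientation would work equally well, provided that the dictionary entries and the noisy input are converted in the same way.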
- the tree structures of the patterns were studied as parenthesized lists in a left-to-right post-order fashion.
- L = (B C D 'a')
- B, C and D can themselves be trees in which cases the embedded lists of B, C and D are inserted in L.
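The list notation can be illustrated with a small serializer. This sketch assumes that a tree is held as a nested `(label, [children])` tuple and that the list for a node consists of the embedded lists of its children, in left-to-right order, followed by the quoted label of the node itself, which is one reading of the post-order convention described above. The function name and the exact spacing are assumptions.

```python
def to_list(tree):
    """Serialize a (label, [children]) tree as a post-order parenthesized list."""
    label, children = tree
    parts = [to_list(child) for child in children] + [repr(label)]
    return "(" + " ".join(parts) + ")"
```

For a root 'a' with two leaf children 'b' and 'c', the function returns `(('b') ('c') 'a')`.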
- A specific example of a tree (taken from the dictionary) and its parenthesized list representation is given in Figure VI.
- H consisted of 25 manually constructed trees which varied in sizes from 25 to 35 nodes.
- An example of a tree in H is given in Figure VI.
- To generate an NSuT for the testing process, a tree X* (unknown to the classification process) was chosen. Nodes from X* were first randomly deleted, producing a subsequence tree, U. In the experimental set-up the probability of deleting a node was set to be 60%. Thus although the average size of each tree in the dictionary was 29.88, the average size of the resulting subsequence trees was only 11.95.
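The random deletion step can be sketched as follows. The sketch assumes the nested `(label, [children])` tuple layout and the usual tree-editing convention that the children of a deleted node are attached, in order, to its parent. As a further assumption made only to keep the result a single tree, the root is never deleted; the experiment described above does not state how root deletions were handled. The function name is hypothetical.

```python
import random

def subsequence_tree(tree, p, rng):
    """Delete each non-root node independently with probability p."""
    def visit(node, is_root):
        label, children = node
        # Surviving descendants of each child, kept in left-to-right order.
        kept = [t for child in children for t in visit(child, False)]
        if not is_root and rng.random() < p:
            return kept            # node deleted: its children move up
        return [(label, kept)]

    return visit(tree, True)[0]

# Hypothetical usage with the 60% deletion probability of the experiment.
x_star = ("a", [("b", [("d", [])]), ("c", [])])
u = subsequence_tree(x_star, 0.6, random.Random(0))
```

With a deletion probability of 60%, about 40% of the nodes survive on average, which agrees with the reported sizes (29.88 × 0.4 ≈ 11.95).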
- The alphabet involved was the English alphabet, and the conditional probability of inserting any character a ∈ A given that an insertion occurred was assigned the value 1/26. Similarly, the probability of a character being deleted was set to be 1/20.
- The table of probabilities for substitution (the confusion matrix, Table I) was based on the proximity of the character keys on a standard QWERTY keyboard and is given in Figure 26. The channel essentially mutated the nodes (characters, in this case) in the list while ignoring the parentheses, and whenever an insertion or a deletion was introduced, special case scenarios were considered so as to insert the additional required parentheses or remove the superfluous parentheses respectively. Furthermore, the parentheses were maintained in such a way that the underlying parenthesized expression remained well-matched.
- Table IV The noise statistics associated with the set of noisy subsequence trees used in testing.
- G is the stochastic grammar with associated probabilities, P, given below :
- a tree X* (unknown to the PR system) was chosen. Nodes from X* were first randomly deleted producing a subsequence tree, U. In the present case the probability of deleting a node was set to be 60%. Thus although the average size of each tree in the dictionary was 31.45, the average size of the resulting subsequence trees was only 13.42.
- the garbling effect of the noise was then simulated as in the earlier set-up.
- the subsequence tree U was subjected to additional substitution, insertion and deletion errors by passing the string representation through a channel causing substitution, insertion and deletion errors as described earlier while simultaneously maintaining the underlying list representation of the tree.
- the alphabet being the English alphabet
- the probabilities of insertion, deletion and the various confusion substitutions were as described earlier and were based on the QWERTY keyboard.
- The average number of tree deforming operations done per tree was 3.77.
- Table V gives the average number of errors involved in the mutation of a subsequence tree, U.
- Table V The noise statistics, associated with the set of noisy subsequence trees used in testing.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP00967448A EP1224613A1 (en) | 1999-10-07 | 2000-09-29 | A method of comparing the closeness of a target tree to other trees using noisy subsequence tree processing |
AU77645/00A AU7764500A (en) | 1999-10-07 | 2000-09-29 | A method of comparing the closeness of a target tree to other trees using noisy subsequence tree processing |
CA2386578A CA2386578C (en) | 1999-10-07 | 2000-09-29 | A method of comparing the closeness of a target tree to other trees using noisy subsequence tree processing |
IL14901900A IL149019A0 (en) | 1999-10-07 | 2000-09-29 | A method of comparing the closeness of a target tree to other trees using noisy subsequence tree processing |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA2,285,171 | 1999-10-07 | ||
CA2285171 | 1999-10-07 | ||
CA2299047 | 2000-02-21 | ||
CA2,299,047 | 2000-02-21 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2001026044A1 (en) | 2001-04-12 |
Family
ID=25681239
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CA2000/001107 WO2001026044A1 (en) | 1999-10-07 | 2000-09-29 | A method of comparing the closeness of a target tree to other trees using noisy subsequence tree processing |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1224613A1 (en) |
AU (1) | AU7764500A (en) |
IL (1) | IL149019A0 (en) |
WO (1) | WO2001026044A1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8626693B2 (en) | 2011-01-14 | 2014-01-07 | Hewlett-Packard Development Company, L.P. | Node similarity for component substitution |
US8730843B2 (en) | 2011-01-14 | 2014-05-20 | Hewlett-Packard Development Company, L.P. | System and method for tree assessment |
US8832012B2 (en) | 2011-01-14 | 2014-09-09 | Hewlett-Packard Development Company, L. P. | System and method for tree discovery |
US9053438B2 (en) | 2011-07-24 | 2015-06-09 | Hewlett-Packard Development Company, L. P. | Energy consumption analysis using node similarity |
US9589021B2 (en) | 2011-10-26 | 2017-03-07 | Hewlett Packard Enterprise Development Lp | System deconstruction for component substitution |
US9817918B2 (en) | 2011-01-14 | 2017-11-14 | Hewlett Packard Enterprise Development Lp | Sub-tree similarity for component substitution |
2000
- 2000-09-29 WO PCT/CA2000/001107 (WO2001026044A1): not active, Application Discontinuation
- 2000-09-29 EP EP00967448A (EP1224613A1): not active, Withdrawn
- 2000-09-29 IL IL14901900A (IL149019A0): status unknown
- 2000-09-29 AU AU77645/00A (AU7764500A): not active, Abandoned
Non-Patent Citations (2)
Title |
---|
OOMMEN B J: "RECOGNITION OF NOISY SUBSEQUENCES USING CONSTRAINED EDIT DISTANCES", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,US,IEEE INC. NEW YORK, vol. PAMI-9, no. 5, 1 September 1987 (1987-09-01), pages 676 - 685, XP000036961, ISSN: 0162-8828 * |
ZHANG K: "Algorithms for the constrained editing distance between ordered labeled trees and related problems", PATTERN RECOGNITION,US,PERGAMON PRESS INC. ELMSFORD, N.Y, vol. 28, no. 3, 1 March 1995 (1995-03-01), pages 463 - 474, XP004024864, ISSN: 0031-3203 * |
Also Published As
Publication number | Publication date |
---|---|
IL149019A0 (en) | 2002-11-10 |
AU7764500A (en) | 2001-05-10 |
EP1224613A1 (en) | 2002-07-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7287026B2 (en) | Method of comparing the closeness of a target tree to other trees using noisy sub-sequence tree processing | |
Blumer et al. | Complete inverted files for efficient text retrieval and analysis | |
Haussler | Convolution kernels on discrete structures | |
US20070093942A1 (en) | Method for solving waveform sequence-matching problems using multidimensional attractor tokens | |
Abu-Aisheh et al. | Efficient k-nearest neighbors search in graph space | |
Gawrychowski et al. | Efficiently testing Simon's congruence | |
US20030130977A1 (en) | Method for recognizing trees by processing potentially noisy subsequence trees | |
Mäkinen et al. | Transposition invariant string matching | |
Oommen et al. | String taxonomy using learning automata | |
Grohe et al. | Learning MSO-definable hypotheses on strings | |
Navarro | Pattern matching | |
Ukkonen et al. | Sweepline the music | |
Blumenthal | New techniques for graph edit distance computation | |
Huang et al. | Fast algorithms for finding the common subsequence of multiple sequences | |
Ye et al. | Learning deep graph representations via convolutional neural networks | |
EP1224613A1 (en) | A method of comparing the closeness of a target tree to other trees using noisy subsequence tree processing | |
Hadzic et al. | UNI3-efficient algorithm for mining unordered induced subtrees using TMG candidate generation | |
Eidhammer et al. | A constraint based structure description language for biosequences | |
Casali et al. | A catalogue of orientable 3-manifolds triangulated by 30 colored tetrahedra | |
Oommen | String alignment with substitution, insertion, deletion, squashing, and expansion operations | |
Hakata et al. | Algorithms for the longest common subsequence problem for multiple strings based on geometric maxima | |
CA2386578C (en) | A method of comparing the closeness of a target tree to other trees using noisy subsequence tree processing | |
JP2005528713A (en) | How to solve frequency, frequency distribution and sequence matching problems using multi-dimensional attractor tokens | |
CN109727645B (en) | Biological sequence fingerprint | |
Oommen et al. | On the pattern recognition of noisy subsequence trees |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
| AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
| DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | |
| WWE | WIPO information: entry into national phase | Ref document number: 2386578; Country of ref document: CA |
| WWE | WIPO information: entry into national phase | Ref document number: 149019; Country of ref document: IL |
| WWE | WIPO information: entry into national phase | Ref document number: 2000967448; Country of ref document: EP |
| WWE | WIPO information: entry into national phase | Ref document number: IN/PCT/2002/497/KOL; Country of ref document: IN |
| WWP | WIPO information: published in national office | Ref document number: 2000967448; Country of ref document: EP |
| WWW | WIPO information: withdrawn in national office | Ref document number: 2000967448; Country of ref document: EP |
| NENP | Non-entry into the national phase | Ref country code: JP |