GB2437560A - Constructing Aho Corasick trees - Google Patents

Info

Publication number
GB2437560A
Authority
GB
United Kingdom
Prior art keywords
node
failure
nodes
string
prefix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0608420A
Other versions
GB0608420D0 (en
Inventor
Neil Duxbury
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Roke Manor Research Ltd
Original Assignee
Roke Manor Research Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Roke Manor Research Ltd
Priority to GB0608420A
Publication of GB0608420D0
Priority to US11/783,201 (US7769788B2)
Publication of GB2437560A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention comprises a method of constructing an Aho-Corasick tree characterised wherein the tree is constructed in a general depth first manner, one string at a time. Preferably during construction of said string, after each node is added, failure links or extended failure links from that node to other nodes are added and failure links to nodes in the newly inserted branch are re-updated.

Description

<p>Improved Aho-Corasick Methodology for String Searching</p>
<p>In many information retrieval and text-editing applications it is necessary to be able to locate quickly some or all occurrences of user-specified patterns of words and phrases in text. The paper entitled "Efficient String Matching: An Aid to Bibliographic Search" by Alfred V. Aho and Margaret J. Corasick, Bell Laboratories, describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm consists of two parts. In the first part we construct from the set of keywords a finite state pattern matching machine; in the second part we apply the text string as input to the pattern matching machine. The machine signals whenever it has found a match for a keyword.</p>
<p>The prior art Aho-Corasick methodology will now be described as background to the invention.</p>
<p>A string is simply a finite sequence of characters. Let K = {y1, y2, ..., yk} be a finite set of strings which we shall call keywords and let x be an arbitrary string which we shall call the text string. Our problem is to locate and identify all substrings of x which are keywords in K. Substrings may overlap with one another. A pattern matching machine for K is a program which takes as input the text string x and produces as output the locations in x at which keywords of K appear as substrings. The pattern matching machine consists of a set of states or nodes. Each state is represented by a number. The machine processes the text string x by successively reading the characters in x, making state transitions and occasionally emitting output. The behaviour of the pattern matching machine is dictated by three functions: a goto function g, a failure function f and an output function output. Figure 1 shows the functions used by a pattern matching machine for the set of keywords {he, she, his, hers}.</p>
<p>One state (usually 0) is designated as the start or root node. In the Figure 1 example, the nodes are 0, 1, ..., 9. The goto function g maps a pair consisting of a state and an input character into a node or the message fail. The directed graph in Figure 1(a) represents the goto function. For example, the edge labelled h from 0 to 1 indicates that g(0, h) = 1. The absence of an arrow indicates fail. Thus, g(1, a) = fail for all input characters a that are not e or i. All our pattern matching machines have the property that g(0, a) ≠ fail for all input characters a. We shall see that this property of the goto function on state 0 ensures that one input character will be processed by the machine in every machine cycle.</p>
<p>The failure function f maps a node into a node. The failure function is consulted whenever the goto function reports fail. Certain nodes are designated as output nodes, which indicate that a set of keywords has been found. The output function formalizes this concept by associating a set of keywords (possibly empty) with every node.</p>
<p>An operating cycle of a pattern matching machine is defined as follows. Let s be the current node of the machine and a the current character of the input string x.</p>
<p>1. If g(s, a) = s', the machine makes a goto transition. It enters state s', and the next character of x becomes the current input character. In addition, if output(s') ≠ empty, then the machine emits the set output(s') along with the position of the current input character.</p>
<p>The operating cycle is now complete.</p>
<p>2. If g(s, a) = fail, the machine consults the failure function f and is said to make a failure transition. If f(s) = s', the machine repeats the cycle with s' as the current node and a as the current input.</p>
<p>Initially, the current state of the machine is the start state and the first character of the text string is the current input character. The machine then processes the text string by making one operating cycle on each character of the text string. For example, consider the behaviour of the machine M that uses the functions in Figure 1 to process the text string "ushers." Figure 2 indicates the state transitions made by M in processing the text string.</p>
<p>Table I. Sequence of node transitions.</p>
<p>Input characters: u, s, h, e, r, s. Nodes entered: 0, 0, 3, 4, 5, 8, 9 (the first 0 is the start node, before any input is read).</p>
<p>Consider the operating cycle when M is in state 4 and the current input character is e. Since g(4, e) = 5, the machine enters state 5, advances to the next input character and emits output(5), indicating that it has found the keywords "she" and "he" at the end of position four in the text string. In state 5 on input character r, the machine makes two node transitions in its operating cycle. Since g(5, r) = fail, M enters node 2 = f(5). Then since g(2, r) = 8, M enters node 8 and advances to the next input character. No output is generated in this operating cycle.</p>
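<p>The operating cycle above can be traced with a short sketch. The tables below are transcribed by hand for the keyword set {he, she, his, hers}; the failure values for states 7 to 9 are not spelled out in the text and are assumed here from the standard construction. The names GOTO, FAILURE, OUTPUT, g and run are illustrative only.</p>

```python
# Goto, failure and output functions for {he, she, his, hers}.
# State numbers follow the text; the tables are hand-transcribed assumptions.
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8,
    (3, "h"): 4,
    (4, "e"): 5,
    (6, "s"): 7,
    (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def g(state, ch):
    """Goto function: returns a state, or None for 'fail'."""
    if (state, ch) in GOTO:
        return GOTO[(state, ch)]
    return 0 if state == 0 else None  # state 0 loops on every other character


def run(text):
    """Make one operating cycle per character; return (states, matches)."""
    state, states, matches = 0, [0], []
    for pos, ch in enumerate(text, start=1):
        while g(state, ch) is None:       # failure transitions
            state = FAILURE[state]
        state = g(state, ch)              # goto transition
        states.append(state)
        for keyword in sorted(OUTPUT.get(state, ())):
            matches.append((pos, keyword))
    return states, matches


states, matches = run("ushers")
print(states)   # [0, 0, 3, 4, 5, 8, 9]
print(matches)  # [(4, 'he'), (4, 'she'), (6, 'hers')]
```

<p>The failure transition from node 5 to node 2 happens inside the while loop, which is why node 2 does not appear in the list of nodes entered after each character.</p>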
<p>We say that the three functions g, f, and output are valid for a set of keywords if with these functions Algorithm I indicates that keyword y ends at position i of text string x if and only if x = uyv and the length of uy is i.</p>
<p>We shall now show how to construct valid goto, failure and output functions from a set of keywords. There are two parts to the construction. In the first part we determine the states and the "goto" function. In the second part we compute the failure function. The computation of the output function is begun in the first part of the construction and completed in the second part.</p>
<p>To construct the "goto" function, we shall construct a goto graph. We begin with a graph consisting of one vertex which represents the state 0. We then enter each keyword y into the graph, by adding a directed path to the graph that begins at the start state. New vertices and edges are added to the graph so that there will be, starting at the start state, a path in the graph that spells out the keyword y. The keyword y is added to the output function of the state at which the path terminates. We add new edges to the graph only when necessary.</p>
<p>For example, suppose {he, she, his, hers} is the set of keywords. Adding the first keyword to the graph, we obtain the trie of fig. 2a. The path from state 0 to state 2 spells out the keyword "he"; we associate the output "he" with state 2. Adding the second keyword "she," we obtain fig. 2b. The output "she" is associated with state 5. Adding the keyword "his," we obtain fig. 2c. Notice that when we add the keyword "his" there is already an edge labelled h from state 0 to state 1, so we do not need to add another edge labelled h from state 0 to state 1. The output "his" is associated with state 7. Adding the last keyword "hers," we obtain fig. 2d. The output "hers" is associated with state 9. Here we have been able to use the existing edge labelled h from state 0 to 1 and the existing edge labelled e from state 1 to 2. Up to this point the graph is a rooted directed tree. To complete the construction of the goto function we add a loop from state 0 to state 0 on all input characters other than h or s. We obtain the directed graph shown in Figure 1(a). This graph represents the goto function.</p>
<p>The failure function is constructed from the goto function. Let us define the depth of a state s in the goto graph as the length of the shortest path from the start state to s. Thus in Figure 1(a), the start state is of depth 0, states 1 and 3 are of depth 1, states 2, 4, and 6 are of depth 2, and so on. We shall compute the failure function for all states of depth 1, then for all states of depth 2, and so on, until the failure function has been computed for all states (except state 0, for which the failure function is not defined). The algorithm to compute the failure function f at a state is conceptually quite simple. We make f(s) = 0 for all states s of depth 1. Now suppose f has been computed for all states of depth less than d. The failure function for the states of depth d is computed from the failure function for the states of depth less than d. The states of depth d can be determined from the non-fail values of the goto function of the states of depth d-1.</p>
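<p>The goto graph construction just described can be sketched as follows. This is a minimal illustration, not the patent's own code: build_goto is a hypothetical name, and the loop on state 0 is left implicit (a missing edge at state 0 is taken to lead back to state 0).</p>

```python
def build_goto(keywords):
    """Enter each keyword into the goto graph, adding states only when necessary."""
    goto = [{}]      # goto[state] maps a character to the next state
    output = {}      # state -> set of keywords whose path terminates there
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:        # no edge yet: create a new state
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output.setdefault(state, set()).add(word)
    return goto, output


goto, output = build_goto(["he", "she", "his", "hers"])
print(goto[0])   # {'h': 1, 's': 3}
print(output)    # {2: {'he'}, 5: {'she'}, 7: {'his'}, 9: {'hers'}}
```

<p>Because new states are numbered in order of creation, inserting the keywords in the order given reproduces the state numbers used in the text.</p>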
<p>Specifically, to compute the failure function for the nodes of depth d, we consider each node r of depth d -1 and perform the following actions.</p>
<p>1. If g(r, a) = fail for all a, do nothing.</p>
<p>2. Otherwise, for each character a such that g(r, a) = s, do the following:</p>
<p>(a) Set node = f(r).</p>
<p>(b) Execute the statement node ← f(node) zero or more times, until a value for node is obtained such that g(node, a) ≠ fail. (Note that since g(0, a) ≠ fail for all a, such a node will always be found.)</p>
<p>(c) Set f(s) = g(node, a).</p>
<p>For example, to compute the failure function from Figure 1(a), we would first set f(1) = f(3) = 0 since 1 and 3 are the nodes of depth 1. We then compute the failure function for 2, 6 and 4, the nodes of depth 2. To compute f(2), we set node = f(1) = 0; and since g(0, e) = 0, we find that f(2) = 0. To compute f(6), we set node = f(1) = 0; and since g(0, i) = 0, we find that f(6) = 0. To compute f(4), we set node = f(3) = 0; and since g(0, h) = 1, we find that f(4) = 1.</p>
<p>Continuing in this fashion, we obtain the failure function shown in Figure 1(b).</p>
<p>During the computation of the failure function we also update the output function. When we determine f(s) = s', we merge the outputs of node s with the outputs of node s'. For example, from Figure 1(a) we determine f(5) = 2. At this point we merge the output set of state 2, namely {he}, with the output set of node 5 to derive the new output set {he, she}. The final nonempty output sets are shown in Figure 1(c).</p>
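<p>The breadth-first computation of the failure and output functions described above can be sketched as follows. The function name build_machine and the list-based representation are assumptions made for illustration; the comments map each line to steps (a) to (c) of the text.</p>

```python
from collections import deque


def build_machine(keywords):
    """Build goto, failure and output functions level by level (breadth first)."""
    goto, output = [{}], [set()]
    for word in keywords:                     # first part: goto graph
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)

    failure = [0] * len(goto)                 # f(s) = 0 for states of depth 1
    queue = deque(goto[0].values())
    while queue:                              # second part: failure function
        r = queue.popleft()
        for a, s in goto[r].items():
            queue.append(s)
            node = failure[r]                             # step (a)
            while node != 0 and a not in goto[node]:      # step (b)
                node = failure[node]
            failure[s] = goto[node].get(a, 0)             # step (c); g(0, a) = 0 without an edge
            output[s] |= output[failure[s]]               # merge the output sets
    return goto, failure, output


goto, failure, output = build_machine(["he", "she", "his", "hers"])
print(failure)     # [0, 0, 0, 0, 1, 2, 0, 3, 0, 3]
print(output[5])
```

<p>The entry at index 0 of the failure list is only a placeholder, since the failure function is not defined for state 0.</p>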
<p>So far we have only discussed the case where there is only one failure link going from a particular node. In a refined version of the Aho-Corasick methodology, also discussed in the paper, a node at which a failure occurs may have several failure links, the one taken depending on the character under consideration. This is best described with reference to figure X which shows a table of the failure links for the same example above. The next move function is encoded in Figure 3 as follows. In node 0, for example, we have a transition on h to node 1, a transition on s to node 3, and a transition on any other character to node 0. In each node, the dot stands for any other character. This refined methodology is referred to hereinafter as the extended link methodology, and the methodology described previously as the normal failure link methodology. The invention described hereinafter is applicable to both.</p>
<p>Problem</p>
<p>So far what has been discussed above has been the known art. The drawback of the known Aho-Corasick methodology is the need to recompile the structure if an update is made. This takes a considerable amount of processing power, especially as the known Aho-Corasick structure has to be built up "breadth first", i.e. a depth at a time for each string.</p>
<p>The current application addresses this issue by defining an algorithm for constructing the automaton in a depth first manner. This is used with a specific realisation of the approach to provide an efficient mechanism to update the automaton without a full recompilation. With the conventional breadth first approach to building the extended version of the automaton, the addition of a string to the existing structure would require all of the state transitions to be updated. For large keyword sets the computational cost of updating the entire structure is excessive and prevents the structure from being updated whilst online.</p>
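<p>One way to read the extended link methodology is as folding the failure function into per-character transitions, so that each node carries its own edges plus those inherited from its failover node. The sketch below illustrates this under that assumption; the GOTO and FAILURE tables are hand-transcribed for {he, she, his, hers}, and next_move_table is a hypothetical name.</p>

```python
GOTO = {0: {"h": 1, "s": 3}, 1: {"e": 2, "i": 6}, 2: {"r": 8}, 3: {"h": 4},
        4: {"e": 5}, 5: {}, 6: {"s": 7}, 7: {}, 8: {"s": 9}, 9: {}}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}


def next_move_table(goto, failure):
    """Fold failure links into transitions; '.' stands for any other character."""
    delta = {0: dict(goto[0])}
    delta[0]["."] = 0
    # Visit nodes in order of increasing depth so delta[failure[s]] is ready.
    order, frontier = [], list(goto[0].values())
    while frontier:
        order.extend(frontier)
        frontier = [t for s in frontier for t in goto[s].values()]
    for s in order:
        delta[s] = dict(delta[failure[s]])   # inherit the failover node's moves
        delta[s].update(goto[s])             # the node's own edges take priority
    return delta


delta = next_move_table(GOTO, FAILURE)
print(delta[0])   # {'h': 1, 's': 3, '.': 0}
print(delta[5])   # {'h': 1, 's': 3, '.': 0, 'r': 8}
```

<p>With such a table, a single lookup of the form delta[s].get(ch, delta[s]["."]) replaces the loop over failure transitions.</p>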
<p>The invention comprises a method of constructing an Aho-Corasick tree characterised wherein the tree is constructed in a general depth first manner, one string at a time. Preferably during construction of said string, after each node is added, failure links or extended failure links from that node to other nodes are added and failure links to nodes in the newly inserted branch are re-updated.</p>
<p>In a preferred efficient method only those links to the nodes in the new branch which need updating are identified.</p>
<p>Terminology</p>
<p>For the purposes of clarity the following terminology will be used in the claims. One starts with a non-empty finite set called an alphabet. Elements of this alphabet are called characters. A string over the alphabet is any finite sequence of characters from it. In information processing, a state is the complete set of properties transmitted by an object to an observer via one or more channels. A transition indicates a state change and is described by a condition that would need to be fulfilled to enable the transition. An automaton is a mathematical model for a finite state machine (FSM). An FSM is a machine that, given an input, jumps through a series of states according to a transition function. This transition function tells the automaton which state to go to next given a current state and a current character. An automaton that can be used to recognise the string 'nice' is illustrated in figure 4b. In that example it can be seen that every state, with the exception of the start and error states, labels a prefix of the string 'nice'. The transitions of the automaton are indicated by the arrows connecting the states. The transition function for the automaton is as follows: given an arbitrary state and an arbitrary character, move to the state pointed to by the transition labelled with said character. For example, if the machine is in the start state and the character is 'n', the state of the machine will change from the start state to the 'n found' state.</p>
<p>In the theory of computation, a deterministic finite state machine or deterministic finite automaton (DFA) is a finite state machine where for each pair of state and input character there is one and only one transition to a next state. A trie is an ordered tree data structure that is used to store an associative array where the keys are strings. A trie can be seen as a deterministic finite state automaton. A suffix trie of a string is a trie representing all the suffixes of that string. In general a state can also be referred to as a node and a transition can be referred to as an edge. The deterministic finite state automaton for the strings 'to', 'tea', 'ten', 'i', 'in', and 'inn' is shown in figure 4c.</p>
<p>Within the automaton the start node is commonly referred to as the root; this is shown in the diagram as the uppermost node (see fig. 4a). A parent node is a node that is linked via an edge to a deeper node in the figure. For example the node labelled 't' is the parent of the nodes labelled 'to' and 'te'. A child node is a node that is linked to by a parent node. Starting at an arbitrary node, each of the node's edges defines a transition to another node in the trie. In this case each edge in the trie is associated with a character or label. A branch is defined by a series of edges starting at the root. In the remaining text an edge is explicitly used to define a transition that links a parent node with its children. The transition function for the Aho-Corasick automaton is as follows: given an arbitrary node and an arbitrary character, move to the node pointed to by the edge labelled with said character. If there is no edge labelled with the character then use the failure function to determine the next node. For the Aho-Corasick automaton, given a current node and the current character, the failure function is used to define the state to go to next when the goto function returns fail. A failover node in the automaton is a node returned by the failure function. A failure link is a transition that joins a node to a failover node. If the failover node does not exist the failure link joins a node to the root node.</p>
<p>A failure link is a specialisation of an edge. A failure link is used to define a transition that links a node and its failover node.</p>
<p>Methodology for depth-wise construction of the Aho-Corasick String Matching Automaton with Normal Failure Links</p>
<p>The following is a basic methodology according to a simple embodiment of the invention.</p>
<p>To build the tree in depth-wise fashion, branches for each string are added one at a time as follows:</p>
<p>a) form the root;</p>
<p>b) starting from the root, take a character string and build it depth-wise from the root:</p>
<p>i) add a node at a time;</p>
<p>ii) set up the failure link for the node as follows:</p>
<p>I) find the longest suffix (including the current character) of the current branch that is also a prefix of another branch or of the current branch; if such a prefix exists then set the failure link to that location;</p>
<p>II) correct any corruption: the insertion of the new node into an existing automaton may corrupt the failure links of the existing nodes in the automaton; i) identify which of the existing nodes have been affected by the insertion of the new node; ii) for each affected node, taking the role of the current node, find the longest suffix (including the current character) of its branch that is also a prefix of another branch or of the current branch; if such a prefix exists then set the failure link to that location;</p>
<p>c) repeat steps i) and ii) for each node of the current string;</p>
<p>d) repeat steps b) and c) for each new branch inserted into the automaton.</p>
<p>In other words, the automaton (tree) is built up depth-wise, i.e. a string at a time. This is a fundamental difference from the prior art, where the tree is built up breadth-wise, i.e. a level at a time. After the insertion of each node (after the first string has been added) failure links are determined by looking at suffixes of the branch that match prefixes of other already inserted branches. Corruption may occur: for example, the currently inserted branch may have prefixes which are better failure targets for nodes in other branches already inserted. This is dealt with by the correction step.</p>
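<p>The depth-wise methodology can be illustrated with a deliberately naive sketch. It names every node by the string spelled from the root (the root being the empty string) and finds the affected nodes by scanning all existing nodes, which is an assumption made for brevity; how to identify them efficiently is the subject of the later embodiments. build_depth_first and longest_suffix are hypothetical names.</p>

```python
def build_depth_first(strings):
    """Insert strings one at a time, fixing failure links after every new node.

    Returns {path: failure_path}; a node is named by the string spelled from
    the root to it, and the root is the empty string.
    """
    nodes = {""}       # every node, identified by its path from the root
    failure = {}       # path -> path of its failover node

    def longest_suffix(path):
        # Longest proper suffix of `path` that is also a prefix of some branch.
        for start in range(1, len(path) + 1):
            if path[start:] in nodes:
                return path[start:]

    for string in strings:
        for end in range(1, len(string) + 1):
            new = string[:end]
            if new in nodes:          # node already exists, nothing to add
                continue
            nodes.add(new)
            failure[new] = longest_suffix(new)
            # Correct corruption: an existing node is affected if the new path
            # is a longer suffix of its path than its current failover node.
            for old in nodes:
                if old != new and old.endswith(new) and len(failure[old]) < len(new):
                    failure[old] = new
    return failure


links = build_depth_first(["BAB", "ABA"])
print(links["BA"], links["BAB"])   # A AB
```

<p>In this run the nodes BA and BAB first receive failure links to the root and to B, and are then corrected to A and AB once the branch ABA is inserted.</p>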
<p>Example 1</p>
<p>The algorithm processes each string x by successively reading the characters in x and applying the build function b. The build function maps a pair consisting of a node and an input character into a node or the message null. Figure 5a represents the build function. The edge labelled B from 0 to 1 indicates that b(0, B) = 1. The absence of an arrow indicates null. Thus, b(1, a) = null for all input characters a that are not A. The action of the build function means that the branch is inserted in a depth-wise fashion into the automaton. A build cycle is defined as follows. Let s be the current node and a the current character of the string x. The following is a detailed algorithm of the general methodology:</p>
<p>1. If b(s, a) = s', the algorithm makes a build transition. It enters node s', and the next character of x becomes the current input character. The build cycle is now complete.</p>
<p>2. If b(s, a) = null:</p>
<p>a. add a node s' to the automaton;</p>
<p>b. add an edge to s and set its label to a;</p>
<p>c. set the edge to reference s';</p>
<p>d. set up the failure link for s' as follows: find the longest suffix (including the current character) of the current branch that is also a prefix of another branch or of the current branch; if such a prefix exists then set the failure link to that location, otherwise set the failure link to the root node;</p>
<p>e. correct the corruption: the insertion of s' into an existing automaton may corrupt the failure links of the existing nodes in the automaton; the corruption can be corrected in two phases:</p>
<p>i. identify which of the existing nodes have been affected by the insertion of the new node;</p>
<p>ii. re-apply step d) to each of the affected nodes in a depth first manner as follows: 1. form a string from the concatenation of the edge labels leading from the root to the corrupt node; 2. re-apply step d) with the corrupt node taking the role of the current node;</p>
<p>f. enter the node s'; the next character of x becomes the current input character;</p>
<p>g. the build cycle is now complete.</p>
<p>The operations performed in step 2 are illustrated in figures 5 b) to g) for the case where the character B of the string AB is inserted. Within the figures, failure links are shown by reference numeral 51.</p>
<p>Initially, the current node of the algorithm is the root node and the first character of x is the current input character. The machine then processes x by making one build cycle on each character of x. When the final character of x is reached a sentinel node is created to mark the end of the branch.</p>
<p>Repeat the above methodology for each new branch inserted into the automaton.</p>
<p>Step e) can be achieved preferably by a further preferred embodiment described in detail later called the "suffix trie" method.</p>
<p>Methodology for depth-wise construction of the Aho-Corasick String Matching Automaton with Extended Failure Links</p>
<p>The extended failure link construction algorithm is identical to that of the normal failure link construction algorithm with the following amendments to step b) ii) of the general algorithm for normal links, such that step b) ii) becomes:</p>
<p>a) set up the failure link for the node as follows:</p>
<p>i) find the longest suffix of the inserted path that matches a prefix of an existing or the current path (this is the same as step e); if the prefix exists, then for each edge emerging from the node that represents the prefix, create a corresponding edge in the inserted node that leads to the same destination;</p>
<p>ii) take the next shortest suffix and repeat the previous step, taking care not to overwrite any edges that were created in the previous step;</p>
<p>iii) continue until all the suffixes have been exhausted, including the empty suffix;</p>
<p>b) correct the corruption: the insertion of the new node into an existing automaton may corrupt the failure links of the existing nodes in the automaton; the corruption can be corrected in two phases:</p>
<p>i) identify which of the existing nodes have been affected by the insertion of the new node;</p>
<p>ii) re-apply step a) to each of the affected nodes in a depth first manner by forming a string from the concatenation of the edge labels leading from the root to one of the corrupt nodes;</p>
<p>iii) re-apply step a) with the corrupt node taking the role of the current node;</p>
<p>iv) however, in this case an additional constraint exists: if there is already an edge emerging from the corrupt node that is labelled with the same character as one of the edges emerging from the prefix node, then if said edge leads to a child of the corrupt node do not overwrite it, otherwise set the edge to point at the destination referenced from the prefix node.</p>
<p>Example 2</p>
<p>The detailed extended failure link construction algorithm is identical to that of the normal failure link construction algorithm with the following amendments to steps d) and e):</p>
<p>b) i) set up the failure link for the node as follows:</p>
<p>ii) find the longest suffix of the inserted path that matches a prefix of an existing or the current path (this is the same as step e); if the prefix exists, then for each edge emerging from the node that represents the prefix, create a failure link labelled with the same character in s' that leads to the same destination;</p>
<p>iii) take the next shortest suffix and repeat the previous step, taking care not to overwrite any failure links that were created in the previous step;</p>
<p>iv) continue until all the suffixes have been exhausted, including the empty suffix.</p>
<p>c) correct the corruption: the insertion of s' into an existing automaton may corrupt the failure links of the existing nodes in the automaton:</p>
<p>i) identify which of the existing nodes have been affected by the insertion of the new node;</p>
<p>ii) re-apply step a) to each of the affected nodes in a depth first manner by forming a string from the concatenation of the edge labels leading from the root to one of the corrupt nodes; and</p>
<p>iii) re-apply step a) with the corrupt node taking the role of the current node;</p>
<p>iv) however, in this case an additional constraint exists: if the corrupt node contains an edge that is labelled with the same character as one of the edges contained in the prefix node, then if said edge leads to a child of the corrupt node do not create a failure link for it. Otherwise set the failure link labelled with the same character in the corrupt node to the same destination referenced from the prefix node.</p>
<p>The operations performed are illustrated in figure 6 for the case where the character B of the string AB is inserted. Within the figure, failure links are shown by reference numeral 61. Note that in this case the insertion of the new node does not corrupt the failure links of the existing nodes.</p>
<p>Preferred embodiment - Use of suffix trie to update corrupted links</p>
<p>To insert a new branch into an automaton the algorithm must map the suffixes of the new branch to the prefixes of the existing branches. Then to remove the corruption in the existing branches, the algorithm must map the suffixes of the existing branches to the prefixes of the new branch.</p>
<p>The mapping of the new branch to the existing branches can be performed as the nodes for the new branch are inserted. Thus, once a new branch is created the algorithm needs to find the suffixes of the existing branches that map to a prefix of the latest branch. This can be achieved by forming a suffix trie from the previous branches. The suffix trie is then searched using the inserted branch. The search in the suffix trie can be used to identify the affected nodes in both the normal and extended cases.</p>
<p>Figure 7a shows an example: the normal automaton for the strings BAB and ABA, obtained by inserting the string ABA into an existing normal automaton containing the branch BAB. Figure 7b shows the suffix trie for the existing branch BAB before the branch ABA was added. In the suffix trie the nodes marked with the letter x are suffixes of the branch BAB.</p>
<p>After adding the branch ABA the algorithm can determine the nodes that need updating by searching the suffix trie with the string ABA. In the above case the search would follow the path in the suffix trie denoted by reference numeral 71 and would terminate on node 72. The search indicates that the suffix AB from BAB matches the prefix AB from ABA. Thus, the nodes in the suffix trie leading to the node 72 are the ones that need to be updated to remove the corruption.</p>
<p>To facilitate the algorithm the suffix trie must be constructed as the structure is built. Thus, after inserting a branch the set of suffixes for that string would be added to the suffix trie to be used in later updates. Note that this methodology can be used on a per branch or a per node basis.</p>
<p>For the per node basis the suffix trie is traversed at the same time as traversing the automaton.</p>
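<p>The suffix trie search for the BAB and ABA example can be sketched as follows. The nested-dictionary representation and the names add_suffixes and longest_match are assumptions made for illustration, and the sketch works on a per branch basis only.</p>

```python
def add_suffixes(suffix_trie, branch):
    """Add every suffix of `branch` to a nested-dict suffix trie."""
    for start in range(len(branch)):
        node = suffix_trie
        for ch in branch[start:]:
            node = node.setdefault(ch, {})


def longest_match(suffix_trie, inserted):
    """Follow `inserted` through the suffix trie for as long as edges exist."""
    node, matched = suffix_trie, ""
    for ch in inserted:
        if ch not in node:
            break
        node = node[ch]
        matched += ch
    return matched


suffix_trie = {}
add_suffixes(suffix_trie, "BAB")             # the existing branch
print(longest_match(suffix_trie, "ABA"))     # AB
```

<p>The result AB agrees with the text: the suffix AB of BAB matches the prefix AB of ABA, so the nodes along that path are the ones whose failure links need updating.</p>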
<p>Further improved embodiment to be used with the suffix trie method - "edge index"</p>
<p>In a further preferred embodiment (when using the suffix trie method), an even more efficient way of setting up normal and extended failure links utilises an "edge index". Construction of the suffix trie is computationally expensive and consumes a significant amount of memory. The search on the suffix trie essentially provides the location of nodes 2 and 3. However, given the inserted branch, node 3 can be found using only the location of node 2. Node 3 can be found by simply traversing the existing normal automaton, using the inserted branch to select which edge to follow out of node 2. The traversal continues until either the inserted branch is exhausted or a valid edge out of the current node labelled with the next character from the inserted branch does not exist. With this simplification, as long as the location of node 2 is known, the location of the corrupted nodes can be determined using only node 2 and the inserted branch.</p>
<p>The important property of node 2 is that the edge leading to node 2 has the same label as the first character of the inserted branch. That is, the edge leading to node 2 is a single character prefix of the inserted branch. Each node in the tree has the potential to be the single character prefix of any branch subsequently inserted into the normal automaton.</p>
<p>Consequently, the location of all of the single character prefixes can be recorded as the automaton is built by creating an edge index which records the location of the nodes reached by all of the edges in the automaton. The set of edges associated with the first character of each string inserted into the automaton can be efficiently retrieved if the edge index is sorted according to the alphabet of the automaton. The edge index for the alphabet [A, B] is illustrated in figure 8 for the branch BAB.</p>
<p>The location of the start of the paths that need to be updated can now be found simply by looking up the first character of the inserted branch in the edge index. The nodes referenced by the edge index are the locations of the start nodes of each of the suffixes in the existing branches that form a prefix of the new branch. The update process starts at the indexed location and then traverses the automaton along an existing branch until the algorithm either runs out of characters for the inserted branch or an edge leading out of a node in the automaton labelled with the current branch character does not exist. The update process is performed for each item in the edge index list referenced with the initial character of the inserted branch. The items in the individual lists are accessed in depth first order. For each of the nodes along an update path the failure links are updated using the failure link algorithms discussed previously.</p>
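<p>The edge index lookup can be sketched as follows for the existing branch BAB and the inserted branch ABA. The class Trie, its attributes and the method update_paths are hypothetical names; nodes are numbered in order of creation, which is an assumption of this sketch, and update_paths is meant to be called before the new branch is inserted. The failure link updates themselves are omitted.</p>

```python
from collections import defaultdict


class Trie:
    def __init__(self):
        self.children = [{}]                   # children[node]: label -> child node
        self.edge_index = defaultdict(list)    # label -> nodes reached by such edges

    def insert(self, branch):
        node = 0
        for ch in branch:
            if ch not in self.children[node]:
                self.children.append({})
                child = len(self.children) - 1
                self.children[node][ch] = child
                self.edge_index[ch].append(child)   # record every new edge
            node = self.children[node][ch]

    def update_paths(self, branch):
        """Paths of existing nodes whose failure links the new branch may affect."""
        paths = []
        for start in self.edge_index.get(branch[0], []):
            path, node = [start], start
            for ch in branch[1:]:
                if ch not in self.children[node]:   # no matching edge: stop
                    break
                node = self.children[node][ch]
                path.append(node)
            paths.append(path)
        return paths


trie = Trie()
trie.insert("BAB")
print(dict(trie.edge_index))      # {'B': [1, 3], 'A': [2]}
print(trie.update_paths("ABA"))   # [[2, 3]]
```

<p>With this numbering the lookup of the first character A yields node 2, and following the inserted branch from there reaches node 3, after which no edge labelled A exists and the traversal stops.</p>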
<p>Further improved embodiment to be used with the suffix trie method: "suffix list method". The identification of the corrupted nodes can also be achieved by using a suffix list. Within each node of the automaton we create a list of suffixes that match the prefix formed by the path up to the current node (inclusive). An example of the suffix list is shown in figure 9.</p>
<p>The suffix list for the node reached by following the edge labelled B out of the root is shown by reference numeral 91. The suffix list for the node reached by following the edges B then A is shown by reference numeral 92. In general the nodes that could potentially be affected by the insertion of a new node are the children of the nodes that lie along the suffix list of the inserted node's parent. Thus, the corrupt nodes can be found by simply following the suffix list and determining which children are affected. The affected children are those whose edge label matches that of the inserted node. For example, the suffix list is used to identify potentially corrupt nodes as follows: Consider node 3: when node 3 is inserted its presence may modify the failure link of node 4.</p>
<p>Node 4 can be found by moving to the parent of node 3 (node 1) and following the first link in the suffix list to node 2. We then examine the edge label at node 2 and discover that the label matches the edge label leading to node 3. Consequently a suffix matching the prefix defined by node 3 exists at node 4. The failure link at node 4 can then be updated by re-applying step d.</p>
<p>In general for the normal automaton when a node is inserted the following steps are performed: a) first set the failure link of the inserted node.</p>
<p>b) then move to the parent of the inserted node and follow its suffix list; c) for each node on the suffix list determine whether there is an edge whose label matches that of the edge leading to the inserted node.</p>
<p>d) if there is a match then re-apply the failure link algorithm to the node pointed to by the edge.</p>
<p>For the extended automaton the procedure is identical, with the exception that we apply the extended-case failure link algorithm. In this case the suffix lists must be constructed as the data structure is built. However, as we are always using the parent's suffix list we can be sure that it is up to date when inserting the new node.</p>
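<p>Steps a) to d) for the normal automaton can be sketched as below. This is one possible reading of the suffix list, stated as an assumption: each node keeps a list of the nodes whose path ends with that node's own path. The names <code>Node</code>, <code>set_failure</code>, <code>insert_node</code> and <code>insert_string</code> are hypothetical, and the failure link is found by a deliberately naive lookup from the root rather than by any optimised method.</p>

```python
class Node:
    def __init__(self, text="", parent=None):
        self.text = text        # string spelled by the path from the root
        self.parent = parent
        self.edges = {}         # character -> child Node
        self.fail = None        # failure link (None only for the root)
        self.suffix_list = []   # assumed: nodes whose path ends with self.text


def find(root, text):
    """Follow `text` from the root; return the node reached, or None."""
    node = root
    for ch in text:
        node = node.edges.get(ch)
        if node is None:
            return None
    return node


def set_failure(root, node):
    """Point node.fail at the longest proper suffix that is a path in the trie."""
    for i in range(1, len(node.text) + 1):
        target = find(root, node.text[i:])
        if target is not None:
            node.fail = target
            return


def insert_node(root, parent, ch):
    node = Node(parent.text + ch, parent)
    parent.edges[ch] = node
    set_failure(root, node)                      # step a)
    f = node.fail
    while f is not None:                         # keep suffix lists up to date
        f.suffix_list.append(node)
        f = f.fail
    for other in parent.suffix_list:             # step b)
        child = other.edges.get(ch)              # step c)
        if child is not None and child is not node:
            set_failure(root, child)             # step d)
            node.suffix_list.append(child)
    return node


def insert_string(root, string):
    node = root
    for ch in string:
        nxt = node.edges.get(ch)
        node = nxt if nxt is not None else insert_node(root, node, ch)
    return node


root = Node()
insert_string(root, "ABAB")
insert_string(root, "BAB")
print(find(root, "ABAB").fail.text)
```

<p>In this sketch only the children reached through the parent's suffix list are revisited, so the rest of the automaton is left untouched when a node is added.</p>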
<p>Preferred Embodiment: use of a base automaton to speed up online insertion. Within the normal automaton most of the failure links simply lead back to the root of the automaton. Within the extended automaton a link is created for each character in the automaton's alphabet. However, in this case most of the failure links lead back to the set of states reached by following the edges out of the root. In both cases it can be seen that much of the effort used to construct the data structure is spent in setting the failure links to the states described above. This can prove to be a considerable overhead when updating the automata at runtime. Much of the effort used in constructing the automaton can be saved by assuming that the failure link leads to the initial state. This assumption is then only corrected by computing the correct destination during the build process. This effort can be avoided at runtime by creating a base automaton and a pool of nodes whose failure links are pre-configured offline. The base automaton contains the initial characters of all possible strings that can be created with the alphabet used for the set of strings. The base automaton consists of a root node, a set of nodes and a set of edges. For each character contained in the automaton's alphabet an edge is created which links the root node to a non-terminal node that represents the prefix formed by following the branch from the root labelled by that character. In the case of the normal automaton each node may have multiple edges that lead to other nodes on a branch and a single failure link. For the extended case each node may have multiple edges and multiple failure links. Between them the edges and failure links of the extended automaton will cover the alphabet of the automaton. Both the normal and the extended automata are pre-configured to a depth of one.</p>
<p>For the alphabet [A, B] the initialised normal automaton is shown in figure 10a. The extended base automaton is also initialised by setting up all the links for the first level of the data structure. For example, for the alphabet [A, B] the initialised extended automaton is shown in figure 10b. The set of nodes formed by nodes 1 and 2 are called the base nodes.</p>
<p>A pool of states is then created which are all pre-configured with links to the states of the base automaton. This pool of states is held on a stack which can be accessed at any point during the life of the automaton. If necessary the stack can grow and shrink with the memory requirements of the automaton. These states are then used to build the overall structure. When a new string is inserted there is no longer any need to link the states back to the initial states.</p>
<p>When constructing the trie these pre-configured links are simply overwritten with the links that lead along the path being inserted into the trie. The automaton is then constructed by inserting the strings into the structure in a depth first manner.</p>
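<p>A base automaton of depth one and a pool of pre-configured states might look like the following sketch for the normal automaton. The names <code>State</code>, <code>make_base</code>, <code>take_state</code> and <code>release_state</code> are assumptions made for illustration, and only the single failure link of the normal case is modelled.</p>

```python
class State:
    def __init__(self, root=None):
        self.edges = {}    # character -> State
        self.fail = root   # pre-configured: failure link leads to the root


def make_base(alphabet, pool_size):
    """Build the root, one base node per alphabet character, and a pool
    (a stack) of spare states whose failure links already point at the root."""
    root = State()
    for ch in alphabet:
        root.edges[ch] = State(root)
    pool = [State(root) for _ in range(pool_size)]
    return root, pool


def take_state(root, pool):
    """Pop a pre-configured state; grow on demand if the pool is empty."""
    return pool.pop() if pool else State(root)


def release_state(root, pool, state):
    """Reset a state to its pre-configured form and return it to the pool."""
    state.edges.clear()
    state.fail = root
    pool.append(state)


root, pool = make_base("AB", pool_size=4)
spare = take_state(root, pool)
root.edges["A"].edges["B"] = spare   # extend the pre-computed path for "A"
print(spare.fail is root, len(pool))
```

<p>Because every pooled state already points back at the root, an insertion only has to overwrite a link when the correct destination turns out to be somewhere else.</p>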
<p>Example 3</p>
<p>As can be seen, the base automaton shown in figure 11a consists of the states 0, 1 and 2 that are used to represent the alphabet. For each of the states the relationships established earlier are used to determine both the failure links and the extended failure links. Note both of these are pre-computed for the base automaton. Both sets of links are required so that the links for subsequent strings can be calculated. The pre-computed failure links are shown by reference numeral 101 and the pre-computed extended failure links are shown by reference numeral 102. For each string inserted we begin the insertion by adding nodes to the end of the appropriate pre-computed path in the base automaton. For each character of the string being added to the automaton we apply the previous algorithms to calculate the failure links and the extended failure links. This process continues until the string is exhausted. As described previously there is no need to insert the failure links or extended failure links for the pre-computed nodes. However, if a state is found for which the failure link does not lead to the root or the extended failure link does not lead to a state in the set Q, the link is overwritten with a reference to the correct state. This is the case for state 4 where the re-routed failure link is shown by reference numeral 103 and the re-routed extended failure link is shown by reference numeral 104. A similar process is used to insert subsequent strings. The state of the automaton after the insertion of the string ABAB is shown below, where for clarity the extended failure links have been omitted. Note these links were created during the pre-computation stage. As can be seen the action of the build process described so far successfully maps the inserted string onto the existing set of strings. However, in its current form some of the failure links and extended failure links have become corrupt.
As can be seen the failure link for state 4 should now point at state 5 and the extended failure link for state 4 labelled by 'A' should now point at state 6. This corruption can be fixed by re-applying the failure link and extended failure link algorithms to every node in the automaton. This would require a complete breadth first traversal of the automaton which would essentially recompile the structure. However, the cost of performing this operation on a large automaton is likely to be excessive.</p>
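<p>For comparison, the full recompilation mentioned above corresponds to the conventional breadth first construction of Aho-Corasick failure links. The sketch below shows that standard textbook procedure for the normal automaton only; it is not taken from this document, and the names <code>Node</code>, <code>build</code>, <code>find</code> and <code>recompute_failure_links</code> are illustrative assumptions.</p>

```python
from collections import deque


class Node:
    def __init__(self, text=""):
        self.text = text   # string spelled by the path from the root
        self.edges = {}    # character -> child Node
        self.fail = None


def build(words):
    """Insert every word into a plain trie (no failure links yet)."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            if ch not in node.edges:
                node.edges[ch] = Node(node.text + ch)
            node = node.edges[ch]
    return root


def find(root, text):
    node = root
    for ch in text:
        node = node.edges[ch]
    return node


def recompute_failure_links(root):
    """Visit every node in breadth first order and set its failure link."""
    root.fail = root
    queue = deque()
    for child in root.edges.values():
        child.fail = root            # depth-one nodes always fail to the root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.edges.items():
            f = node.fail
            # Walk back along failure links until an edge labelled ch exists.
            while f is not root and ch not in f.edges:
                f = f.fail
            child.fail = f.edges.get(ch, root)
            queue.append(child)


root = build(["ABAB", "BAB"])
recompute_failure_links(root)
print(find(root, "ABAB").fail.text)
```

<p>Every node is touched once per rebuild, which is the cost that a targeted update of only the affected states is meant to avoid.</p>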
<p>An alternative approach and the approach used in the current algorithm is to simply update the subset of states affected by the insertion of the new string. In order to do this we must be able to determine which nodes these are based on the string being inserted. This can be done with the aid of a suffix trie. The failure function makes a connection between the longest suffix of one path through the tree and the longest prefix of another. Consequently, to update the set of existing strings we must find all of the suffixes of the existing strings that form a prefix of the new string.</p>
<p>In practice this can be achieved by creating a suffix trie of prefixes for the existing set of strings. A suffix trie of prefixes contains the set of suffixes formed by taking successive prefixes of a string e.g. for the string ABAB the set of prefixes are A, AB, ABA, ABAB. Thus, the set of suffixes of the prefixes of ABAB are:</p>
<p>A: A, NULL</p>
<p>AB: AB, B, NULL</p>
<p>ABA: ABA, BA, A, NULL</p>
<p>ABAB: ABAB, BAB, AB, B, NULL</p>
<p>To find the set of nodes that need to be updated we simply search the suffix trie to find the set of states which form a prefix of the string being inserted. Thus, if the inserted string is BAB, the valid suffixes are B, BA and BAB. Within the suffix trie a reference is created to each of the nodes in the trie. These references are then used to find the states that require updating.</p>
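<p>The ABAB and BAB example can be checked with a few lines of code. This sketch works on plain strings instead of a suffix trie, and the function names <code>suffixes_of_prefixes</code> and <code>matching_suffixes</code> are assumptions chosen for illustration; the empty (NULL) suffix is omitted.</p>

```python
def suffixes_of_prefixes(string):
    """Map each prefix of `string` to its non-empty suffixes, longest first."""
    table = {}
    for end in range(1, len(string) + 1):
        prefix = string[:end]
        table[prefix] = [prefix[start:] for start in range(len(prefix))]
    return table


def matching_suffixes(existing, inserted):
    """Suffixes of prefixes of `existing` that form a prefix of `inserted`."""
    found = set()
    for suffixes in suffixes_of_prefixes(existing).values():
        for suffix in suffixes:
            if inserted.startswith(suffix):
                found.add(suffix)
    return sorted(found, key=len)


for prefix, suffixes in suffixes_of_prefixes("ABAB").items():
    print(prefix, suffixes)
print(matching_suffixes("ABAB", "BAB"))
```

<p>In a suffix trie the same answer is obtained by walking the inserted string from the root, and each matched suffix carries references to the states that need updating.</p>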
<p>The failure link and extended failure link algorithms are then reapplied at these states, which correctly updates the affected states without the need to recompile the entire automaton.</p>
<p>Thus, the combination of the pre-computed links to the states in the set Q and the mechanism for finding the subset of states that need to be recompiled allows the automaton to be updated without rebuilding the entire data structure. The suffix trie of prefixes formed from the existing states can be simplified as we only need to know the location of the first character of each suffix to enable the update of the subsequent nodes. The subsequent states can be found by simply following the success transitions in the trie until there is a mismatch between the path in the trie and a character in the new string. Consequently, the structure can be updated by forming a state reference table in which we store a reference to each of the shortest possible suffixes of the prefixes of a string. For the example above the shortest suffixes of the prefixes of ABAB are A, B, A and B. This amounts to creating a table to store a reference to the location of each character in the trie. When a new string is inserted the first character in the string is used to look up the list of states which can be reached by following a success transition labelled with that character from another state. For each item in this list the failure link and extended failure link algorithms are then reapplied while there is a match between the string being inserted and the success path through the trie. The use of this table effectively compresses the suffix trie to the minimum number of nodes required to make the updates. The simplicity of the table also means that it can be easily constructed as the existing trie structure is built.</p>

Claims (1)

  1. <p>Claims 1. A method of constructing an Aho-Corasick tree characterised
    wherein the tree is constructed in a general depth first manner, one string at a time.</p>
    <p>2. A method as claimed in claim 1 wherein during construction of said string, after each node is added, failure links or extended failure links from that node to other nodes are added.</p>
    <p>3. A method as claimed in claim 1 wherein failure links to nodes in the newly inserted branch are re-updated.</p>
    <p>4. A method as claimed in claim 4 wherein only those links to the nodes in the new branch which need updating are identified.</p>
    <p>5. A method as claimed in claims 1 to 4 comprising the steps of: a) forming the root, b) starting from the root, take a character string and build it depth-wise from the root; i) adding a node at a time; ii) set up the failure link for the node as follows: I) find the longest suffix (including the current character) of the current branch that is also a prefix of another or the current branch; if such a prefix exists then set the failure link to that location.</p>
    <p>II) correct any corruption by: i) identifying which of the existing nodes has been affected by the insertion of the new node; ii) finding the longest suffix (including the current character) of the current branch that is also a prefix of another or the current branch; if such a prefix exists then set the failure link to that location; c) repeat steps i) and ii) for each node of current string; d) repeat steps b) and c) for each new string inserted into the automaton.</p>
    <p>6. A method as claimed in claim 6 wherein step b) ii) I) comprises the following steps: s) find the longest suffix of the inserted path that matches a prefix of an existing or the current path; if the prefix exists, then for each edge emerging from the node that represents the prefix, create a corresponding edge in the inserted node that leads to the same destination.</p>
    <p>t) take the next shortest suffix and repeat step s) without overwriting any edges that were created in the previous step; u) continue until all the suffixes have been exhausted including the empty suffix; and step b) ii) II) correct the corruption comprises: v) identify which of the existing nodes has been affected by the insertion of the new node; w) re-apply the steps s), t) and u) to each of the affected nodes in a depth first manner by forming a string from the concatenation of the edge labels leading from the root to one of the corrupt nodes; x) re-apply steps s), t) and u) with the corrupt node taking the role of the current node; y) if there is already an edge emerging from the corrupt node that is labelled with the same character as one of the edges emerging from the prefix node then if said edge leads to a child of the corrupt node do not overwrite it, otherwise set the edge to point at the destination referenced from the prefix node.</p>
    <p>7. A method as claimed in any preceding claim including the step of creating a base automaton after step a) of claim 1 having preconfigured failure links.</p>
    <p>8. A method as claimed in any preceding claim using a suffix tree/trie in one or more steps.</p>
    <p>9. A method as claimed in any preceding claim using a suffix list in one or more steps.</p>
    <p>10. A method as claimed in any preceding claim using an edge index in one or more steps.</p>
GB0608420A 2006-04-28 2006-04-28 Constructing Aho Corasick trees Withdrawn GB2437560A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB0608420A GB2437560A (en) 2006-04-28 2006-04-28 Constructing Aho Corasick trees
US11/783,201 US7769788B2 (en) 2006-04-28 2007-04-06 Aho-Corasick methodology for string searching

Publications (2)

Publication Number Publication Date
GB0608420D0 GB0608420D0 (en) 2006-06-07
GB2437560A true GB2437560A (en) 2007-10-31

Family

ID=36589977

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0608420A Withdrawn GB2437560A (en) 2006-04-28 2006-04-28 Constructing Aho Corasick trees

Country Status (2)

Country Link
US (1) US7769788B2 (en)
GB (1) GB2437560A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8260799B2 (en) 2008-03-31 2012-09-04 Huawei Technologies Co., Ltd. Method and apparatus for creating pattern matching state machine and identifying pattern
US8583961B2 (en) 2008-02-01 2013-11-12 Huawei Technologies Co., Ltd. Method and device for creating pattern matching state machine

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7634500B1 (en) 2003-11-03 2009-12-15 Netlogic Microsystems, Inc. Multiple string searching using content addressable memory
US7353332B2 (en) * 2005-10-11 2008-04-01 Integrated Device Technology, Inc. Switching circuit implementing variable string matching
US8010481B2 (en) * 2006-03-07 2011-08-30 The Regents Of The University Of California Pattern matching technique for high throughput network processing
US7783654B1 (en) * 2006-09-19 2010-08-24 Netlogic Microsystems, Inc. Multiple string searching using content addressable memory
JP5224953B2 (en) * 2008-07-17 2013-07-03 インターナショナル・ビジネス・マシーンズ・コーポレーション Information processing apparatus, information processing method, and program
JP5728979B2 (en) * 2011-02-02 2015-06-03 富士通株式会社 Information processing apparatus, software inspection method, and software inspection program
US9455996B2 (en) * 2011-10-03 2016-09-27 New York University Generating progressively a perfect hash data structure, such as a multi-dimensional perfect hash data structure, and using the generated data structure for high-speed string matching
JP5958088B2 (en) * 2012-05-29 2016-07-27 富士通株式会社 Update program, update method, and update apparatus
CN102867036B (en) * 2012-08-29 2015-03-04 北京工业大学 Improved method for dynamic generation of data structure for Aho-Corasick algorithm
US9268749B2 (en) * 2013-10-07 2016-02-23 Xerox Corporation Incremental computation of repeats
CN103699593A (en) * 2013-12-11 2014-04-02 中国科学院深圳先进技术研究院 Method and system for rapidly traversing generalized suffix tree
EP3087510A1 (en) 2013-12-23 2016-11-02 British Telecommunications Public Limited Company Improved pattern matching machine for repeating symbols
US10423667B2 (en) * 2013-12-23 2019-09-24 British Telecommunications Plc Pattern matching machine
EP3087509A1 (en) 2013-12-23 2016-11-02 British Telecommunications Public Limited Company Improved pattern matching machine with mapping table
US20150324457A1 (en) * 2014-05-09 2015-11-12 Dell Products, Lp Ordering a Set of Regular Expressions for Matching Against a String
CN110222143B (en) * 2019-05-31 2022-11-04 北京小米移动软件有限公司 Character string matching method, device, storage medium and electronic equipment
WO2021236052A1 (en) * 2020-05-18 2021-11-25 Google Llc Inference methods for word or wordpiece tokenization
KR102271489B1 (en) * 2020-12-04 2021-07-02 (주)소만사 Apparatus and method of constructing Aho-Corasick automata for detecting regular expression pattern

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004097643A2 (en) * 2003-04-29 2004-11-11 University Of Strathclyde Monitoring software

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09198398A (en) * 1996-01-16 1997-07-31 Fujitsu Ltd Pattern retrieving device
JP4155382B2 (en) * 2001-01-25 2008-09-24 富士通株式会社 PATTERN SEARCH METHOD, PATTERN SEARCH DEVICE, COMPUTER-READABLE RECORDING MEDIUM CONTAINING PATTERN SEARCH PROGRAM, PATTERN SEARCH SYSTEM, AND PATTERN SEARCH PROGRAM
US7539681B2 (en) * 2004-07-26 2009-05-26 Sourcefire, Inc. Methods and systems for multi-pattern searching
ATE470303T1 (en) * 2005-04-20 2010-06-15 Ibm DEVICE AND METHOD FOR PATTERN DETECTION
US8010481B2 (en) * 2006-03-07 2011-08-30 The Regents Of The University Of California Pattern matching technique for high throughput network processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Biosequence Algorithm, Spring 2005 Lecture 4: Set matching and Aho-Corasick Algorithm, Pekka Kilpeläinen *
Construction of Aho Corasick automaton in Linear time for Integer Alphabets, Shiri Dori & Gad Landau *

Also Published As

Publication number Publication date
US20070282835A1 (en) 2007-12-06
GB0608420D0 (en) 2006-06-07
US7769788B2 (en) 2010-08-03

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)