CN102867036A - Improved method for dynamic generation of data structure for Aho-Corasick algorithm - Google Patents
- Publication number
- CN102867036A
- Authority
- CN
- China
- Prior art keywords
- node
- character
- nodes
- dfa
- failure
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses an improved method for dynamically generating the data structure used by the Aho-Corasick algorithm. The method supports adding and deleting feature strings. To add a string, it is split into single characters and matched against the deterministic finite automaton (DFA); where a character has no node, a node is added at that position. The new node's data are set and the failure target of its parent node is checked. If that does not yield a failure target, the first character of the node's string is dropped repeatedly and the remaining string is matched against the DFA until a node is found. The failure ownership set of the failure target is then traversed to determine whether any node in it should instead take the new node as its failure target. If no failure target is found, the node is added to the character table object set in the DFA header. To delete a string, its characters are removed one at a time from back to front, locating the corresponding node at each step. The data structure is thus maintained dynamically, which facilitates multi-pattern matching over a large number of continuously changing strings within a short time.
Description
Technical field
The invention belongs to the field of computer theory. It provides a dynamically extendable and reducible Aho-Corasick tree data structure for the Aho-Corasick multi-pattern matching algorithm.
Background technology
With the rapid development of information technology, and especially in large-scale data processing, fast retrieval of key fields has become an increasingly prominent problem. In the Web 2.0 era in particular, traversing or searching massive volumes of real-time data is a routine operation. Processing such volumes often requires searching for many different strings simultaneously, that is, multi-pattern matching, which calls for the Aho-Corasick algorithm. The algorithm has one drawback, however. As an automaton algorithm, it relies on a tree data structure generated in advance from the full set of feature strings. If a feature string must be added or deleted at run time, matching has to be interrupted, the existing data structure discarded, and a new one generated. If the new string set is large, this step takes considerable time, during which incoming data cannot be processed promptly. An algorithm is therefore needed that both supports multi-pattern matching and completes the reorganization of the data structure within a bounded time.
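The rebuild problem described above can be illustrated with a conventional static construction. The following is a minimal Python sketch of the textbook breadth-first build, not code from the patent; the names `build_automaton` and `search` are illustrative assumptions. Because failure links are computed in one pass over the whole pattern set, adding a single pattern means calling the builder again on the full set.

```python
from collections import deque

def build_automaton(patterns):
    """Build a static Aho-Corasick automaton (textbook construction, a sketch)."""
    goto = [{}]      # goto[state][char] -> next state; state 0 is the root
    fail = [0]       # failure link of each state
    out = [set()]    # patterns recognised on reaching each state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    queue = deque(goto[0].values())   # depth-1 states keep fail = root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]    # inherit matches ending at the suffix
            queue.append(t)
    return goto, fail, out

def search(text, automaton):
    """Return sorted (start_index, pattern) pairs for every match in text."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)

patterns = ["he", "she", "his", "hers"]
automaton = build_automaton(patterns)
# Adding one pattern in this static scheme requires a complete rebuild:
automaton = build_automaton(patterns + ["us"])
```

In this static scheme the cost of every change is proportional to the total size of the pattern set, which is the delay the invention aims to avoid.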
Summary of the invention
The object of the invention is to address the above deficiency by providing an improved method for dynamically generating the data structure of the Aho-Corasick algorithm. The method maintains this data structure dynamically and makes it convenient to perform multi-pattern matching against a large number of continuously changing strings.
The invention is realized by the following technical means:
An improved method for dynamically generating the data structure of the Aho-Corasick algorithm comprises operations for adding and deleting feature strings. Adding a feature string comprises the following steps:
Step 1: Split the feature string into single characters and match them one at a time against the DFA tree; where the DFA has no node for a character, add a node at that position of the DFA.
Step 2: Set the corresponding data on the new node, then check whether the failure target of the parent node has a child for the new node's character. If so, set that child as the failure target of the new node and go to step 5; if not, go to step 3.
Step 3: Repeat step 4. If a node is found, stop; if not, drop one more character from the string and repeat step 4. If no match has been found by the time only the last character remains, go to step 6.
Step 4: Drop the first character of the node's string and match the remaining string against the DFA. If a matching node is found, take it as the failure target, perform step 5, and return to step 3; if none is found, likewise return to step 3.
Step 5: Find the failure ownership set of the failure target and traverse the node references in it, checking whether any of those nodes should take the new node as its failure target; if so, set it.
Step 6: Add the node to the character table object set in the DFA header. If a character table object exists for the node's character, add a pointer to the node to that object and traverse the node references in it, checking whether any of them should take the new node as its failure target; if so, set it.
Deleting a feature string comprises the following steps:
Step 7: Remove the string one character at a time, working from back to front, repeating step 8 until step 8 no longer returns.
Step 8: Find the corresponding node; if the node has no children, delete it and return to step 7.
Compared with the prior art, the invention has the following evident advantages and beneficial effects:
The improved method maintains the Aho-Corasick data structure dynamically and makes it convenient to perform multi-pattern matching, within a short time, against a large number of continuously changing strings. It both supports multi-pattern matching and completes the reorganization of the data structure within a bounded time.
Description of drawings
Fig. 1 is a flowchart of dynamically adding a string;
Fig. 2 is a flowchart of dynamically deleting a string.
Embodiment
Specific embodiments of the invention are described below with reference to the accompanying drawings.
Data definitions of the technical scheme: the data structures used and the definitions of their components.
Definition 1: A node object comprises: 1) the character the node represents; 2) the string spelled on the path from the root node to this node; 3) the set of references to the node's children; 4) the set of references to all nodes that take this node as their failure target (hereinafter the "failure ownership set"); 5) the depth of the node relative to the root node; 6) a reference to the node's parent node; 7) a reference to the node's failure target node; 8) a flag marking whether the node is the end of a string.
Definition 2: The header object comprises: 1) the set of character table objects.
Definition 3: The root node object comprises: 1) the general information of an ordinary node object; 2) an empty node character; 3) a failure target that points to the root itself.
Definition 4: A character table object comprises: 1) its character; 2) a block mark; 3) the set of all node objects whose character equals this character and whose failure target is the root node (hereinafter the "failure ownership set" of the character table object).
Definition 5: The Aho-Corasick data-structure tree object comprises: 1) the root node object; 2) the header object; 3) the set of references to all nodes.
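Definitions 1 to 5 map naturally onto a handful of record types. The following Python sketch is one possible rendering under stated assumptions: all class and field names (`Node`, `CharTable`, `Header`, `Tree`, `fail_owners`, `blocked`, and so on) are hypothetical, the text does not explain what the block mark is used for, and including the root in the set of all nodes is a choice made here rather than something the text specifies.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

@dataclass(eq=False)  # eq=False keeps identity hashing so nodes can live in sets
class Node:
    char: str = ""                                           # 1) node character
    string: str = ""                                         # 2) path from root
    children: Dict[str, Node] = field(default_factory=dict, repr=False)   # 3)
    fail_owners: Set[Node] = field(default_factory=set, repr=False)       # 4)
    depth: int = 0                                           # 5) depth from root
    parent: Optional[Node] = field(default=None, repr=False)              # 6)
    fail: Optional[Node] = field(default=None, repr=False)                # 7)
    is_end: bool = False                                     # 8) end-of-string mark

@dataclass(eq=False)
class CharTable:
    char: str                                                # 1) table character
    blocked: bool = False                                    # 2) block mark (purpose assumed)
    # 3) nodes with this character whose failure target is the root
    root_fail_owners: Set[Node] = field(default_factory=set, repr=False)

@dataclass(eq=False)
class Header:
    tables: Dict[str, CharTable] = field(default_factory=dict)

@dataclass(eq=False)
class Tree:
    root: Node
    header: Header = field(default_factory=Header)
    all_nodes: Set[Node] = field(default_factory=set, repr=False)

def new_tree() -> Tree:
    """Create an empty tree whose root fails to itself (Definition 3)."""
    root = Node()
    root.fail = root
    return Tree(root=root, all_nodes={root})
```

Keeping the inverse of the failure relation (`fail_owners`, plus the per-character tables for nodes that fail to the root) is what lets a later insertion or deletion find the affected nodes without scanning the whole tree.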
The algorithm is divided into two parts, adding and deleting:
Refer to Fig. 1, the flowchart of dynamically adding a string.
Action t1: Add a new keyword. Starting from the root node, match the string to be added one character at a time. If a child for the character exists, move to it; if not, create a new node (t2).
Action t2: Create a new node: 1) set the node's parent reference; 2) add a reference to the node to the parent's child set; 3) set the node's character; 4) set the node's string; 5) set the node's depth; 6) add a reference to the node to the tree's set of all nodes; 7) if the node is the end of the new string, set the end mark; 8) perform action t3 to set the failure targets related to this node.
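Actions t1 and t2 amount to an ordinary trie insertion with a hook for the failure-target work. The sketch below is a minimal Python illustration, assuming a simplified `Node` class and a caller-supplied `set_fail_target` callback that stands in for action t3; both names are hypothetical.

```python
class Node:
    """Simplified node: character, parent, path string, depth, children."""
    def __init__(self, char="", parent=None):
        self.char = char                                       # t2 item 3
        self.parent = parent                                   # t2 item 1
        self.string = parent.string + char if parent else ""   # t2 item 4
        self.depth = parent.depth + 1 if parent else 0         # t2 item 5
        self.children = {}
        self.fail = None
        self.fail_owners = set()
        self.is_end = False

def add_keyword(root, all_nodes, word, set_fail_target):
    """t1: walk the trie along word, creating missing nodes as in t2.

    Returns the list of newly created nodes, in creation order.
    """
    node, created = root, []
    for ch in word:
        child = node.children.get(ch)
        if child is None:
            child = Node(ch, node)
            node.children[ch] = child      # t2 item 2
            all_nodes.add(child)           # t2 item 6
            set_fail_target(child)         # t2 item 8 (hands over to t3)
            created.append(child)
        node = child
    node.is_end = True                     # t2 item 7: mark the end of the string
    return created
```

Setting the failure target immediately after each node is created, rather than after the whole string has been inserted, means the structure is consistent after every single node addition.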
Action t3: First look up the failure node of the parent node and search its child set for a child with the new node's character. If one exists, perform action t4; otherwise perform action t6.
Action t4: Take the node found as the failure target of the new node and traverse the failure ownership set of the found node, performing action t5 for each node in it. After the traversal, add the new node to that failure ownership set and set the new node's failure target to the found node.
Action t5: Check the depth of the traversed node. If it is less than or equal to the depth of the new node, move on to the next node. Otherwise check whether the trailing substring of the traversed node's string is identical to the new node's string. If so, remove the traversed node from the set being traversed, add it to the failure ownership set of the new node, and set its failure target to the new node.
Action t6: Take the trailing substring of the new node's string whose length is one less than the node's depth and match it against the tree. If no corresponding node exists, shorten the substring by one character and match again. If a substring matches, take the final matched node and perform action t4. If no match is found even for the last character, point the failure target at the root node and perform action t7.
Action t7: The failure target points to the root node. Check whether the tree's character table set contains a character table object for the new node's character. If so, traverse the failure ownership set of that character table object, performing action t5 for each node in it.
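Actions t3 to t7 can be sketched in Python as follows. This is an illustrative reading of the text rather than the patentee's code: the names `find`, `adopt`, `set_fail_target`, `add_keyword`, and `char_tables` are assumptions, the character table is reduced to a plain dict from character to a set of nodes, and a new node directly under the root is treated as failing to the root (the text leaves that case implicit).

```python
class Node:
    def __init__(self, char="", parent=None):
        self.char, self.parent = char, parent
        self.string = parent.string + char if parent else ""
        self.depth = parent.depth + 1 if parent else 0
        self.children, self.fail, self.fail_owners = {}, None, set()

def find(root, s):
    """Return the node spelling s from the root, or None if absent."""
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return None
    return node

def adopt(new, candidates):
    """t5: re-point deeper nodes whose string ends with new.string at new."""
    for other in list(candidates):
        if other.depth > new.depth and other.string.endswith(new.string):
            candidates.discard(other)
            other.fail = new
            new.fail_owners.add(other)

def set_fail_target(root, char_tables, new):
    target = None
    if new.parent is not root:
        # t3: try the matching child of the parent's failure node first.
        target = new.parent.fail.children.get(new.char)
        # t6: otherwise try ever shorter proper suffixes of the node's string.
        k = new.depth - 1
        while target is None and k > 0:
            target = find(root, new.string[-k:])
            k -= 1
    if target is not None:                       # t4
        adopt(new, target.fail_owners)
        new.fail = target
        target.fail_owners.add(new)
    else:                                        # t7: failure target is the root
        table = char_tables.setdefault(new.char, set())
        adopt(new, table)
        new.fail = root
        table.add(new)

def add_keyword(root, char_tables, word):
    """Insert word, fixing failure targets after each newly created node."""
    node = root
    for ch in word:
        child = node.children.get(ch)
        if child is None:
            child = Node(ch, node)
            node.children[ch] = child
            set_fail_target(root, char_tables, child)
        node = child
    return node

root = Node()
root.fail = root
char_tables = {}
for word in ["ushers", "she", "he", "hers"]:
    add_keyword(root, char_tables, word)
```

Only the failure ownership set of the new node's own failure target needs to be inspected: any node that should now fail to the new node previously failed to the longest shorter suffix present in the tree, which is exactly the new node's failure target.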
Refer to Fig. 2, the flowchart of dynamically deleting a string.
Action t8: Remove an existing string, which in effect removes a chain of data nodes. Match the string in the tree to obtain its last node, then perform action t9.
Action t9: Set the variable "origin node" to the current node, move the current node to its parent node, and perform action t10; repeat until the current node is the root node.
Action t10: If the node has no child nodes, traverse its failure ownership set and change the failure target of each object in it to the failure target of this node. Then add all references in this node's failure ownership set to the failure ownership set of this node's failure target node; if this node's failure target is the root node, add them instead to the failure ownership set of the corresponding character table object. Finally, delete the reference to this node from the child set of its parent node and delete the node.
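The deletion path (t8 to t10) can be sketched in Python as below. Several details are assumptions that the text does not state: the sketch clears the end mark first, stops climbing at a node that still has children or is the end of another string, and also removes the deleted node from its own failure target's ownership set. The `build` helper computes failure targets by brute force purely to set up the example; all names are hypothetical.

```python
class Node:
    def __init__(self, char="", parent=None):
        self.char, self.parent = char, parent
        self.string = parent.string + char if parent else ""
        self.depth = parent.depth + 1 if parent else 0
        self.children, self.fail, self.fail_owners = {}, None, set()
        self.is_end = False

def find(root, s):
    """Return the node spelling s from the root, or None if absent."""
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return None
    return node

def longest_proper_suffix(root, s):
    """Brute-force failure target: longest proper suffix of s in the tree."""
    for k in range(1, len(s)):
        node = find(root, s[k:])
        if node is not None:
            return node
    return root

def build(words):
    """Example scaffolding only: build the tree and all failure bookkeeping."""
    root, char_tables, nodes = Node(), {}, []
    root.fail = root
    for w in words:
        node = root
        for ch in w:
            if ch not in node.children:
                node.children[ch] = Node(ch, node)
                nodes.append(node.children[ch])
            node = node.children[ch]
        node.is_end = True
    for n in nodes:
        n.fail = longest_proper_suffix(root, n.string)
        home = (char_tables.setdefault(n.char, set())
                if n.fail is root else n.fail.fail_owners)
        home.add(n)
    return root, char_tables

def remove_keyword(root, char_tables, word):
    node = find(root, word)                          # t8: locate the last node
    if node is None or not node.is_end:
        return False
    node.is_end = False
    # t9/t10: climb towards the root, deleting nodes that are no longer needed.
    while node is not root and not node.children and not node.is_end:
        target = node.fail
        home = (char_tables.setdefault(node.char, set())
                if target is root else target.fail_owners)
        home.discard(node)                           # assumed bookkeeping step
        for owner in node.fail_owners:               # hand owners to the target
            owner.fail = target
            home.add(owner)
        del node.parent.children[node.char]
        node = node.parent
    return True

root, char_tables = build(["she", "he", "hers", "ushers"])
remove_keyword(root, char_tables, "hers")   # removes nodes "hers" and "her" only
```

Handing a deleted node's owners to its failure target is sound because the next-longest suffix of each owner's string that remains in the tree is the deleted node's own failure target.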
Claims (1)
1. An improved method for dynamically generating the data structure of the Aho-Corasick algorithm, comprising operations for adding and deleting feature strings, characterized in that adding a feature string comprises the following steps:
Step 1: Split the feature string into single characters and match them one at a time against the DFA tree; where the DFA has no node for a character, add a node at that position of the DFA.
Step 2: Set the corresponding data on the new node, then check whether the failure target of the parent node has a child for the new node's character. If so, set that child as the failure target of the new node and go to step 5; if not, go to step 3.
Step 3: Repeat step 4. If a node is found, stop; if not, drop one more character from the string and repeat step 4. If no match has been found by the time only the last character remains, go to step 6.
Step 4: Drop the first character of the node's string and match the remaining string against the DFA. If a matching node is found, take it as the failure target, perform step 5, and return to step 3; if none is found, likewise return to step 3.
Step 5: Find the failure ownership set of the failure target and traverse the node references in it, checking whether any of those nodes should take the new node as its failure target; if so, set it.
Step 6: Add the node to the character table object set in the DFA header. If a character table object exists for the node's character, add a pointer to the node to that object and traverse the node references in it, checking whether any of them should take the new node as its failure target; if so, set it.
and deleting a feature string comprises the following steps:
Step 7: Remove the string one character at a time, working from back to front, repeating step 8 until step 8 no longer returns.
Step 8: Find the corresponding node; if the node has no children, delete it and return to step 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210312478.9A CN102867036B (en) | 2012-08-29 | 2012-08-29 | Improved method for dynamic generation of data structure for Aho-Corasick algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210312478.9A CN102867036B (en) | 2012-08-29 | 2012-08-29 | Improved method for dynamic generation of data structure for Aho-Corasick algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102867036A true CN102867036A (en) | 2013-01-09 |
CN102867036B CN102867036B (en) | 2015-03-04 |
Family
ID=47445905
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210312478.9A Expired - Fee Related CN102867036B (en) | 2012-08-29 | 2012-08-29 | Improved method for dynamic generation of data structure for Aho-Corasick algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102867036B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070282835A1 (en) * | 2006-04-28 | 2007-12-06 | Roke Manor Research Limited | Aho-corasick methodology for string searching |
US20080046423A1 (en) * | 2006-08-01 | 2008-02-21 | Lucent Technologies Inc. | Method and system for multi-character multi-pattern pattern matching |
CN101551803A (en) * | 2008-03-31 | 2009-10-07 | 华为技术有限公司 | Method and device for establishing pattern matching state machine and pattern recognition |
CN101556619A (en) * | 2009-05-04 | 2009-10-14 | 成都市华为赛门铁克科技有限公司 | Node compression method and device thereof and multimode matching method and device thereof |
Non-Patent Citations (1)
Title |
---|
王杰 等: "一种快速高效的模式匹配算法的应用研究", 《计算机工程与应用》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106067039A (en) * | 2016-05-30 | 2016-11-02 | 桂林电子科技大学 | Method for mode matching based on decision tree beta pruning |
CN106067039B (en) * | 2016-05-30 | 2019-01-29 | 桂林电子科技大学 | Method for mode matching based on decision tree beta pruning |
CN107885492A (en) * | 2017-11-14 | 2018-04-06 | 中国银行股份有限公司 | The method and device of data structure dynamic generation in main frame |
Also Published As
Publication number | Publication date |
---|---|
CN102867036B (en) | 2015-03-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20150304 Termination date: 20170829 |