CN102867036B - Improved method for dynamic generation of data structure for Aho-Corasick algorithm - Google Patents
- Publication number
- CN102867036B (application CN201210312478.9A)
- Authority
- CN
- China
- Prior art keywords
- node
- character
- nodes
- dfa
- failure target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses an improved method for dynamically generating the data structure used by the Aho-Corasick algorithm. The method covers adding and deleting feature strings. To add a feature string, the string is split into single characters and matched against the deterministic finite automaton (DFA), and a node is added at each position where the DFA has no corresponding character. The corresponding data is set on each new node, and the failure target of its parent node is checked. If that check finds nothing, the first character of the string the node represents is dropped and the rest of the string is matched against the DFA until a node is found. The failure-ownership set of the failure target (the set of nodes that take it as their failure target) is then found, the references to all nodes in it are traversed, and any node that should take the new node as its failure target is re-pointed to it. If no failure target is found, the node is added to the character-set objects in the DFA header. To delete a feature string, the string is shortened from back to front and the corresponding nodes are found and removed. The data structure is thus maintained dynamically, which facilitates multi-pattern matching over a large, continuously changing set of strings within a short time.
Description
Technical field
The invention belongs to the field of computer theory. It provides an Aho-Corasick tree data structure that supports dynamic addition and deletion of strings for the Aho-Corasick multi-pattern matching algorithm.
Background
With the rapid development of information technology, and especially in large-scale data processing, fast retrieval of key fields has become an increasingly prominent problem. In the Web 2.0 era in particular, traversing or searching massive data in real time is a routine operation. Processing such volumes often requires searching for many different strings at once, that is, multi-pattern matching, which calls for the Aho-Corasick algorithm. The algorithm has one problem, however. As an automaton algorithm, it relies on a tree data structure generated in advance from the full set of feature strings. Whenever a feature string has to be added or deleted at run time, the run must be interrupted, the previous data structure discarded, and a new one regenerated. If the new string set is large, this step takes considerable time, during which data cannot be processed promptly. An algorithm is therefore needed that both supports multi-pattern matching and completes the reorganization of the data within a bounded time.
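For context, the static construction that the background refers to can be sketched as follows. This is the textbook breadth-first build, not the method of this patent; the function names `build_automaton` and `search` are illustrative assumptions. Note that the failure links are computed only after every pattern has been inserted, which is why a change to the pattern set normally forces a full rebuild.

```python
from collections import deque

def build_automaton(patterns):
    """Classic static build: insert all patterns, then compute failure links by BFS."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # fail[state] -> state to fall back to on a mismatch
    out = [set()]    # out[state] -> patterns that end at this state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    queue = deque(goto[0].values())          # depth-1 states fall back to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:   # follow the parent's failure chain
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]           # inherit matches from the fallback state
            queue.append(t)
    return goto, fail, out

def search(text, automaton):
    """Return (start_index, pattern) for every occurrence of a pattern in text."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits.extend((i - len(p) + 1, p) for p in out[s])
    return hits
```

Adding one more pattern to such an automaton means calling `build_automaton` again on the whole set, which is the cost the invention aims to avoid.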
Summary of the invention
The object of the invention is to address the deficiency of the above algorithm by providing an improved method for dynamically generating the data used by the Aho-Corasick algorithm, so that the data structure can be maintained dynamically and multi-pattern matching over a large, constantly changing set of strings becomes convenient.
The invention is realized by the following technical means:
An improved method for dynamically generating the data of the Aho-Corasick algorithm comprises operations for adding and deleting feature strings. Adding a feature string comprises the following steps:
Step 1: Split the feature string into single characters and match them one at a time through the DFA tree. Where the DFA has no node for a character, add a corresponding node at that position.
Step 2: Set the corresponding data on the new node, then check whether the failure target of its parent node has a child for this node's character. If so, set that child as the failure target of this node and perform Step 5; if not, perform Step 3.
Step 3: Repeat Step 4. If a node is found, stop. If not, drop one more leading character from the string and repeat Step 4. If only the last character remains and no match has been found, perform Step 6.
Step 4: Drop the first character of the string this node represents and match the remaining string against the DFA. If a matching node is found, take it as the failure target, perform Step 5, and return to Step 3. If none is found, also return to Step 3.
Step 5: Find the failure-ownership set of the failure target and traverse the references to all nodes in it, checking whether any of them should take the new node as its failure target. If so, add that node's pointer to the failure-ownership set of the new node and set that node's failure target to the new node.
Step 6: Add the node to the set of character-set objects in the DFA header. If a character-set object exists for this node's character, add this node's pointer to it and traverse the node references it holds, checking whether any of those nodes should take this node as its failure target. If so, add that node's pointer to the failure-ownership set of this node and set that node's failure target to this node.
Deleting a feature string comprises the following steps:
Step 7: Delete the characters of the string one by one from back to front, repeating Step 8 until Step 8 no longer returns.
Step 8: Find the corresponding node; if it has no children, delete it and return to Step 7.
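Steps 3 and 4 amount to searching for the longest proper suffix of a node's string that is itself spelled by some node of the tree. A minimal sketch of that search, assuming the tree is represented simply as the set of strings its nodes spell (the helper names `failure_target` and `prefixes` are hypothetical, and the empty string stands for the root):

```python
def prefixes(patterns):
    """All strings spelled by nodes of the pattern tree, including "" for the root."""
    return {p[:i] for p in patterns for i in range(len(p) + 1)}

def failure_target(node_string, tree_strings):
    """Return the longest proper suffix of node_string that is a node of the tree."""
    for start in range(1, len(node_string) + 1):
        suffix = node_string[start:]      # one more leading character dropped each round
        if suffix in tree_strings:
            return suffix
    return ""                             # fall back to the root
```

For the patterns `he`, `she`, `his` and `hers`, the node spelling `she` gets the failure target `he`, while the node spelling `hers` gets `s`.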
Compared with the prior art, the invention has the following clear advantages and beneficial effects:
The improved method for dynamically generating the data of the Aho-Corasick algorithm maintains the data structure dynamically and makes it convenient to perform multi-pattern matching over a large, constantly changing set of strings in a short time. It thus both supports multi-pattern matching and completes the reorganization of the data within a bounded time.
Brief description of the drawings
Fig. 1 is a flow chart of dynamically adding a string;
Fig. 2 is a flow chart of dynamically deleting a string.
Embodiment
Specific embodiments of the invention are described below with reference to the accompanying drawings.
Data definitions of the technical scheme: the data structures and components that need to be defined and used.
Definition 1: Node object, comprising: 1) the character the node represents; 2) the string spelled out on the path from the root node to this node; 3) the set of references to the node's children; 4) the set of references to all nodes that take this node as their failure target (hereinafter the "failure-ownership set"); 5) the depth of the node relative to the root node; 6) a reference to the node's parent node; 7) a reference to the node's failure target node; 8) a flag marking whether the node is the end of a string.
Definition 2: Header object, comprising: 1) the set of character-table objects.
Definition 3: Root node object, comprising: 1) the general information of an ordinary node object; 2) an empty node character; 3) a failure target that points to the root node itself.
Definition 4: Character-table object, comprising: 1) the character it represents; 2) a block mark; 3) the set of node objects whose character equals that character and whose failure target points to the root node (hereinafter the "failure-object ownership set").
Definition 5: Aho-Corasick data-structure tree object, comprising: 1) the root node object; 2) the header object; 3) the set of references to all nodes.
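Definitions 1 to 5 could be modelled along the following lines. This is only a sketch: the class and field names are assumptions chosen for readability, and the purpose of the "block mark" in Definition 4 is not explained in the text, so it appears here as an unused flag.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass(eq=False, repr=False)   # identity-based hashing, so nodes can be stored in sets
class Node:
    char: str = ""                                               # 1) character of the node
    string: str = ""                                             # 2) string from the root to this node
    children: Dict[str, "Node"] = field(default_factory=dict)    # 3) child references
    fail_owned: Set["Node"] = field(default_factory=set)         # 4) failure-ownership set
    depth: int = 0                                               # 5) depth relative to the root
    parent: Optional["Node"] = None                              # 6) parent reference
    fail: Optional["Node"] = None                                # 7) failure target
    is_end: bool = False                                         # 8) end-of-string flag

@dataclass(eq=False, repr=False)
class CharTable:
    char: str
    blocked: bool = False          # "block mark"; its use is not described in the source
    nodes: Set[Node] = field(default_factory=set)   # nodes with this char that fail to the root

@dataclass(eq=False, repr=False)
class Header:
    tables: Dict[str, CharTable] = field(default_factory=dict)

@dataclass(eq=False, repr=False)
class Tree:
    root: Node = field(default_factory=Node)
    header: Header = field(default_factory=Header)
    all_nodes: List[Node] = field(default_factory=list)

    def __post_init__(self):
        self.root.fail = self.root          # Definition 3: the root fails to itself
        self.all_nodes.append(self.root)
```

Keeping the failure-ownership set on every node is what later allows a new or deleted node to find the nodes whose failure target must change without scanning the whole tree.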
The algorithm steps fall into two parts, addition and deletion:
Fig. 1 shows the flow chart for dynamically adding a string.
Action t1: Add a new keyword: starting from the root node, match the string to be added one character at a time. If a character matches, move on to the next one; if not, create a new node (t2).
Action t2: Create a new node: 1) add a reference to the node's parent; 2) add a reference to this node to the parent's set of child references; 3) set the character this node represents; 4) set the string this node represents; 5) set the depth value of this node; 6) add a reference to the node to the tree's set of all node references; 7) if the node is the end of the new string, mark it as such; 8) set the failure targets related to this node by performing action t3.
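Actions t1 and t2 can be sketched as an ordinary trie insertion that records the extra per-node data. The names `Node` and `add_keyword` are illustrative assumptions, and the failure-target work of item 8 is left out here. The sketch also marks the end flag when the keyword ends on an already existing node, a case the text does not spell out.

```python
class Node:
    def __init__(self, char="", parent=None):
        self.char = char                                          # t2 item 3
        self.parent = parent                                      # t2 item 1
        self.string = parent.string + char if parent else ""      # t2 item 4
        self.depth = parent.depth + 1 if parent else 0            # t2 item 5
        self.children = {}
        self.is_end = False

def add_keyword(root, all_nodes, keyword):
    """Walk the keyword from the root (t1) and create the missing nodes (t2)."""
    node, created = root, []
    for ch in keyword:
        child = node.children.get(ch)
        if child is None:
            child = Node(ch, node)
            node.children[ch] = child        # t2 item 2
            all_nodes.append(child)          # t2 item 6
            created.append(child)            # t2 item 8 (failure targets) would run here
        node = child
    node.is_end = True                       # t2 item 7
    return created
```

The returned list holds the newly created nodes in root-to-leaf order, which is the order in which their failure targets would have to be set.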
Action t3: First find the failure target of the preceding (parent) node and search its set of child nodes for a child with this node's character. If one exists, perform action t4; if not, perform action t6.
Action t4: Set the found node as the failure target of the new node, then traverse the found node's failure-ownership set and perform action t5 on each node in it. After the traversal, add the new node to that failure-ownership set and set the new node's failure target to the found node.
Action t5: Check the depth of the traversed node. If it is less than or equal to the depth of the new node, move on to the next node. Otherwise, check whether the trailing substring of the string the traversed node represents is identical to the string of the new node. If it is, remove the traversed node from the set being traversed, add it to the new node's failure-ownership set, and set its failure target to the new node.
Action t6: Take the trailing substring of this node's string whose length is the node's depth minus one and match it against the tree. If no corresponding node exists, shorten the substring by one character and match again. If a substring matches, take the final matched node and perform action t4. If even the last character of the substring has no match, point the failure target at the root node and perform action t7.
Action t7: The failure target points to the root node. Check the tree object's set of character tables for a character-table object whose character is the same as this node's. If one exists, traverse its failure-ownership set and perform action t5 on each node in it.
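The addition procedure of actions t1 to t7 can be sketched end to end as follows. All names are assumptions made for illustration. Two details are choices of this sketch rather than statements of the text: depth-1 nodes skip the t3 check (the root is its own failure target, so the check would find the new node itself), and a new node that fails to the root is also stored in the character table for its character.

```python
class Node:
    def __init__(self, char="", parent=None):
        self.char, self.parent = char, parent
        self.string = parent.string + char if parent else ""
        self.depth = parent.depth + 1 if parent else 0
        self.children = {}
        self.fail = self            # the root keeps itself; other nodes are overwritten
        self.owned = set()          # failure-ownership set
        self.is_end = False

class Tree:
    def __init__(self):
        self.root = Node()
        self.tables = {}            # char -> nodes with that char whose failure target is the root

    def find(self, s):
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def _locate_failure(self, node):
        if node.depth > 1:          # t3: child of the parent's failure target
            cand = node.parent.fail.children.get(node.char)
            if cand is not None:
                return cand
        for start in range(1, node.depth):   # t6: ever shorter trailing substrings
            cand = self.find(node.string[start:])
            if cand is not None:
                return cand
        return self.root

    def _attach(self, node):
        target = self._locate_failure(node)
        if target is self.root:     # t7: candidates come from the character table
            pool = self.tables.setdefault(node.char, set())
        else:                       # t4: candidates come from the target's ownership set
            pool = target.owned
        for other in list(pool):    # t5: re-point deeper nodes whose string ends with ours
            if other.depth > node.depth and other.string.endswith(node.string):
                pool.discard(other)
                other.fail = node
                node.owned.add(other)
        node.fail = target
        pool.add(node)

    def add(self, keyword):
        node = self.root
        for ch in keyword:          # t1: match character by character
            child = node.children.get(ch)
            if child is None:       # t2: create the node, then set failure targets
                child = Node(ch, node)
                node.children[ch] = child
                self._attach(child)
            node = child
        node.is_end = True
```

Only the nodes in one ownership set (or one character table) are inspected per new node, because any node that must be re-pointed to the new node previously shared the new node's failure target.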
Fig. 2 shows the flow chart for dynamically deleting a string.
Action t8: Removing an existing string actually means removing a chain of nodes. Match the string in the tree to obtain its last node, then perform action t9.
Action t9: Set the variable "origin node" to this node, move this node to its parent node, and perform action t10 on the origin node; repeat until this node points to the root node.
Action t10: If the node has no child nodes, traverse its failure-ownership set and change the failure target of each node in it to this node's own failure target. Then add all references in this node's failure-ownership set to the failure-ownership set of this node's failure target; if this node's failure target is the root node, they are added to the failure-ownership set in the corresponding character-table object. Finally, delete the reference to this node from its parent's set of child nodes and delete the node.
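Actions t8 to t10 can be sketched as below. The names are illustrative assumptions, and `build` is only a brute-force helper that prepares a consistent tree for the demonstration. The sketch departs from the text in three labelled ways: the per-character tables are folded into the root's own ownership set, the end-of-string flag is cleared and a node that still ends another string is kept, and a deleted node is also removed from the ownership set of its failure target.

```python
class Node:
    def __init__(self, char="", parent=None):
        self.char, self.parent = char, parent
        self.string = parent.string + char if parent else ""
        self.children, self.owned = {}, set()
        self.fail, self.is_end = self, False

def find(root, s):
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return None
    return node

def longest_suffix_node(root, s):
    for i in range(1, len(s) + 1):       # s[len(s):] is "", which always finds the root
        node = find(root, s[i:])
        if node is not None:
            return node

def build(patterns):
    """Demo helper only: build the tree and set failure targets by brute force."""
    root, nodes = Node(), []
    for p in patterns:
        node = root
        for ch in p:
            if ch not in node.children:
                node.children[ch] = Node(ch, node)
                nodes.append(node.children[ch])
            node = node.children[ch]
        node.is_end = True
    for n in nodes:
        n.fail = longest_suffix_node(root, n.string)
        n.fail.owned.add(n)
    return root

def delete(root, keyword):
    node = find(root, keyword)           # t8: locate the last node of the string
    if node is None or not node.is_end:
        return False
    node.is_end = False                  # assumption: the text does not mention the flag
    # t9: climb towards the root; stop at a node that is still needed
    while node is not root and not node.children and not node.is_end:
        parent = node.parent
        for other in node.owned:         # t10: dependants inherit this node's failure target
            other.fail = node.fail
        node.fail.owned |= node.owned
        node.fail.owned.discard(node)    # assumption: drop the deleted node itself
        del parent.children[node.char]
        node = parent
    return True
```

Because a childless node is always a leaf, every node that pointed to it can safely inherit its failure target, which is the next-longest suffix still present in the tree.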
Claims (1)
1. An improved method for dynamically generating the data of the Aho-Corasick algorithm, comprising operations for adding and deleting feature strings, characterized in that adding a feature string comprises the following steps:
Step 1: Split the feature string into single characters and match them one at a time through the DFA tree; where the DFA has no node for a character, add a corresponding node at that position;
Step 2: Set the corresponding data on the new node, then check whether the failure target of its parent node has a child for this node's character; if so, set that child as the failure target of this node and perform Step 5; if not, perform Step 3;
Step 3: Repeat Step 4; if a node is found, stop; if not, drop one more leading character from the string and repeat Step 4; if only the last character remains and no match has been found, perform Step 6;
Step 4: Drop the first character of the string this node represents and match the remaining string against the DFA; if a matching node is found, take it as the failure target, perform Step 5, and return to Step 3; if none is found, also return to Step 3;
Step 5: Find the failure-ownership set of the failure target and traverse the references to all nodes in it, checking whether any of them should take the new node as its failure target; if so, add that node's pointer to the failure-ownership set of the new node and set that node's failure target to the new node;
Step 6: Add the node to the set of character-set objects in the DFA header; if a character-set object exists for this node's character, add this node's pointer to it and traverse the node references it holds, checking whether any of those nodes should take this node as its failure target; if so, add that node's pointer to the failure-ownership set of this node and set that node's failure target to this node;
Deleting a feature string comprises the following steps:
Step 7: Delete the characters of the string one by one from back to front, repeating Step 8 until Step 8 no longer returns;
Step 8: Find the corresponding node; if it has no children, delete it and return to Step 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210312478.9A CN102867036B (en) | 2012-08-29 | 2012-08-29 | Improved method for dynamic generation of data structure for Aho-Corasick algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210312478.9A CN102867036B (en) | 2012-08-29 | 2012-08-29 | Improved method for dynamic generation of data structure for Aho-Corasick algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102867036A | 2013-01-09 |
CN102867036B | 2015-03-04 |
Family ID: 47445905
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210312478.9A (CN102867036B, Expired - Fee Related) | Improved method for dynamic generation of data structure for Aho-Corasick algorithm | 2012-08-29 | 2012-08-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102867036B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106067039B (en) * | 2016-05-30 | 2019-01-29 | 桂林电子科技大学 | Method for mode matching based on decision tree beta pruning |
CN107885492B (en) * | 2017-11-14 | 2021-03-12 | 中国银行股份有限公司 | Method and device for dynamically generating data structure in host |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101551803A (en) * | 2008-03-31 | 2009-10-07 | 华为技术有限公司 | Method and device for establishing pattern matching state machine and pattern recognition |
CN101556619A (en) * | 2009-05-04 | 2009-10-14 | 成都市华为赛门铁克科技有限公司 | Node compression method and device thereof and multimode matching method and device thereof |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2437560A (en) * | 2006-04-28 | 2007-10-31 | Roke Manor Research | Constructing Aho Corasick trees |
US7725510B2 (en) * | 2006-08-01 | 2010-05-25 | Alcatel-Lucent Usa Inc. | Method and system for multi-character multi-pattern pattern matching |
Non-Patent Citations (1)
Title |
---|
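Application research on a fast and efficient pattern matching algorithm (一种快速高效的模式匹配算法的应用研究); Wang Jie (王杰) et al.; Computer Engineering and Applications (《计算机工程与应用》); 2008-11-11; pp. 93-95, 185 * |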
Also Published As
Publication number | Publication date |
---|---|
CN102867036A (en) | 2013-01-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11681702B2 (en) | Conversion of model views into relational models | |
US9495207B1 (en) | Cataloging data sets for reuse in pipeline applications | |
CN105389349B (en) | Dictionary update method and device | |
CA2562281C (en) | Partial query caching | |
CN105574054B (en) | A kind of distributed caching range query method, apparatus and system | |
US11176159B1 (en) | Systems and methods for data analytics | |
KR101617696B1 (en) | Method and device for mining data regular expression | |
CN102033748A (en) | Method for generating data processing flow codes | |
US12001425B2 (en) | Duplication elimination in depth based searches for distributed systems | |
CN106611037A (en) | Method and device for distributed diagram calculation | |
TWI706260B (en) | Index establishment method and device based on mobile terminal NoSQL database | |
CN101916281B (en) | Concurrent computational system and non-repetition counting method | |
CN105471893B (en) | A kind of distributed equivalent data flow connection method | |
CN106227799A (en) | A kind of sql statement processing method based on distributed data base | |
EP3717997A1 (en) | Cardinality estimation in databases | |
CN103309873B (en) | The processing method of data, apparatus and system | |
CN102867036B (en) | Improved method for dynamic generation of data structure for Aho-Corasick algorithm | |
CN106156171A (en) | A kind of enquiring and optimizing method of Virtual asset data | |
WO2016177027A1 (en) | Batch data query method and device | |
KR101955376B1 (en) | Processing method for a relational query in distributed stream processing engine based on shared-nothing architecture, recording medium and device for performing the method | |
CN103092960A (en) | Method for building software product feature tree model based on demand cluster | |
CN105608201A (en) | Text matching method supporting multi-keyword expression | |
CN107679240B (en) | Virtual identity mining method | |
CN105357177A (en) | Method for processing data packet filtering rule set and data packet matching method | |
CN106445968A (en) | Data merging method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20150304; Termination date: 20170829 |