CN102867036A - Improved method for dynamic generation of data structure for Aho-Corasick algorithm - Google Patents

Info

Publication number
CN102867036A
Authority
CN
China
Prior art keywords
node
character
nodes
dfa
inefficacy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012103124789A
Other languages
Chinese (zh)
Other versions
CN102867036B (en)
Inventor
张正欣
张建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201210312478.9A
Publication of CN102867036A
Application granted
Publication of CN102867036B
Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses an improved method for dynamically generating the data structure used by the Aho-Corasick algorithm. The method comprises operations for adding and deleting feature strings. To add a string: split the feature string into single characters and add the corresponding nodes at the appropriate positions of the deterministic finite automaton (DFA); set the corresponding data on each new node and check the failure target of its parent node; if that does not yield a failure target, search for one by dropping the first character of the string the node represents and matching the remaining string against the DFA; find the failure ownership set of the failure target, traverse the references to all nodes in that set, and determine whether any of them should take the new node as its failure target; add the node to the character table object set in the header of the DFA. To delete a string: shorten the string one character at a time from back to front and find the corresponding nodes. The data structure is thus maintained dynamically, which facilitates multi-pattern matching against a large number of continuously changing strings within a short time.

Description

Improved method for dynamic generation of a data structure for the Aho-Corasick algorithm
Technical field
The invention belongs to the field of computer theory. It provides an Aho-Corasick tree data structure, to which strings can be dynamically added and from which they can be dynamically removed, for the Aho-Corasick multi-pattern matching algorithm.
Background art
With the rapid development of information technology, and especially in big-data processing, fast retrieval of key fields has become an increasingly prominent problem. In the Web 2.0 era in particular, traversing or searching massive data in real time is a routine operation. Processing data at this scale often requires retrieving many different strings simultaneously, that is, multi-pattern matching, which calls for the Aho-Corasick algorithm. The algorithm has one drawback, however. As an automaton algorithm, it relies on a tree data structure generated in advance from the full set of feature strings. If a feature string must be added or deleted while the automaton is running, the run has to be interrupted, the old data structure deleted, and a new one regenerated. If the new string set is large, this step takes considerable time, during which incoming data cannot be processed promptly. An algorithm is therefore needed that both supports multi-pattern matching and completes the reorganization of the data within a bounded time.
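For contrast, the pre-generated structure that the background refers to can be sketched as below. This is the textbook breadth-first construction, not code from the patent; the function names `build` and `search` and the list-based state layout are assumptions made for illustration. Note that adding one pattern means calling `build` again over the whole pattern set.

```python
from collections import deque

def build(patterns):
    # goto[s] maps a character to the next state; state 0 is the root.
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    # Breadth-first pass: depth-1 states keep the root as failure target.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]  # inherit patterns that end at the failure target
            queue.append(t)
    return goto, fail, out

def search(text, automaton):
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)
```

In this static form, any change to the pattern set discards the automaton and rebuilds it, which is the cost the invention aims to avoid.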
Summary of the invention
The object of the invention is to address this deficiency by providing an improved method for dynamically generating the Aho-Corasick data structure, so that the structure can be maintained dynamically and multi-pattern matching can conveniently be performed against a large number of continuously changing strings.
The invention is realized by the following technical means:
An improved method for dynamically generating the Aho-Corasick data structure comprises operations for adding and deleting feature strings. Adding a feature string comprises the following steps:
Step 1: Split the feature string into single characters and match them one at a time against the DFA tree. When the DFA has no node for a character, add a corresponding node at that position of the DFA.
Step 2: Set the corresponding data on the new node. Check whether the failure target of the parent node has a child for the character that the new node represents. If so, set that child as the failure target of the new node and go to step 5; if not, go to step 3.
Step 3: Repeat step 4. If a node is found, stop. If not, drop one more character from the string and repeat step 4. If no match is found even when only the last character remains, go to step 6.
Step 4: Drop the first character of the string that the node represents and match the remaining string against the DFA. If a matching node is found, take it as the failure target, perform step 5, and return to step 3. If none is found, also return to step 3.
Step 5: Find the failure ownership set of the failure target and traverse the references to all nodes in it, checking whether any of them should now take the new node as its failure target. If so, set it.
Step 6: Add the node to the character table object set in the DFA header. If a character table object exists for the character that the node represents, add a pointer to the node to that object and traverse the node references it holds, checking whether any of them should take the new node as its failure target. If so, set it.
Deleting a feature string comprises the following steps:
Step 7: Shorten the string from back to front, one character at a time, executing step 8 repeatedly until step 8 no longer returns.
Step 8: Find the node corresponding to the current character. If the node has no children, delete it and return to step 7.
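The search in steps 3 and 4 can be illustrated with a small sketch. It assumes a trie stored as nested dictionaries, which is a simplification chosen here and not the node object the patent defines; `insert`, `walk` and `longest_suffix_node` are hypothetical names.

```python
def insert(root, word):
    """Add word to a nested-dict trie (one dict per node)."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})

def walk(root, s):
    """Follow s from the root; return the node reached, or None."""
    node = root
    for ch in s:
        node = node.get(ch)
        if node is None:
            return None
    return node

def longest_suffix_node(root, s):
    """Drop leading characters of s until the remainder is found in the trie.

    Returns (matched_suffix, node). If even the last single character is
    absent, the root is returned, which corresponds to the step 6 branch.
    """
    for start in range(1, len(s)):
        node = walk(root, s[start:])
        if node is not None:
            return s[start:], node
    return "", root
```

With the strings "he" and "she" in the trie, the node for "she" resolves to the node for "he", and the node for "s" resolves to the root.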
The present invention compared with prior art has following obvious advantage and beneficial effect:
The improved method of the invention maintains the Aho-Corasick data structure dynamically, making it convenient to perform multi-pattern matching in a short time against a large number of constantly changing strings. It both supports multi-pattern matching and completes the reorganization of the data within a bounded time.
Description of drawings
Fig. 1 is the flow chart for dynamically adding a string;
Fig. 2 is the flow chart for dynamically deleting a string.
Embodiment
Specific embodiments of the invention are described below with reference to the accompanying drawings.
Data definitions of the technical scheme: the data structures used and their components.
Definition 1: A node object comprises: 1) the character the node represents; 2) the string spelled out along the path from the root node to this node; 3) the set of references to the node's children; 4) the set of references to all nodes that have this node as their failure target (hereinafter the "failure ownership set"); 5) the depth of the node relative to the root node; 6) a reference to the node's parent; 7) a reference to the node's failure target node; 8) a mark indicating whether the node is the end of a string.
Definition 2: The header object comprises: 1) the set of character table objects.
Definition 3: The root node object comprises: 1) the general information of an ordinary node object; 2) an empty node character; 3) a failure target that points to the root itself.
Definition 4: A character table object comprises: 1) the character it represents; 2) a block mark; 3) the set of all node objects whose character equals that character and whose failure target is the root node (hereinafter the "failure ownership set" of the character table object).
Definition 5: The Aho-Corasick data-structure tree object comprises: 1) the root node object; 2) the header object; 3) the set of references to all nodes.
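Definitions 1 to 5 map naturally onto a few record types. The sketch below is one possible rendering; the field names are invented here, the header is modelled as a dictionary keyed by character, and the purpose of the block mark is not stated in the source, so it is carried only as a plain flag.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass(eq=False, repr=False)
class Node:                                   # Definition 1
    char: str = ""                            # 1) character the node represents
    string: str = ""                          # 2) string from the root to here
    children: Dict[str, "Node"] = field(default_factory=dict)   # 3)
    fail_owners: List["Node"] = field(default_factory=list)     # 4) failure ownership set
    depth: int = 0                            # 5) depth relative to the root
    parent: Optional["Node"] = None           # 6)
    fail: Optional["Node"] = None             # 7) failure target
    is_end: bool = False                      # 8) end-of-string mark

@dataclass(eq=False, repr=False)
class CharTable:                              # Definition 4
    char: str
    blocked: bool = False                     # "block mark"; semantics assumed unknown
    root_failers: List[Node] = field(default_factory=list)  # nodes failing to the root

@dataclass(eq=False, repr=False)
class Tree:                                   # Definition 5
    root: Node = field(default_factory=Node)
    header: Dict[str, CharTable] = field(default_factory=dict)  # Definition 2
    all_nodes: List[Node] = field(default_factory=list)

    def __post_init__(self):
        # Definition 3: the root has an empty character and fails to itself.
        self.root.fail = self.root
```

Keeping the inverse of the failure relation (`fail_owners` and `root_failers`) is what later allows a new or deleted node to find the few existing nodes whose failure target must change, without scanning the whole tree.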
The algorithm steps fall into two parts, addition and deletion:
Fig. 1 shows the flow chart for dynamically adding a string.
Action t1: Add a new keyword. Starting from the root node, match the string to be added one character at a time. If a node for the character exists, move on to it; if not, create a new node (t2).
Action t2: Create a new node: 1) set the node's parent reference; 2) add a reference to the node to the parent's child reference set; 3) set the character the node represents; 4) set the string the node represents; 5) set the node's depth; 6) add a reference to the node to the tree's set of all node references; 7) if the node is the end of the new string, set its end mark; 8) set the failure target associated with the node by performing action t3.
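Actions t1 and t2 amount to an ordinary trie insertion that records a little extra data per node. A minimal sketch follows, with the failure-target set-up of item 8 left out; the class and function names are assumptions for illustration.

```python
class Node:
    def __init__(self, char="", parent=None):
        self.char = char                      # t2 item 3
        self.parent = parent                  # t2 item 1
        self.children = {}
        # t2 items 4 and 5: string and depth follow from the parent.
        self.string = parent.string + char if parent is not None else ""
        self.depth = parent.depth + 1 if parent is not None else 0
        self.is_end = False

def add_keyword(root, all_nodes, word):
    """t1: match word character by character, creating missing nodes (t2).

    Returns the list of nodes created by this call.
    """
    node, created = root, []
    for ch in word:
        child = node.children.get(ch)
        if child is None:
            child = Node(ch, parent=node)
            node.children[ch] = child         # t2 item 2
            all_nodes.append(child)           # t2 item 6
            created.append(child)             # t2 item 8 would run t3 here
        node = child
    node.is_end = True                        # t2 item 7
    return created
```

Adding "she" to an empty tree creates three nodes; adding "shy" afterwards creates only one, because the prefix "sh" is already present.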
Action t3: First look up the failure node of the parent node and search its child set for a child that represents the same character as the new node. If one exists, perform action t4; if not, perform action t6.
Action t4: Take the node found as the failure target of the new node. Traverse the failure ownership set of the node found and perform action t5 on each node in it. When the traversal is complete, add the new node to that failure ownership set and set the failure target of the new node to the node found.
Action t5: Check the depth of the traversed node. If it is less than or equal to the depth of the new node, move on to the next node. Otherwise, check whether the trailing substring of the string that the traversed node represents is identical to the string that the new node represents. If it is, remove the traversed node from the set being traversed, add it to the failure ownership set of the new node, and set its failure target to the new node.
Action t6: Take the trailing substring of the string that the new node represents, with length one less than the node's depth, and match it against the tree. If no node corresponds, reduce the substring length by one and match again. If a substring matches, take the final matched node and perform action t4. If no match is found even for the last single character, point the failure target at the root node and perform action t7.
Action t7: The failure target points to the root node. Check whether the tree's character table object set contains a character table object for the character that the new node represents. If it does, traverse the failure ownership set of that character table object and perform action t5 on each node in it.
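Actions t3 to t7 can be combined into one routine that runs for every newly created node. The sketch below is an interpretation under stated assumptions rather than the patent's exact procedure: the header is a dictionary from character to the list of nodes that fail to the root, the parent of a depth-1 node is the root so t3 is skipped for it, and all names are hypothetical.

```python
class Node:
    def __init__(self, char="", parent=None):
        self.char, self.parent = char, parent
        self.string = parent.string + char if parent is not None else ""
        self.depth = len(self.string)
        self.children = {}
        self.fail = self            # the root keeps itself as failure target
        self.fail_owners = []       # nodes whose failure target is this node

def walk(root, s):
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return None
    return node

def find_failure_target(root, node):
    # t3: look under the parent's failure target for the same character.
    if node.parent is not root:
        cand = node.parent.fail.children.get(node.char)
        if cand is not None:
            return cand
    # t6: try ever shorter proper suffixes of the node's string.
    for start in range(1, node.depth):
        cand = walk(root, node.string[start:])
        if cand is not None:
            return cand
    return root

def set_failure(root, header, node):
    target = find_failure_target(root, node)
    # t4 uses the target's ownership set; t7 uses the character table instead.
    owners = header.setdefault(node.char, []) if target is root else target.fail_owners
    for m in list(owners):          # t5: re-point deeper nodes that end with this string
        if m.depth > node.depth and m.string.endswith(node.string):
            owners.remove(m)
            m.fail = node
            node.fail_owners.append(m)
    node.fail = target
    owners.append(node)

def add_keyword(root, header, word):
    node = root
    for ch in word:
        child = node.children.get(ch)
        if child is None:
            child = Node(ch, node)
            node.children[ch] = child
            set_failure(root, header, child)
        node = child
    return node
```

Only nodes in one ownership set are examined per new node, because any existing node that should now fail to the new node must currently fail to the new node's own failure target. For example, adding "she" and then "he" re-points the node for "she" from the root to the node for "he".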
Fig. 2 shows the flow chart for dynamically deleting a string.
Action t8: Delete an existing string, which in effect removes a chain of data nodes. Match the string in the tree to obtain its last node, then perform action t9.
Action t9: Set the variable "origin node" to the current node and move the current node to its parent, then perform action t10. Repeat until the current node is the root node.
Action t10: If the node has no children, traverse its failure ownership set and change the failure target of each object in it to the failure target of this node. Then add all references in this node's failure ownership set to the failure ownership set of this node's failure target node; if this node's failure target is the root node, add them instead to the failure ownership set of the corresponding character table object. Finally, delete the reference to this node from the child set of its parent and delete the node.
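The deletion actions t8 to t10 can be sketched as follows. Two details are additions made here and are not spelled out in the source: a node that still marks the end of another string is kept, and a deleted node is also removed from the ownership set that lists it. The header is again assumed to be a dictionary from character to the nodes that fail to the root, and all names are hypothetical.

```python
class Node:
    def __init__(self, char="", parent=None):
        self.char, self.parent = char, parent
        self.children = {}
        self.fail = self            # callers set this for non-root nodes
        self.fail_owners = []       # nodes whose failure target is this node
        self.is_end = False

def owners_of(root, header, target, char):
    """Return the list that records nodes failing to target.

    For the root this is the character table entry for char.
    """
    return header.setdefault(char, []) if target is root else target.fail_owners

def remove_keyword(root, header, word):
    # t8: match the string to reach its last node.
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    if not node.is_end:
        return False
    node.is_end = False
    # t9/t10: walk towards the root, deleting nodes that have no children.
    while node is not root and not node.children and not node.is_end:
        # Hand this node's failure owners over to its own failure target.
        for m in node.fail_owners:
            m.fail = node.fail
            owners_of(root, header, node.fail, m.char).append(m)
        # Assumption: also drop the node from the set that lists it.
        owners_of(root, header, node.fail, node.char).remove(node)
        del node.parent.children[node.char]
        node = node.parent
    return True
```

Handing the owners to the deleted node's failure target is sufficient because every owner's string ends with the deleted node's string, so its next-longest suffix in the tree is exactly that target.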

Claims (1)

1. An improved method for dynamically generating the Aho-Corasick data structure, comprising operations for adding and deleting feature strings, characterized in that adding a feature string comprises the following steps:
Step 1: Split the feature string into single characters and match them one at a time against the DFA tree. When the DFA has no node for a character, add a corresponding node at that position of the DFA.
Step 2: Set the corresponding data on the new node. Check whether the failure target of the parent node has a child for the character that the new node represents. If so, set that child as the failure target of the new node and go to step 5; if not, go to step 3.
Step 3: Repeat step 4. If a node is found, stop. If not, drop one more character from the string and repeat step 4. If no match is found even when only the last character remains, go to step 6.
Step 4: Drop the first character of the string that the node represents and match the remaining string against the DFA. If a matching node is found, take it as the failure target, perform step 5, and return to step 3. If none is found, also return to step 3.
Step 5: Find the failure ownership set of the failure target and traverse the references to all nodes in it, checking whether any of them should now take the new node as its failure target. If so, set it.
Step 6: Add the node to the character table object set in the DFA header. If a character table object exists for the character that the node represents, add a pointer to the node to that object and traverse the node references it holds, checking whether any of them should take the new node as its failure target. If so, set it.
Deleting a feature string comprises the following steps:
Step 7: Shorten the string from back to front, one character at a time, executing step 8 repeatedly until step 8 no longer returns.
Step 8: Find the node corresponding to the current character. If the node has no children, delete it and return to step 7.
CN201210312478.9A 2012-08-29 2012-08-29 Improved method for dynamic generation of data structure for Aho-Corasick algorithm Expired - Fee Related CN102867036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210312478.9A CN102867036B (en) 2012-08-29 2012-08-29 Improved method for dynamic generation of data structure for Aho-Corasick algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210312478.9A CN102867036B (en) 2012-08-29 2012-08-29 Improved method for dynamic generation of data structure for Aho-Corasick algorithm

Publications (2)

Publication Number Publication Date
CN102867036A true CN102867036A (en) 2013-01-09
CN102867036B CN102867036B (en) 2015-03-04

Family

ID=47445905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210312478.9A Expired - Fee Related CN102867036B (en) 2012-08-29 2012-08-29 Improved method for dynamic generation of data structure for Aho-Corasick algorithm

Country Status (1)

Country Link
CN (1) CN102867036B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070282835A1 (en) * 2006-04-28 2007-12-06 Roke Manor Research Limited Aho-corasick methodology for string searching
US20080046423A1 (en) * 2006-08-01 2008-02-21 Lucent Technologies Inc. Method and system for multi-character multi-pattern pattern matching
CN101551803A (en) * 2008-03-31 2009-10-07 华为技术有限公司 Method and device for establishing pattern matching state machine and pattern recognition
CN101556619A (en) * 2009-05-04 2009-10-14 成都市华为赛门铁克科技有限公司 Node compression method and device thereof and multimode matching method and device thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王杰 等: "一种快速高效的模式匹配算法的应用研究", 《计算机工程与应用》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106067039A (en) * 2016-05-30 2016-11-02 桂林电子科技大学 Method for mode matching based on decision tree beta pruning
CN106067039B (en) * 2016-05-30 2019-01-29 桂林电子科技大学 Method for mode matching based on decision tree beta pruning
CN107885492A (en) * 2017-11-14 2018-04-06 中国银行股份有限公司 The method and device of data structure dynamic generation in main frame

Also Published As

Publication number Publication date
CN102867036B (en) 2015-03-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150304

Termination date: 20170829