CN102867036B - Improved method for dynamic generation of data structure for Aho-Corasick algorithm - Google Patents

Improved method for dynamic generation of data structure for Aho-Corasick algorithm

Info

Publication number
CN102867036B
Authority
CN
China
Prior art keywords
node
character
nodes
dfa
faulty target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210312478.9A
Other languages
Chinese (zh)
Other versions
CN102867036A (en)
Inventor
张正欣
张建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201210312478.9A
Publication of CN102867036A
Application granted
Publication of CN102867036B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses an improved method for dynamically generating the data structure used by the Aho-Corasick algorithm. The method covers both adding and deleting feature strings. To add a string, it is split into single characters and matched against the deterministic finite automaton (DFA), and a node is added wherever the DFA lacks the corresponding character. The data of the new node are set and the failure target of its parent node is checked. If that check does not yield a failure target, the first character of the string the node represents is stripped and the remaining string is matched against the DFA. Once a failure target is found, its failure ownership set is located and the references to all nodes in it are traversed to determine whether any of them should take the new node as its failure target. If no failure target is found, the node is added to the character table object set in the header of the DFA. To delete a string, its nodes are removed one at a time from back to front, locating the corresponding node at each step. The data structure is thus maintained dynamically, which facilitates multi-pattern matching against a large number of continuously changing strings within a short time.

Description

Improved method for dynamic generation of the data structure for the Aho-Corasick algorithm
Technical field
The invention belongs to the field of computer theory. It provides an Aho-Corasick tree data structure that supports dynamic addition and deletion, for use by the Aho-Corasick multi-pattern matching algorithm.
Background art
With the rapid development of information technology, and especially in large-scale data processing, fast retrieval of key fields has become an increasingly prominent problem. In the Web 2.0 era in particular, traversing or searching massive data in real time is a routine operation. When processing data at this scale, many different strings often have to be searched for at the same time, that is, multi-pattern matching, which calls for the Aho-Corasick algorithm. The algorithm has one problem, however. As an automaton algorithm, it relies on a tree data structure generated in advance from the full set of feature strings. If a feature string has to be added or deleted at run time, operation must be interrupted, the existing data structure discarded, and a new one generated. If the new string set is large, this step takes considerable time, during which data cannot be processed promptly. An algorithm is therefore needed that both supports multi-pattern matching and completes the reorganization of the data in finite time.
Summary of the invention
The object of the invention is to address the above deficiency of the algorithm by providing an improved method for dynamically generating the data for the Aho-Corasick algorithm, so that the data structure can be maintained dynamically and multi-pattern matching can conveniently be run against a large number of continuously changing strings.
The present invention is realized by the following technical means:
An improved method for dynamically generating the data for the Aho-Corasick algorithm comprises operations for adding and deleting feature strings. Adding a feature string comprises the following steps:
Step 1: Split the feature string into single characters and match them one at a time through the DFA tree; where the DFA has no node for a character, add a corresponding node at that position.
Step 2: Set the corresponding data on the new node, then check whether the failure target of the parent node has a child for the character this node represents. If so, set that child as this node's failure target and perform Step 5; if not, perform Step 3.
Step 3: Repeat Step 4. If a node is found, stop; if not, strip one more leading character from the string and repeat Step 4. If no match has been made by the time only the last character remains, perform Step 6.
Step 4: Strip the first character of the string this node represents and match the remaining string against the DFA. If a matching node is found, take it as the failure target, perform Step 5, and return to Step 3; if none is found, also return to Step 3.
Step 5: Locate the failure ownership set of the failure target and traverse the references to all nodes in it, checking whether any of them should take the new node as its failure target. If so, add that node's pointer to the new node's failure ownership set and set that node's failure target to the new node.
Step 6: Add the node to the character table object set in the DFA header. If a character table object exists for the character this node represents, add this node's pointer to it and traverse the node references it holds, checking whether any of those nodes should take this node as its failure target. If so, add that node's pointer to this node's failure ownership set and set that node's failure target to this node.
Deleting a feature string comprises the following steps:
Step 7: Delete the string's characters one at a time from back to front, repeating Step 8 until Step 8 no longer returns.
Step 8: Find the corresponding node; if it has no children, delete it and return to Step 7.
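Steps 3 and 4 amount to searching for the longest proper suffix of the node's string that can still be matched from the root. The following is a minimal sketch of that search only, assuming a plain nested-dict trie; the names `make_trie`, `walk`, and `failure_string` are hypothetical and do not come from the patent.

```python
def make_trie(patterns):
    """Build a nested-dict trie: each node maps a character to a child dict."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
    return root


def walk(trie, text):
    """Follow text from the root; return the node reached, or None if the path breaks."""
    node = trie
    for ch in text:
        node = node.get(ch)
        if node is None:
            return None
    return node


def failure_string(trie, s):
    """Return the longest proper suffix of s that is a path from the root.

    An empty result means no suffix matched, so the failure target is the
    root (the case that Step 6 handles).
    """
    for start in range(1, len(s)):  # strip 1, 2, ... leading characters
        if walk(trie, s[start:]) is not None:
            return s[start:]
    return ""


trie = make_trie(["he", "she", "hers"])
print(failure_string(trie, "she"))   # suffix "he" is a root path
print(failure_string(trie, "hers"))  # only the suffix "s" is a root path
```

Each attempt restarts from the root, so the search costs at most one trie walk per stripped character. Step 2 is the shortcut that avoids this loop when the parent's failure target already has a suitable child.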
Compared with the prior art, the present invention has the following clear advantages and beneficial effects:
The improved method of the present invention for dynamically generating the data for the Aho-Corasick algorithm maintains the data structure dynamically and makes it convenient to run multi-pattern matching against a large number of continuously changing strings within a short time. It both guarantees multi-pattern matching and completes the reorganization of the data in finite time.
Brief description of the drawings
Fig. 1 is the flowchart for dynamically adding a string;
Fig. 2 is the flowchart for dynamically deleting a string.
Detailed description of the embodiments
Specific embodiments of the invention are described below with reference to the accompanying drawings.
Data definitions of the technical scheme: the data structures required and their components.
Definition 1: Node object, comprising: 1) the character the node represents; 2) the string obtained on reaching this node from the root node; 3) the set of references to the node's children; 4) the set of references to all nodes that take this node as their failure target (hereinafter the "failure ownership set"); 5) the node's depth relative to the root node; 6) a reference to the node's parent node; 7) a reference to the node's failure target node; 8) a flag marking whether the node is an end-of-string node.
Definition 2: Header object, comprising: 1) the set of character table objects.
Definition 3: Root node object, comprising: 1) the general information of an ordinary node object; 2) an empty character; 3) a failure target that points to the root node itself.
Definition 4: Character table object, comprising: 1) the character it represents; 2) a block flag; 3) the set of node objects whose character equals that character and whose failure target is the root node (hereinafter the "failure object ownership set").
Definition 5: Aho-Corasick data structure tree object, comprising: 1) the root node object; 2) the header object; 3) the set of references to all nodes.
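Definitions 1 to 5 map fairly directly onto record types. The sketch below is one possible rendering, not the patent's own code: all class and field names (`Node`, `CharTable`, `Header`, `Tree`, `new_tree`, `fail_owners`, `blocked`) are assumptions chosen for illustration, and the meaning of the block flag is not specified in the source, so it is kept as a bare boolean.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # eq=False keeps identity hashing, so nodes can live in sets
class Node:
    char: str = ""                                             # 1) character represented
    string: str = ""                                           # 2) string from root to here
    children: dict[str, Node] = field(default_factory=dict)    # 3) child references
    fail_owners: set[Node] = field(default_factory=set)        # 4) failure ownership set
    depth: int = 0                                             # 5) depth relative to root
    parent: Optional[Node] = None                              # 6) parent reference
    fail: Optional[Node] = None                                # 7) failure target
    is_end: bool = False                                       # 8) end-of-string flag


@dataclass(eq=False)
class CharTable:
    char: str                                                  # 1) character represented
    blocked: bool = False                                      # 2) block flag (semantics assumed)
    nodes: set[Node] = field(default_factory=set)              # 3) nodes with this char failing to root


@dataclass
class Header:
    tables: dict[str, CharTable] = field(default_factory=dict)


@dataclass
class Tree:
    root: Node
    header: Header = field(default_factory=Header)
    all_nodes: list[Node] = field(default_factory=list)


def new_tree() -> Tree:
    """Create an empty tree whose root has an empty character and fails to itself."""
    root = Node()
    root.fail = root  # Definition 3: the root's failure target is the root
    return Tree(root=root, all_nodes=[root])
```

Keeping the failure ownership set on every node is what allows later updates to touch only the nodes whose failure target may change, instead of rebuilding every failure link.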
The algorithm steps fall into two parts, addition and deletion.
Fig. 1 shows the flowchart for dynamically adding a string.
Action t1: Add a new keyword. Starting from the root node, match the string to be added one character at a time. If the character has a node, move on to the next character; if not, create a new node (t2).
Action t2: Create a new node: 1) set the node's parent reference; 2) add a reference to the node to the parent's child reference set; 3) set the character the node represents; 4) set the string the node represents; 5) set the node's depth value; 6) add a reference to the node to the tree's set of all node references; 7) if the node is the end of the new string, mark it as such; 8) set the failure target associated with the node by performing action t3.
Action t3: First take the failure node of the preceding (parent) node and search its child set for a child representing the same character as this node. If there is one, perform action t4; if not, perform action t6.
Action t4: Take the node that was found as the failure target of this node, and traverse that failure target's failure ownership set, performing action t5 on each node in it. After the traversal, add this node to that failure ownership set and set this node's failure target to the node that was found.
Action t5: Check the depth of the traversed node. If it is less than or equal to this node's depth, move on to the next node. Otherwise, check whether the trailing substring of the traversed node's string is identical to this node's string; if so, remove the traversed node from the set being traversed, add it to this node's failure ownership set, and set its failure target to this node.
Action t6: Take the trailing substring of this node's string whose length is one less than this node's depth and match it in the tree; if there is no corresponding node, shorten the substring by one character and match again. If a substring matches, take the final matched node and perform action t4. If even the last character of the substring has no match, point the failure target to the root node and perform action t7.
Action t7: The failure target points to the root node. Check the tree object's set of character tables for a character table object whose character is the same as this node's. If there is one, traverse that character table object's failure ownership set and perform action t5 on each node in it.
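Actions t1 to t7 can be sketched as follows. This is an illustrative reading of the text under stated assumptions, not the patented implementation: the names `DynamicAC`, `Node`, `owners`, and `char_tables` are hypothetical, the check that a depth-1 node does not pick itself in t3 is an added guard, and `search` is the classic Aho-Corasick scan, included only to show that the incrementally maintained links are usable.

```python
class Node:
    def __init__(self, char="", parent=None):
        self.char = char
        self.parent = parent
        self.string = parent.string + char if parent is not None else ""
        self.depth = len(self.string)
        self.children = {}
        self.owners = set()   # failure ownership set: nodes that fail to this node
        self.fail = None
        self.is_end = False


class DynamicAC:
    def __init__(self):
        self.root = Node()
        self.root.fail = self.root
        self.char_tables = {}  # header: char -> nodes with that char failing to root

    def _walk(self, text):
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def _find_fail(self, node):
        # t3: try the child of the parent's failure target first.
        cand = node.parent.fail.children.get(node.char)
        if cand is not None and cand is not node:  # guard for depth-1 nodes (assumption)
            return cand
        # t6: otherwise try ever shorter trailing substrings from the root.
        for start in range(1, node.depth):
            hit = self._walk(node.string[start:])
            if hit is not None:
                return hit
        return self.root

    def _link(self, node):
        target = self._find_fail(node)
        # t7: nodes failing to root are indexed per character in the header.
        if target is self.root:
            pool = self.char_tables.setdefault(node.char, set())
        else:
            pool = target.owners
        # t5: deeper nodes whose string ends with ours must now fail to us.
        for other in list(pool):
            if other.depth > node.depth and other.string.endswith(node.string):
                pool.discard(other)
                other.fail = node
                node.owners.add(other)
        # t4: finally register this node with its own failure target.
        node.fail = target
        pool.add(node)

    def add(self, pattern):
        if not pattern:
            return
        node = self.root
        for ch in pattern:                      # t1: match character by character
            child = node.children.get(ch)
            if child is None:                   # t2: create and link a new node
                child = Node(ch, node)
                node.children[ch] = child
                self._link(child)
            node = child
        node.is_end = True

    def search(self, text):
        """Classic Aho-Corasick scan; returns (start index, pattern) pairs."""
        hits, node = [], self.root
        for i, ch in enumerate(text):
            while node is not self.root and ch not in node.children:
                node = node.fail
            node = node.children.get(ch, self.root)
            out = node
            while out is not self.root:
                if out.is_end:
                    hits.append((i - out.depth + 1, out.string))
                out = out.fail
            # Output links are not cached here, to keep the sketch short.
        return hits


ac = DynamicAC()
for word in ["she", "he", "hers", "his"]:
    ac.add(word)
print(ac.search("ushers"))
```

Adding "she" before "he" exercises the update path: the node for "she" initially fails to the root, and is moved to the ownership set of "he" when that shorter string arrives. Only the ownership set of the new node's own failure target needs to be inspected, because any node that should switch to the new node must currently fail to that same target.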
Fig. 2 shows the flowchart for dynamically deleting a string.
Action t8: Removing an existing string in effect removes a chain of data nodes. Match the string in the tree to obtain its last node, then perform action t9.
Action t9: Set a variable "original node" equal to this node, move this node to its parent node, and perform action t10; repeat until this node is the root node.
Action t10: If this node has no child nodes, traverse its failure ownership set and change the failure target of each object in it to this node's failure target; then add all references in this node's failure ownership set to the failure ownership set of this node's failure target node. If this node's failure target is the root node, add them to the failure ownership set of the corresponding character table object. Finally, remove the reference to this node from the child set of its parent node and delete this node.
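Actions t8 to t10 can be sketched in the same style. Several points here are assumptions rather than statements from the source: the names `Node`, `build`, and `delete` are hypothetical; the header's per-character tables are folded into the root's ownership set for brevity; `build` is the classic static construction (trie plus breadth-first failure links), used only to obtain a valid structure to delete from; and the sketch adds two safeguards the text does not spell out, namely clearing the end-of-string flag and stopping at a node that still ends another string.

```python
from collections import deque


class Node:
    def __init__(self, char="", parent=None):
        self.char = char
        self.parent = parent
        self.children = {}
        self.owners = set()   # failure ownership set
        self.fail = None
        self.is_end = False


def build(patterns):
    """Classic static build, used here only to set up a tree for deletion."""
    root = Node()
    root.fail = root
    for pattern in patterns:
        node = root
        for ch in pattern:
            if ch not in node.children:
                node.children[ch] = Node(ch, node)
            node = node.children[ch]
        node.is_end = True
    queue = deque()
    for child in root.children.values():   # depth-1 nodes fail to the root
        child.fail = root
        root.owners.add(child)
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            f = node.fail
            while f is not root and ch not in f.children:
                f = f.fail
            child.fail = f.children.get(ch, root)
            child.fail.owners.add(child)
            queue.append(child)
    return root


def delete(root, pattern):
    """Remove pattern; return False if it was not stored."""
    node = root
    for ch in pattern:                      # t8: match to find the last node
        node = node.children.get(ch)
        if node is None:
            return False
    if not node.is_end:
        return False
    node.is_end = False                     # assumption: clear the end flag first
    # t9/t10: climb toward the root, removing leaves nothing else depends on.
    while node is not root and not node.children and not node.is_end:
        target = node.fail
        for other in node.owners:           # re-point nodes that failed to this one
            other.fail = target
            target.owners.add(other)
        target.owners.discard(node)         # assumption: unregister the deleted node
        parent = node.parent
        del parent.children[node.char]
        node = parent
    return True


root = build(["he", "she"])
she = root.children["s"].children["h"].children["e"]
delete(root, "he")
print(she.fail is root)
```

Re-pointing to the deleted node's own failure target is sufficient because any longer suffix would already have been the failure target before the deletion.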

Claims (1)

1. An improved method for dynamically generating the data for the Aho-Corasick algorithm, comprising operations for adding and deleting feature strings, characterized in that adding a feature string comprises the following steps:
Step 1: Split the feature string into single characters and match them one at a time through the DFA tree; where the DFA has no node for a character, add a corresponding node at that position.
Step 2: Set the corresponding data on the new node, then check whether the failure target of the parent node has a child for the character this node represents; if so, set that child as this node's failure target and perform Step 5; if not, perform Step 3.
Step 3: Repeat Step 4; if a node is found, stop; if not, strip one more leading character from the string and repeat Step 4; if no match has been made by the time only the last character remains, perform Step 6.
Step 4: Strip the first character of the string this node represents and match the remaining string against the DFA; if a matching node is found, take it as the failure target, perform Step 5, and return to Step 3; if none is found, also return to Step 3.
Step 5: Locate the failure ownership set of the failure target and traverse the references to all nodes in it, checking whether any of them should take the new node as its failure target; if so, add that node's pointer to the new node's failure ownership set and set that node's failure target to the new node.
Step 6: Add the node to the character table object set in the DFA header; if a character table object exists for the character this node represents, add this node's pointer to it and traverse the node references it holds, checking whether any of those nodes should take this node as its failure target; if so, add that node's pointer to this node's failure ownership set and set that node's failure target to this node.
Deleting a feature string comprises the following steps:
Step 7: Delete the string's characters one at a time from back to front, repeating Step 8 until Step 8 no longer returns.
Step 8: Find the corresponding node; if it has no children, delete it and return to Step 7.
CN201210312478.9A 2012-08-29 2012-08-29 Improved method for dynamic generation of data structure for Aho-Corasick algorithm Expired - Fee Related CN102867036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210312478.9A CN102867036B (en) 2012-08-29 2012-08-29 Improved method for dynamic generation of data structure for Aho-Corasick algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210312478.9A CN102867036B (en) 2012-08-29 2012-08-29 Improved method for dynamic generation of data structure for Aho-Corasick algorithm

Publications (2)

Publication Number Publication Date
CN102867036A CN102867036A (en) 2013-01-09
CN102867036B true CN102867036B (en) 2015-03-04

Family

ID=47445905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210312478.9A Expired - Fee Related CN102867036B (en) 2012-08-29 2012-08-29 Improved method for dynamic generation of data structure for Aho-Corasick algorithm

Country Status (1)

Country Link
CN (1) CN102867036B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106067039B (en) * 2016-05-30 2019-01-29 桂林电子科技大学 Method for mode matching based on decision tree beta pruning
CN107885492B (en) * 2017-11-14 2021-03-12 中国银行股份有限公司 Method and device for dynamically generating data structure in host

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101551803A (en) * 2008-03-31 2009-10-07 华为技术有限公司 Method and device for establishing pattern matching state machine and pattern recognition
CN101556619A (en) * 2009-05-04 2009-10-14 成都市华为赛门铁克科技有限公司 Node compression method and device thereof and multimode matching method and device thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2437560A (en) * 2006-04-28 2007-10-31 Roke Manor Research Constructing Aho Corasick trees
US7725510B2 (en) * 2006-08-01 2010-05-25 Alcatel-Lucent Usa Inc. Method and system for multi-character multi-pattern pattern matching

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
一种快速高效的模式匹配算法的应用研究;王杰 等;《计算机工程与应用》;20081111;93-95,185 *

Also Published As

Publication number Publication date
CN102867036A (en) 2013-01-09

Similar Documents

Publication Publication Date Title
US11681702B2 (en) Conversion of model views into relational models
US9495207B1 (en) Cataloging data sets for reuse in pipeline applications
CN105389349B (en) Dictionary update method and device
CA2562281C (en) Partial query caching
CN105574054B (en) A kind of distributed caching range query method, apparatus and system
US11176159B1 (en) Systems and methods for data analytics
KR101617696B1 (en) Method and device for mining data regular expression
CN102033748A (en) Method for generating data processing flow codes
US12001425B2 (en) Duplication elimination in depth based searches for distributed systems
CN106611037A (en) Method and device for distributed diagram calculation
TWI706260B (en) Index establishment method and device based on mobile terminal NoSQL database
CN101916281B (en) Concurrent computational system and non-repetition counting method
CN105471893B (en) A kind of distributed equivalent data flow connection method
CN106227799A (en) A kind of sql statement processing method based on distributed data base
EP3717997A1 (en) Cardinality estimation in databases
CN103309873B (en) The processing method of data, apparatus and system
CN102867036B (en) Improved method for dynamic generation of data structure for Aho-Corasick algorithm
CN106156171A (en) A kind of enquiring and optimizing method of Virtual asset data
WO2016177027A1 (en) Batch data query method and device
KR101955376B1 (en) Processing method for a relational query in distributed stream processing engine based on shared-nothing architecture, recording medium and device for performing the method
CN103092960A (en) Method for building software product feature tree model based on demand cluster
CN105608201A (en) Text matching method supporting multi-keyword expression
CN107679240B (en) Virtual identity mining method
CN105357177A (en) Method for processing data packet filtering rule set and data packet matching method
CN106445968A (en) Data merging method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150304

Termination date: 20170829
