CN107193776A - A kind of new transfer algorithm for matching regular expressions - Google Patents

Info

Publication number
CN107193776A
Authority
CN
China
Prior art keywords
regular expression
symbol
node
algorithm
regular
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710396008.8A
Other languages
Chinese (zh)
Inventor
王中风
于怀竹
林军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201710396008.8A
Publication of CN107193776A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems

Abstract

The invention discloses a method that applies a new construction algorithm to regular expression matching. The method comprises the following steps. Step one: generate, by software, an arbitrary regular expression based on the PCRE rule set. Step two: analyze the regular expression and convert it into a regular expression parse tree. Step three: convert the parse tree form into linked-list form. Step four: traverse the regular expression in linked-list form and, using the algorithm's basic construction rules for characters, process each node in the list to generate the finite state machine of the regular expression. Step five: from the finite state machine, generate the corresponding circuit structure, thereby realizing a regular expression compiler. The algorithm can convert any regular expression in the rule set. Compared with conventional conversion algorithms, the finite state machine it generates has far fewer intermediate node states and a simpler circuit structure, and the result is a regular expression matching circuit suitable for FPGAs, which gives the method a certain degree of novelty.

Description

A new conversion algorithm for regular expression matching
Technical field
The present invention relates to a conversion method in the field of regular expression matching, and in particular to an algorithm for constructing the finite state machine used when converting a regular expression into a matching engine.
Background art
With the rapid growth of internet traffic and the spread of data collected and exchanged over networks, matching large numbers of regular expression patterns against high-bandwidth data has made regular expression matching a major bottleneck in applications. While growing network bandwidth brings convenience, it also significantly increases the risk of network attacks. Regular expression matching is the process of finding all matching substrings in a text document; a regular expression is generally described by a pattern. Regular expression matching plays a vital role in network intrusion detection systems (NIDSs): open-source intrusion detection systems such as SNORT and L7-Filter make heavy use of it. Meanwhile, with the continued development of semiconductor technology, FPGAs now offer massively parallel processing units and large on-chip memories, so hardware acceleration of regular expression matching on FPGAs has become an active research topic. An effective conversion algorithm that turns a regular expression into a finite state machine, and from there into a regular expression matching circuit suitable for FPGAs, is therefore increasingly important.
Summary of the invention
In view of the above problems, the present invention proposes a new algorithm for converting a regular expression into a finite state machine. The algorithm can convert any regular expression based on the PCRE rule set into a finite state machine, from which a modular regular expression matching engine circuit suitable for FPGAs is then generated. The algorithm is efficient and generates few intermediate nodes; the resulting circuit structure is also comparatively simple, which reduces resource usage on the FPGA. The process of applying the algorithm to build a regular expression matching circuit mainly comprises the following steps:
Step one: given any regular expression based on the PCRE rule set, analyze the rules in the expression and convert it into the form of a regular expression parse tree;
Step two: convert the resulting regular expression tree into linked-list (token list) form;
Step three: using the basic symbol construction rules of the new algorithm, traverse the regular expression decomposed into linked-list form and convert it into a finite state machine;
Step four: from the finite state machine, using the circuit units of the basic symbols of regular expressions, produce a modular engine circuit suitable for FPGAs.
Among the steps of the proposed algorithm, step one is implemented as follows:
1) Within the PCRE rule set, software generates an arbitrary regular expression that follows the PCRE rules, so that the generated regular expression is general;
2) The rules contained in the regular expression are then analyzed one by one to generate the corresponding regular expression parse tree.
Among the steps of the proposed algorithm, step two comprises the following:
1) In the parse tree form from step one, repeated character matching produces many identical nodes and an unnecessarily deep tree, which increases the number of recursive calls; the parse tree also does not fully support the Perl syntax extensions. The expression therefore needs to be converted into linked-list form;
2) After conversion to linked-list form, each list node consists of four parts: val, rep, next and child. The val part is generally a character class, a bracket, the union symbol (|) or an extended PCRE syntax element. The rep part is generally a bounded repetition count or the closure symbol (*). The next part points to the next node of the list; in the regular expression, the next node is generally joined by the concatenation symbol or the union symbol (|). The child part is the sub-list of this regular expression list.
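As a rough illustration of the four-part node described above, the following Python sketch models a token list node and rebuilds the expression text from it. The text does not spell out how groups and alternatives are laid out in the list, so the layout used here (a "(" node whose child chain holds one "|" node per alternative) and the names `Token`, `chain` and `render` are assumptions made purely for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    val: str                        # character, "[...]" class, "(" group, or "|" alternative
    rep: Optional[str] = None       # "*", "+", "?", "{m}", "{m,n}" or None
    next: Optional[Token] = None    # next node in the same chain
    child: Optional[Token] = None   # head of the sub-list (groups and alternatives)


def chain(*toks: Token) -> Token:
    """Link tokens through their next pointers and return the head."""
    for a, b in zip(toks, toks[1:]):
        a.next = b
    return toks[0]


def render(tok: Optional[Token]) -> str:
    """Rebuild the regular expression text by walking next and child pointers."""
    out = []
    while tok is not None:
        if tok.val == "(":
            # Assumed layout: the group's child chain is a list of "|" nodes,
            # each of which carries one alternative in its own child chain.
            alts = []
            alt = tok.child
            while alt is not None:
                alts.append(render(alt.child))
                alt = alt.next
            body = "(" + "|".join(alts) + ")"
        else:
            body = tok.val
        out.append(body + (tok.rep or ""))
        tok = tok.next
    return "".join(out)


group = Token("(", rep="+", child=chain(
    Token("|", child=chain(Token("c"), Token("[ab]"))),
    Token("|", child=Token("d")),
))
head = chain(Token("a", rep="{4}"), Token("b", rep="*"), group, Token("f"))
print(render(head))
```

The top-level chain here has only four nodes (a{4}, b*, the group and f), which is the kind of flattening the linked-list form is meant to provide compared with a deep binary parse tree.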
The algorithm proposed in step three can handle the various symbols of a regular language and covers the following:
1) The algorithm includes state-diagram constructions for the basic symbols of regular expressions, namely the state transition graphs of the concatenation symbol, the union symbol (|), the closure symbol (*), the limited-repetition symbol (?), the + symbol and the {m, n} symbol;
2) In theory, the state transition graph of any regular expression can be composed from the basic-symbol state diagrams described in 1), which allows a modular design. Starting from the first node of the regular expression linked list, the proposed algorithm processes the information in each list node and builds a finite-state directed graph, ending at the last node, to produce the finite state machine of the regular expression;
3) The subroutine described by the algorithm mainly handles the following four cases:
First, when the val part of the current node in the list is the union symbol | (i.e. tcur.val = "|") and the node has a sub-expression, i.e. a child part (tcur.child), the nodes of the sub-expression are connected to the parent node through empty-character (ε) transitions;
Second, for a node with the closure symbol (*) or the repetition symbol (+), a pseudo state p must be created temporarily to represent the source state of the feedback loop; the pseudo state p can be deleted afterwards;
Third, for a list node with the closure symbol (*), its incoming empty-character transition is passed directly on to the node that follows it;
Fourth, a bounded repetition count can be converted directly into states of the NFA matching engine.
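The fourth case can be pictured as unrolling: a bounded repetition such as x{m,n} becomes a chain of n states, and every state from the m-th onward is a valid exit, so no counter and no empty-character transitions are needed. The following is a minimal sketch under assumptions: the `NFA` class, `build_repeat` and `accepts` are hypothetical names, and a single literal character stands in for the repeated element.

```python
from collections import defaultdict


class NFA:
    """Tiny transition table; state 0 is the start state."""

    def __init__(self):
        self.n_states = 1
        self.edges = defaultdict(set)   # (state, symbol) -> set of next states

    def new_state(self):
        self.n_states += 1
        return self.n_states - 1

    def add(self, src, symbol, dst):
        self.edges[(src, symbol)].add(dst)


def build_repeat(nfa, prev, symbol, m, n):
    """Unroll symbol{m,n} after the states in prev.

    Returns the set of states from which matching may continue.
    """
    exits = set(prev) if m == 0 else set()
    cur = set(prev)
    for i in range(1, n + 1):
        s = nfa.new_state()
        for p in cur:
            nfa.add(p, symbol, s)
        cur = {s}
        if i >= m:                      # copies beyond the m-th are optional
            exits.add(s)
    return exits


def accepts(nfa, finals, text):
    """Whole-string match by tracking the set of active states."""
    cur = {0}
    for ch in text:
        cur = {d for s in cur for d in nfa.edges.get((s, ch), ())}
    return bool(cur & finals)


nfa = NFA()
finals = build_repeat(nfa, {0}, "a", 2, 3)   # a{2,3}
print(sorted(finals), accepts(nfa, finals, "aa"), accepts(nfa, finals, "aaaa"))
```

Unrolling trades states for simplicity: the state count grows with n, which is usually acceptable for the small repetition bounds that this kind of direct construction targets.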
Brief description of the drawings
The invention is described in further detail below with reference to the drawings and specific embodiments, from which the above and other advantages of the invention will become clearer. The figures use the regular expression a{4}b*(c[ab]|d)+f as an example.
Fig. 1 shows the given regular expression converted into parse tree form;
Fig. 2 shows the regular expression in parse tree form converted into linked-list form;
Fig. 3 shows the construction rules for characters and symbols in the conversion algorithm: (a) the concatenation symbol, (b) the union symbol, (c) the question mark in regular expressions, (d) the closure symbol, (e) the plus sign in regular expressions, and (f) the repetition of a character m to n times;
Fig. 4 shows the regular expression finite state machine finally generated by the proposed conversion algorithm;
Fig. 5 explains some of the terms used in the conversion algorithm.
Detailed description of the embodiments
The implementation of the invention is described in detail below with reference to the drawings. The core idea of the invention is to obtain the finite state machine of a regular expression by applying the proposed algorithm's basic construction rules for characters and symbols while traversing the regular expression in linked-list form.
The steps required to convert a regular expression into a nondeterministic finite state machine are as follows:
Step one: the invention uses a self-written software program to randomly generate a regular expression from the PCRE rule set; here the regular expression a{4}b*(c[ab]|d)+f is taken as an example;
Step two: analyze and process the regular expression to obtain the parse tree form shown in Fig. 1. Because the parse tree form brings unnecessary tree depth and additional recursion, further conversion is needed;
Step three: examine the regular expression in parse tree form and, as shown in Fig. 2, convert it into two-dimensional linked-list form. Each list node consists of four parts, val, rep, next and child; the regular expression in this linked-list form is more compact and easier to traverse;
Step four: following the linked-list form, traverse each node in the list and convert it into the corresponding finite state machine.
Step four comprises the following processing steps, analyzed with reference to Fig. 5:
(1) When the traversal reaches a node in the list, as shown in Fig. 5, tcur denotes the current node and Spre denotes the set of all predecessor states of the current node's state;
(2) Check whether the val part of the current node (i.e. tcur.val) is the symbol "|". If tcur.val == "|" is true, traverse the child list of the current node (i.e. tcur.child), repeating from step (1) of step four;
(3) If tcur.val == "|" in step (2) is false, the rep part of the current node (i.e. tcur.rep) must also be examined;
(4) Depending on tcur.rep: if it is the * symbol or the + symbol, a pseudo state p must be created (using create_state(p)). If the val part of the current node is a bracket, the child list of the current node must be processed, going to step (1); otherwise, if the val part is a character, the state for that character is created, after which the pseudo state p must be deleted (using delete_state(p));
(5) Depending on tcur.rep: if it is the {m, n} symbol, construct the states directly according to diagram (f) of Fig. 3;
(6) Move on to the next node of the current node (tcur.next) and repeat from step (1).
The terms used in step four are explained in Fig. 5. This step can be implemented by two mutually recursive programs. Program one handles a list node as a whole and converts the node into a finite-state directed graph; it deals with the union symbol (|), the closure (*) and bounded repetition ({m, n}). Program two mainly handles the val part of a list node, which is usually a character class or a bracket. Step four can therefore handle any character in the linked-list form and convert a regular expression list of arbitrary structure, finally yielding the finite state machine corresponding to the regular expression. Two concepts in the terminology of Fig. 5 are especially important for constructing modular regular expression circuits. The first is the predecessor state set Spre of the current state: it contains all states that reach the current state through empty-character (ε) transitions, and using it effectively reduces the number of intermediate states in the finite state machine. The second is the pseudo state p created for nodes that contain a closure: it serves as a temporary placeholder for the source state of the closure symbol's (*) feedback, and during processing it is replaced by the set of source states. After the regular expression in linked-list form has been traversed and processed with the algorithm's basic construction rules for characters and symbols, the finite state machine shown in Fig. 4 is obtained.
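The two mutually recursive programs, the predecessor set Spre and the pseudo state p can be sketched in Python as follows. This is one interpretation of the description rather than the exact procedure of the invention: the node layout (a "(" group whose child chain holds one "|" node per alternative), the names `Token`, `chain`, `Builder` and `accepts`, and the limited character-class handling (literal members only, no ranges, no open-ended {m,}) are all assumptions. Python's `re` module is used only as a reference oracle for the example expression.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    val: str                           # char, "[...]" class, "(" group, or "|" alternative
    rep: str = ""                      # "", "*", "+", "?", "{m}" or "{m,n}"
    next: Optional["Token"] = None
    child: Optional["Token"] = None


def chain(*toks):
    """Link tokens through their next pointers and return the head."""
    for a, b in zip(toks, toks[1:]):
        a.next = b
    return toks[0]


class Builder:
    def __init__(self):
        self.count = 1                 # state 0 is the start state
        self.edges = set()             # (src, char, dst) triples; no epsilon edges

    def new_state(self):
        self.count += 1
        return self.count - 1

    def build_list(self, tok, spre):
        """Walk a chain; each node starts from the exit states of the previous one."""
        while tok is not None:
            spre = self.build_node(tok, spre)
            tok = tok.next
        return spre

    def build_node(self, tok, spre):
        """Program one: handle a node as a whole, i.e. apply its rep part."""
        if tok.rep in ("*", "+"):
            p = self.new_state()       # pseudo state standing in for the loop source
            exits = self.build_val(tok, spre | {p}) - {p}
            looped = {(s, c, d) for (s, c, d) in self.edges if s == p}
            self.edges -= looped       # delete p, then replay its edges from the real sources
            self.edges |= {(e, c, d) for (_, c, d) in looped for e in exits}
            return (exits | spre) if tok.rep == "*" else exits
        if tok.rep == "?":
            return self.build_val(tok, spre) | spre
        if tok.rep.startswith("{"):
            lo, _, hi = tok.rep[1:-1].partition(",")
            m, n = int(lo), int(hi or lo)
            cur, exits = spre, (set(spre) if m == 0 else set())
            for i in range(1, n + 1):  # unroll n copies; copies after the m-th are optional
                cur = self.build_val(tok, cur)
                if i >= m:
                    exits |= cur
            return exits
        return self.build_val(tok, spre)

    def build_val(self, tok, spre):
        """Program two: handle the val part (a group or a character class)."""
        if tok.val == "(":
            exits, alt = set(), tok.child
            while alt is not None:     # every alternative starts from the same Spre
                exits |= self.build_list(alt.child, spre)
                alt = alt.next
            return exits
        chars = set(tok.val[1:-1]) if tok.val.startswith("[") else {tok.val}
        s = self.new_state()
        self.edges |= {(q, c, s) for q in spre for c in chars}
        return {s}


group = Token("(", "+", child=chain(
    Token("|", child=chain(Token("c"), Token("[ab]"))),
    Token("|", child=Token("d")),
))
head = chain(Token("a", "{4}"), Token("b", "*"), group, Token("f"))
b = Builder()
finals = b.build_list(head, {0})


def accepts(text):
    """Whole-string match by tracking the set of active states."""
    cur = {0}
    for ch in text:
        cur = {d for (s, c, d) in b.edges if s in cur and c == ch}
    return bool(cur & finals)


for s in ["aaaabbcadf", "aaaadf", "aaaaf", "aaaacf"]:
    print(s, accepts(s), bool(re.fullmatch(r"a{4}b*(c[ab]|d)+f", s)))
```

Because every construction receives the full set of predecessor states instead of a single entry state joined by empty-character edges, the resulting graph contains only character-consuming states. The numbers of deleted pseudo states are simply left unused in this sketch.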
Step five: once the finite state machine of the regular expression has been obtained by the algorithm, the corresponding circuit for regular expression matching can be generated.
In summary, the algorithm that a typical regular expression compiler uses to convert a regular expression from linked-list form into a finite state machine differs from the algorithm used in the present invention. The conversion method proposed here effectively reduces the number of intermediate node states produced while generating the finite state machine and simplifies the structure of the generated regular expression circuit, so that a regular expression can be converted quickly into a circuit structure and mapped accurately onto an FPGA.
The invention provides a new algorithm for converting regular expressions into finite state machines. The conversion can be carried out in many ways, and the conversion algorithm described here is one preferred embodiment. The embodiments above are exemplary and should not be construed as limiting the invention. Those of ordinary skill in the art can make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.
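The text does not detail the circuit generation itself. One common way to map an NFA without empty-character transitions onto hardware is one-hot encoding: one register bit per state, where each bit's next value is the OR of its predecessor bits gated by a character comparator. The Python sketch below simulates that idea in software with an integer bitmask. The small hand-written transition table (for the expression ab*c) and the names `EDGES`, `FINAL_MASK`, `step` and `matches` are illustrative assumptions, not the circuit of the invention.

```python
# Hand-written, epsilon-free NFA for the expression ab*c (illustrative only).
# State 0 is the start state, state 3 is accepting.
EDGES = [(0, "a", 1), (1, "b", 2), (2, "b", 2), (1, "c", 3), (2, "c", 3)]
FINAL_MASK = 1 << 3


def step(active: int, ch: str) -> int:
    """One clock cycle: bit dst is set if any enabled predecessor bit is set."""
    nxt = 0
    for src, sym, dst in EDGES:
        if (active >> src) & 1 and sym == ch:   # register bit AND comparator output
            nxt |= 1 << dst
    return nxt


def matches(text: str) -> bool:
    """Whole-string match; bit i of active mirrors the register of state i."""
    active = 1                                   # only the start-state bit is set
    for ch in text:
        active = step(active, ch)
    return bool(active & FINAL_MASK)


for s in ["abbc", "ac", "abcb", "a"]:
    print(s, matches(s))
```

In such a mapping each state costs roughly one flip-flop plus a small amount of combinational logic, which is why reducing the number of intermediate states translates directly into lower FPGA resource usage.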

Claims (5)

1. A method of applying the new construction algorithm to regular expression matching, characterized by comprising the following steps:
Step one: convert any regular expression into regular expression parse tree form;
Step two: decompose the resulting regular expression parse tree form into linked-list form;
Step three: using the basic symbol construction rules of the proposed conversion algorithm, traverse the regular expression decomposed into linked-list form and convert it into a finite state machine;
Step four: from the finite state machine, using the circuit units of the basic symbols of regular expressions, produce a modular engine circuit suitable for FPGAs.
2. Step one according to claim 1, comprising the following steps:
1) generating, by software, any regular expression based on the PCRE rule set;
2) analyzing the given regular expression and converting it into the corresponding regular expression parse tree form.
3. Step two according to claim 1, comprising the following:
1) in the parse tree form from step one, repeated character matching produces many identical nodes and an unnecessarily deep tree, which increases the number of recursive calls; the parse tree also does not fully support the Perl syntax extensions, so the expression needs to be converted into linked-list form;
2) after conversion to linked-list form, each list node consists of four parts, val, rep, next and child: the val part is generally a character class, a bracket, the union symbol (|) or an extended PCRE syntax element; the rep part is generally a bounded repetition count ({m, n}) or the closure symbol (*); the next part points to the next node of the list, which in the regular expression is generally joined by the concatenation symbol or the union symbol (|); the child part is the sub-list of this regular expression list.
4. Step three according to claim 1, in which the proposed new algorithm covers the following:
1) the algorithm includes state-diagram construction rules for the basic symbols of regular expressions: the state transition graphs of the concatenation symbol, the union symbol (|), the closure symbol (*), the limited-repetition symbol (?), the + symbol and the {m, n} symbol;
2) in theory, the state transition graph of any regular expression can be composed from the basic state diagrams described in 1), so that the regular expression matching circuit can be designed modularly; starting from the first node of the regular expression linked list, the proposed algorithm processes the information in each list node and builds a finite-state directed graph, ending at the last node, to produce the finite state machine of the regular expression;
3) the proposed algorithm's processing of list node information mainly covers the following four cases:
First, when the val part of the current node in the list is the union symbol | (i.e. tcur.val = "|") and the node has a sub-expression, i.e. a child part (tcur.child), the nodes of the sub-expression are connected to the parent node through empty-character (ε) transitions;
Second, for a node with the closure symbol (*) or the repetition symbol (+), a pseudo state p must be created temporarily to represent the source state of the feedback loop; the pseudo state can be deleted afterwards;
Third, for a list node with the closure symbol (*), its incoming empty-character transition is passed directly on to the node that follows it;
Fourth, a bounded repetition count can be converted directly into states of the NFA matching engine.
5. The finite state machine finally obtained by the algorithm according to claim 4 is comparatively concise and facilitates the generation of modular regular expression matching circuits.
CN201710396008.8A 2017-05-24 2017-05-24 A kind of new transfer algorithm for matching regular expressions Pending CN107193776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710396008.8A CN107193776A (en) 2017-05-24 2017-05-24 A kind of new transfer algorithm for matching regular expressions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710396008.8A CN107193776A (en) 2017-05-24 2017-05-24 A kind of new transfer algorithm for matching regular expressions

Publications (1)

Publication Number Publication Date
CN107193776A true CN107193776A (en) 2017-09-22

Family

ID=59876294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710396008.8A Pending CN107193776A (en) 2017-05-24 2017-05-24 A kind of new transfer algorithm for matching regular expressions

Country Status (1)

Country Link
CN (1) CN107193776A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100138367A1 (en) * 2007-08-02 2010-06-03 Nario Yamagaki SYSTEM, METHOD, AND PROGRAM FOR GENERATING NON-DETERMINISTIC FINITE AUTOMATON NOT INCLUDING e-TRANSITION
CN101201836A (en) * 2007-09-04 2008-06-18 浙江大学 Method for matching in speedup regular expression based on finite automaton containing memorization determination
US20110022617A1 (en) * 2008-03-19 2011-01-27 Norio Yamagaki Finite automaton generation system for string matching for multi-byte processing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
敬茂华,杨义先,于长永,辛阳: "一种构造正则表达式更小ε-NFA的方法", 《东北大学学报(自然科学版)》 *
殷珍珍: "《基于正则表达式的多模式匹配算法研究》", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
胥清化: "《基于正则表达式的高速协议识别研究与实现》", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110324204A (en) * 2019-07-01 2019-10-11 中国人民解放军陆军工程大学 A kind of high speed regular expression matching engine realized in FPGA and method
CN110324204B (en) * 2019-07-01 2020-09-11 中国人民解放军陆军工程大学 High-speed regular expression matching engine and method implemented in FPGA (field programmable Gate array)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170922
