CN107193776A - A new conversion algorithm for regular expression matching - Google Patents
- Publication number
- CN107193776A (application CN201710396008.8A)
- Authority
- CN
- China
- Prior art keywords
- regular expression
- symbol
- node
- algorithm
- regular
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/11—Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
Abstract
The invention discloses a method that uses a new construction algorithm for regular expression matching. The method comprises the following steps. Step one: generate, by software, an arbitrary regular expression based on the PCRE rule set. Step two: analyze the regular expression and convert it into a regular expression parse tree. Step three: convert the parse tree into linked-list form. Step four: traverse the regular expression in linked-list form and, applying the algorithm's basic construction rules for characters, process each node of the list to generate the finite state machine of the regular expression. Step five: from the finite state machine, generate the corresponding circuit structure, thereby realizing a regular expression compiler. The algorithm can convert any regular expression in the rule set. Compared with conventional conversion algorithms, the finite state machine it generates has far fewer intermediate states and a simpler circuit structure, and the result is a regular expression matching circuit suitable for FPGAs. The approach has a degree of novelty.
Description
Technical field
The present invention relates to a conversion method in the field of regular expression matching, and in particular to the algorithm for constructing the finite state machine that is used when a regular expression is converted into a matching engine.
Background
With the rapid growth of Internet traffic and the spread of data collected and exchanged over networks, matching large numbers of regular expression patterns against high-bandwidth data has made regular expression matching a major bottleneck in applications. Growing network bandwidth brings convenience, but it also greatly increases the risk of network attacks. Regular expression matching is the process of finding all matching substrings in a text document; a regular expression is generally described by a pattern. Regular expression matching plays a vital role in network intrusion detection systems (NIDSs): open-source intrusion detection systems such as SNORT and L7-Filter use it extensively. At the same time, with the continuous development of semiconductor technology, FPGAs now offer massively parallel processing units and large on-chip memories, so FPGA-based hardware acceleration of regular expression matching has become an active research topic. An effective algorithm for converting regular expressions into finite state machines is therefore increasingly important for building regular expression matching circuits suitable for FPGAs.
Summary of the invention
To address the above problem, the present invention proposes a new algorithm for converting a regular expression into a finite state machine. The algorithm can convert any regular expression based on the PCRE rule set into a finite state machine, from which a modular regular expression matching engine circuit suitable for FPGAs is then generated. The algorithm is efficient and generates few intermediate nodes; the generated circuit structure is also relatively simple, which reduces resource usage on the FPGA. Using the algorithm to implement a regular expression matching circuit mainly comprises the following steps:
Step one: given any regular expression based on the PCRE rule set, analyze the rules in the expression and convert it into a regular expression parse tree;
Step two: convert the resulting parse tree into linked-list (token list) form;
Step three: using the basic symbol construction rules of the new algorithm, traverse the regular expression in linked-list form and convert it into a finite state machine;
Step four: from the finite state machine, using the circuit units for the basic symbols of regular expressions, generate a modular engine circuit suitable for FPGAs.
Among the steps of the proposed algorithm, step one is implemented as follows:
1) From the PCRE rule set, software generates an arbitrary regular expression that contains PCRE rules, so the generated regular expression is general;
2) The rules contained in the regular expression are then analyzed one by one to generate the corresponding regular expression parse tree.
Step two comprises the following:
1) In the parse tree form from step one, repeated character matches produce a large number of identical nodes, which makes the tree unnecessarily deep and increases the number of recursive calls; the parse tree also does not fully support the Perl syntax extensions. It therefore needs to be converted into the linked-list form of the regular expression;
2) In the linked-list form, each list node consists of four parts: val, rep, next and child. The val part is usually a character class, a bracket, a union symbol (|) or an extended PCRE syntax element. The rep part usually holds a bounded repetition count or the closure symbol (*). The next part points to the next node of the list; in a regular expression, the next node is generally joined by the concatenation symbol or by the union symbol (|). The child part is the sub-list of that node.
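The four-part node described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's data structure: the class name `Token`, the helpers `chain` and `render`, the string encoding of `rep`, and in particular the way alternatives are laid out under "|" nodes are all hypothetical, and the example pattern is read as `a{4}b*(c[ab]|d)+f`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    val: str                          # literal, character class, "()" group or "|" union
    rep: str = ""                     # "", "*", "+", "?" or "{m,n}" (assumed encoding)
    next: Optional["Token"] = None    # following node in the same list
    child: Optional["Token"] = None   # head of the sub-list (groups and unions only)


def chain(*tokens: Token) -> Token:
    """Link the tokens through `next` and return the head of the list."""
    for left, right in zip(tokens, tokens[1:]):
        left.next = right
    return tokens[0]


def render(head: Optional[Token]) -> str:
    """Walk `next` and `child` and print the list back as a pattern."""
    out = []
    node = head
    while node is not None:
        if node.val == "()":
            out.append("(" + render(node.child) + ")")
        elif node.val == "|":
            # Assumed layout: each "|" node owns one alternative in `child`.
            out.append(render(node.child))
            if node.next is not None:
                out.append("|")
        else:
            out.append(node.val)
        out.append(node.rep)
        node = node.next
    return "".join(out)


alternatives = chain(
    Token("|", child=chain(Token("c"), Token("[ab]"))),
    Token("|", child=Token("d")),
)
token_list = chain(
    Token("a", rep="{4}"),
    Token("b", rep="*"),
    Token("()", rep="+", child=alternatives),
    Token("f"),
)
print(render(token_list))
```

Keeping the repetition in `rep` rather than as a separate tree level is what keeps the list shallow: `a{4}` is one node instead of a subtree.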
The algorithm proposed in step three can handle the various symbols of a given regular language and comprises the following:
1) The algorithm includes state diagram constructions for the basic symbols of regular expressions, namely the state transition diagrams for the concatenation symbol, the union symbol (|), the closure symbol (*), the question mark symbol (?), the plus symbol (+) and the {m,n} symbol;
2) In theory, the state transition diagram of any regular expression can be composed from the state diagrams of the basic symbols in 1), which allows a modular design. Starting from the first node of the regular expression linked list, the proposed algorithm processes the information in each list node and builds a directed finite state graph until the last node is reached, finally producing the finite state machine of the regular expression;
3) The subroutine of the algorithm mainly handles the following four cases:
First, when the val part of the current list node is the union symbol | (i.e. tcur.val = "|") and the node has a sub-expression, i.e. a child part (tcur.child), the nodes of the sub-expression are connected to the parent node through empty-character (ε) transitions;
Second, for a node with the closure symbol (*) or the plus symbol (+), a pseudo state p is created temporarily to stand for the source state of the feedback loop; the pseudo state p can be deleted afterwards;
Third, for a list node with the closure symbol (*), the incoming empty-character transition is passed directly to the node that follows it;
Fourth, a bounded repetition count can be converted directly into states of the NFA matching engine.
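The last case, turning a bounded repetition directly into states, can be illustrated by plain unrolling. The sketch below is an assumption about what "directly converted" could look like, not the construction in Fig. 3(f), which is not reproduced in this text; the function name `expand_repeat` and the integer state numbering are hypothetical.

```python
def expand_repeat(symbol, m, n, start=0):
    """Unroll symbol{m,n} into a chain of states beginning at `start`.

    Returns (transitions, exits): transitions are (source, symbol, target)
    triples, and exits are the states reached after m..n repetitions.
    """
    transitions = [(start + i, symbol, start + i + 1) for i in range(n)]
    exits = list(range(start + m, start + n + 1))
    return transitions, exits


# a{4} becomes four consecutive states with a single exit.
print(expand_repeat("a", 4, 4))
# a{2,3} has two possible exits, after two or three repetitions.
print(expand_repeat("a", 2, 3))
```

No empty-character transitions are needed here: the optional repetitions are expressed by having more than one exit state.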
Brief description of the drawings
The present invention is described in further detail below with reference to the accompanying drawings and the detailed description, from which the above and other advantages of the invention will become apparent. The figures use the regular expression a{4}b*(c[ab]|d)+f as an example.
Fig. 1 shows the given regular expression converted into a left parse tree;
Fig. 2 shows the regular expression in parse tree form converted into linked-list form;
Fig. 3 shows the graphical construction rules for characters and symbols in the conversion algorithm: (a) the concatenation symbol, (b) the union symbol, (c) the question mark in a regular expression, (d) the closure symbol, (e) the plus sign in a regular expression, and (f) the repetition of a character m to n times;
Fig. 4 shows the regular expression finite state machine finally generated by the proposed conversion algorithm;
Fig. 5 explains some of the terms used in the conversion algorithm.
Detailed description
The implementation of the present invention is described in detail below with reference to the accompanying drawings. The core idea of the invention is to obtain the finite state machine of a regular expression by applying the proposed algorithm's basic construction rules for the characters and symbols of regular expressions while traversing the regular expression in linked-list form.
Converting a regular expression into a nondeterministic finite state machine requires the following steps:
Step one: The invention uses a self-written software program to randomly generate a regular expression from the PCRE rule set; the regular expression a{4}b*(c[ab]|d)+f is used here as an example;
Step two: The regular expression is analyzed and processed to obtain the regular expression parse tree shown in Fig. 1. Because the parse tree form is unnecessarily deep and increases the number of recursive calls, further conversion is needed;
Step three: The regular expression in parse tree form is converted into the two-dimensional linked-list form shown in Fig. 2. Each list node consists of four parts, val, rep, next and child; the linked-list form is more compact and easier to traverse;
Step four: Each node of the regular expression linked list is traversed and converted into the corresponding finite state machine.
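To make the running example concrete, the strings it accepts can be checked with a reference matcher. The sketch below uses Python's standard `re` module, not the engine described here, and reads the example pattern as `a{4}b*(c[ab]|d)+f`, which is an assumption about its intended spelling; the helper name `accepts` is hypothetical.

```python
import re

# Assumed spelling of the example pattern, with stray spaces removed.
PATTERN = re.compile(r"a{4}b*(c[ab]|d)+f")


def accepts(text: str) -> bool:
    """True when the whole string matches the example pattern."""
    return PATTERN.fullmatch(text) is not None


for sample in ["aaaadf", "aaaabbcadf", "aaaaf", "aaadf"]:
    print(sample, accepts(sample))
```

A finite state machine generated for the same pattern should agree with this matcher on every input, which makes it a convenient oracle when testing a conversion.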
Step four comprises the following processing steps, explained with reference to Fig. 5:
(1) When the traversal reaches a node of the list, tcur denotes the current node and Spre denotes the set of all states preceding the state of the current node, as shown in Fig. 5;
(2) Check whether the val part of the current node (tcur.val) is the symbol "|". If tcur.val == "|" is true, traverse the sub-list of the current node (tcur.child), repeating from step (1);
(3) If tcur.val == "|" in step (2) is false, the rep part of the current node (tcur.rep) must also be examined;
(4) If tcur.rep is the * symbol or the + symbol, create a pseudo state p (using create_state(p)). If the val part of the current node is a bracket, process the sub-list of the current node by going to step (1); otherwise, if the val part is a character, create the state for that character. The pseudo state p is then deleted (using delete_state(p));
(5) If tcur.rep is the {m,n} symbol, construct the states directly according to diagram (f) of Fig. 3;
(6) Traverse the next node of the current node (tcur.next) and repeat from step (1).
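The role of the pseudo state p in step (4) can be sketched as a placeholder that is later replaced by real source states. This is a simplified illustration under assumptions, not the patent's create_state/delete_state routines: the name `resolve_pseudo_state`, the string `"p"` as placeholder and the triple encoding of transitions are all hypothetical.

```python
PSEUDO = "p"  # temporary placeholder for the source of the feedback loop


def resolve_pseudo_state(transitions, sources):
    """Replace every edge leaving the placeholder by edges from `sources`."""
    resolved = set()
    for src, symbol, dst in transitions:
        if src == PSEUDO:
            resolved.update((s, symbol, dst) for s in sources)
        else:
            resolved.add((src, symbol, dst))
    return resolved


# Body of (ab)+ : entered from state 0 and, via the placeholder, from its own exit.
body = {(0, "a", 1), (PSEUDO, "a", 1), (1, "b", 2)}
exits = {2}  # known only after the body has been processed
nfa = resolve_pseudo_state(body, exits)
print(sorted(nfa))
```

The placeholder is needed because the exit states of the repeated body are not known when its first state is created; once they are, the loop edge can be attached and the placeholder disappears from the machine.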
The terms used in step four are explained in Fig. 5. This step can be implemented with two mutually recursive routines. The first routine handles a list node as a whole and converts the node into the directed finite state graph; it handles the union (|), closure (*) and bounded repetition ({m,n}) symbols. The second routine mainly handles the val part of a list node, usually a character class or a bracket. Step four can therefore handle any character in the linked-list form and convert a regular expression linked list of any structure into the corresponding finite state machine. Two of the terms explained in Fig. 5 are especially important for constructing modular regular expression circuits. The first is the predecessor state set Spre of the current state: it contains all states that precede the current state through empty-character (ε) transitions, and it effectively reduces the number of intermediate states in the finite state machine. The second is the pseudo state p created for nodes that contain a closure: it serves as a temporary placeholder for the source state of the closure symbol's (*) feedback loop and is replaced by the set of source states during processing. After the regular expression in linked-list form has been traversed by the algorithm, applying its basic construction rules for characters and symbols, the finite state machine shown in Fig. 4 is obtained.
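The effect of the predecessor set Spre can be sketched for a flat list of single characters. The code below is a simplified illustration, not the patent's algorithm: it resembles the textbook position-automaton construction, handles only a character followed by "", "*", "+" or "?", and the names `build_nfa` and `matches` are hypothetical.

```python
def build_nfa(tokens):
    """tokens: list of (char, rep) with rep in "", "*", "+", "?".

    Returns (transitions, accepting); state 0 is the start state.
    """
    transitions = set()   # (source, char, target)
    s_pre = {0}           # states from which the next token may be entered
    n_states = 1
    for char, rep in tokens:
        state = n_states
        n_states += 1
        for src in s_pre:
            transitions.add((src, char, state))
        if rep in ("*", "+"):
            transitions.add((state, char, state))  # feedback loop
        if rep in ("*", "?"):
            s_pre = s_pre | {state}   # token may be skipped: keep old predecessors
        else:
            s_pre = {state}
    return transitions, s_pre


def matches(nfa, text):
    """Simulate the NFA on the whole input."""
    transitions, accepting = nfa
    current = {0}
    for ch in text:
        current = {t for (s, c, t) in transitions if s in current and c == ch}
    return bool(current & accepting)


nfa = build_nfa([("a", ""), ("b", "*"), ("c", "")])   # ab*c
print(len(nfa[0]), matches(nfa, "ac"), matches(nfa, "abbc"), matches(nfa, "abb"))
```

Each token contributes exactly one state, and skipping an optional token is expressed by carrying its predecessors forward in `s_pre` instead of adding an intermediate empty-character state.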
Step five: Once the finite state machine of the regular expression has been obtained by the algorithm, the corresponding regular expression matching circuit can be generated.
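The text does not detail how the circuit is generated. One common way to map an NFA without empty-character transitions onto hardware is a one-hot encoding, with one flip-flop per state; the Python sketch below simulates that idea in software. It is an assumption offered for illustration, not the circuit of this invention, and the names `compile_one_hot` and `clock` as well as the hand-written transition set for `ab*c` are hypothetical.

```python
def compile_one_hot(transitions, n_states):
    """For each symbol, store per target state the bit mask of its source states."""
    table = {}
    for src, symbol, dst in transitions:
        masks = table.setdefault(symbol, [0] * n_states)
        masks[dst] |= 1 << src
    return table


def clock(active, symbol, table):
    """One clock cycle: a state turns on if any of its sources was on."""
    masks = table.get(symbol)
    if masks is None:
        return 0
    nxt = 0
    for dst, mask in enumerate(masks):
        if active & mask:
            nxt |= 1 << dst
    return nxt


# Hand-written transitions for ab*c: state 0 is the start, state 3 accepts.
TRANSITIONS = {(0, "a", 1), (1, "b", 2), (2, "b", 2), (1, "c", 3), (2, "c", 3)}
table = compile_one_hot(TRANSITIONS, 4)

active = 1 << 0
for ch in "abbc":
    active = clock(active, ch, table)
print(bool(active & (1 << 3)))
```

In such a layout every state costs one register plus a small OR/AND network, so fewer intermediate states translate directly into fewer logic resources.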
In summary, the algorithm that a typical regular expression compiler uses to convert the linked-list form of a regular expression into a finite state machine differs from the one used in the present invention. The conversion method proposed here effectively reduces the number of intermediate states produced while generating the finite state machine and simplifies the resulting regular expression circuit, so a regular expression can be quickly converted into a circuit structure and mapped accurately onto an FPGA.
The present invention provides a new algorithm for converting regular expressions into finite state machines. The conversion can be carried out in many ways, and the conversion algorithm described here is one preferred embodiment. The above embodiment is illustrative and is not to be construed as limiting the invention; those skilled in the art can make improvements and modifications without departing from the principle of the invention, and such improvements and modifications are also considered to fall within the scope of protection of the invention.
Claims (5)
1. A method of regular expression matching using the new construction algorithm, characterized in that it comprises the following steps:
Step one: convert an arbitrary regular expression into the parse tree form of the regular expression;
Step two: decompose the resulting parse tree form into linked-list form;
Step three: using the basic symbol construction rules of the proposed conversion algorithm, traverse the regular expression in linked-list form and convert it into a finite state machine;
Step four: from the finite state machine, using the circuit units for the basic symbols of regular expressions, generate a modular engine circuit suitable for FPGAs.
2. Step one according to claim 1, comprising the following steps:
1) generating, by software, an arbitrary regular expression based on the PCRE rule set;
2) analyzing the given regular expression and converting it into the corresponding regular expression parse tree form.
3. Step two according to claim 1, comprising the following:
1) in the parse tree form from step one, repeated character matches produce a large number of identical nodes, which makes the tree unnecessarily deep and increases the number of recursive calls; the parse tree also does not fully support the Perl syntax extensions, so it needs to be converted into the linked-list form of the regular expression;
2) in the linked-list form, each list node consists of four parts: val, rep, next and child. The val part is usually a character class, a bracket, a union symbol (|) or an extended PCRE syntax element; the rep part usually holds a bounded repetition count ({m,n}) or the closure symbol (*); the next part points to the next node of the list, which in a regular expression is generally joined by the concatenation symbol or by the union symbol (|); the child part is the sub-list of that node.
4. Step three according to claim 1, wherein the proposed new algorithm comprises the following:
1) the algorithm includes state diagram construction rules for the basic symbols of regular expressions, namely the state transition diagrams for the concatenation symbol, the union symbol (|), the closure symbol (*), the question mark symbol (?), the plus symbol (+) and the {m,n} symbol;
2) in theory, the state transition diagram of any regular expression can be composed from the basic state diagrams in 1), so the regular expression matching circuit can be designed modularly. Starting from the first node of the regular expression linked list, the proposed algorithm processes the information in each list node and builds a directed finite state graph until the last node is reached, finally producing the finite state machine of the regular expression;
3) the processing of list node information by the proposed algorithm mainly includes the following four cases:
First, when the val part of the current list node is the union symbol | (i.e. tcur.val = "|") and the node has a sub-expression, i.e. a child part (tcur.child), the nodes of the sub-expression are connected to the parent node through empty-character (ε) transitions;
Second, for a node with the closure symbol (*) or the plus symbol (+), a pseudo state p is created temporarily to stand for the source state of the feedback loop; the pseudo state can be deleted afterwards;
Third, for a list node with the closure symbol (*), the incoming empty-character transition is passed directly to the node that follows it;
Fourth, a bounded repetition count can be converted directly into states of the NFA matching engine.
5. The finite state machine finally obtained by the algorithm according to claim 4 is relatively concise and facilitates generating a modular regular expression matching circuit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710396008.8A CN107193776A (en) | 2017-05-24 | 2017-05-24 | A kind of new transfer algorithm for matching regular expressions |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710396008.8A CN107193776A (en) | 2017-05-24 | 2017-05-24 | A kind of new transfer algorithm for matching regular expressions |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107193776A true CN107193776A (en) | 2017-09-22 |
Family
ID=59876294
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710396008.8A Pending CN107193776A (en) | 2017-05-24 | 2017-05-24 | A kind of new transfer algorithm for matching regular expressions |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107193776A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110324204A (en) * | 2019-07-01 | 2019-10-11 | 中国人民解放军陆军工程大学 | A kind of high speed regular expression matching engine realized in FPGA and method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101201836A (en) * | 2007-09-04 | 2008-06-18 | 浙江大学 | Method for matching in speedup regular expression based on finite automaton containing memorization determination |
US20100138367A1 (en) * | 2007-08-02 | 2010-06-03 | Nario Yamagaki | SYSTEM, METHOD, AND PROGRAM FOR GENERATING NON-DETERMINISTIC FINITE AUTOMATON NOT INCLUDING e-TRANSITION |
US20110022617A1 (en) * | 2008-03-19 | 2011-01-27 | Norio Yamagaki | Finite automaton generation system for string matching for multi-byte processing |
Non-Patent Citations (3)
Title |
---|
敬茂华,杨义先,于长永,辛阳: "一种构造正则表达式更小ε-NFA的方法", 《东北大学学报(自然科学版)》 * |
殷珍珍: "《基于正则表达式的多模式匹配算法研究》", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
胥清化: "《基于正则表达式的高速协议识别研究与实现》", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110324204A (en) * | 2019-07-01 | 2019-10-11 | 中国人民解放军陆军工程大学 | A kind of high speed regular expression matching engine realized in FPGA and method |
CN110324204B (en) * | 2019-07-01 | 2020-09-11 | 中国人民解放军陆军工程大学 | High-speed regular expression matching engine and method implemented in FPGA (field programmable Gate array) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170922 |