CN104899264B - Multi-pattern regular expression matching method and device - Google Patents
- Publication number: CN104899264B
- Application number: CN201510262867.9A
- Authority: CN (China)
- Prior art keywords: character string, regular expression, precise character string, layer, superset
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a multi-pattern regular expression matching method and device. The method includes: filtering the data to be matched according to a pre-established first-layer filtering feature set to obtain the data fragments of the first-layer filtering and the precise character strings that were hit; looking up the corresponding regular expression superset according to each hit precise character string, and performing second-layer filtering on the data fragments of the first-layer filtering according to the regular expression superset to obtain the data fragments of the second-layer filtering and the regular expression supersets that were hit; and determining the corresponding regular expression according to each hit regular expression superset and matching the data fragments of the second-layer filtering with that regular expression. Through two layers of filtering, the technical solution improves filtering speed and filtering effect and thereby keeps matching performance stable: attack data is guaranteed to pass the filters, while clean data is kept from passing as far as possible.
Description
Technical field
This application relates to the technical field of network security, and in particular to a multi-pattern regular expression matching method and device.
Background technology
A regular expression is a form of expression for describing character strings. Because it is both flexible and precise, it is widely used in the field of network security, where it often describes network data with attack intent. An intrusion detection system usually contains a set of regular expressions describing a large number of attack signatures. During detection, the regular expression set is matched against the network data stream by multi-pattern regular expression matching in order to find attacks. As the Internet develops, network services multiply, network environments grow more complex, and data stream bandwidth keeps increasing, so multi-pattern regular expression matching must accommodate more, and more complex, regular expressions while matching faster and using less storage.
Traditional multi-pattern regular expression matching methods fall into three classes. The first class matches with a nondeterministic finite automaton (NFA); it needs little storage, but the number of active states is indeterminate and matching is usually slow. The second class matches with a deterministic finite automaton (DFA); it matches quickly, but for large rule sets or regular expressions written in certain complex ways it may suffer state explosion, so that automaton construction takes too long or even exhausts memory and cannot complete. The third class first performs pre-filter matching with multi-pattern exact string matching or with an extended automaton; a pre-filter hit only indicates that a successful match may exist nearby, which is then confirmed with an NFA or DFA. Compared with the first two classes, the third class scales more easily and is therefore the one commonly used at present. This method, also called the pre-filter matching method, specifically includes:
Character string filtering is performed on the data stream to be matched; when a keyword in the data stream shares at least one feature with the preset feature words, the data stream passes the character string filter. Regular expression matching is then performed on the data stream that passed. Because the character strings in this method are extracted directly from the regular expressions, neither their length nor their number can guarantee filtering quality. For example, when the strings extracted from one or more of the regular expressions are short or lack distinctiveness, the filter performs poorly, a huge volume of data reaches regular expression matching, and overall matching performance suffers seriously.
Summary of the invention
The technical problem to be solved by the present invention is to provide a multi-pattern regular expression matching method that improves filtering speed and filtering effect through two layers of filtering, and thereby keeps matching performance stable.
The present invention also provides a multi-pattern regular expression matching device to ensure that the above method can be implemented and applied in practice.
In one aspect, the present invention provides a multi-pattern regular expression matching method, which includes:
Filtering the data to be matched according to a pre-established first-layer filtering feature set to obtain the data fragments of the first-layer filtering and the precise character strings that were hit; the first-layer filtering feature set includes one precise character string, extracted from each regular expression, whose length is greater than a preset threshold;
Looking up the corresponding regular expression superset according to each hit precise character string, and performing second-layer filtering on the data fragments of the first-layer filtering according to the regular expression superset to obtain the data fragments of the second-layer filtering and the regular expression supersets that were hit; a regular expression superset is an expression composed according to the logical relations between the precise character strings and the fuzzy character strings of a regular expression;
Determining the corresponding regular expression according to each hit regular expression superset, and matching the data fragments of the second-layer filtering with that regular expression.
Preferably, the first-layer filtering feature set is established as follows:
Each regular expression is split to obtain its precise character strings and fuzzy character strings;
The precise character strings whose length is greater than the preset threshold are selected from the precise character strings of each regular expression, and the selected strings are combined into a candidate string set;
According to the priority order of the precise character strings, one precise character string is selected from the candidate string set for each regular expression, and the selected strings are combined into the first-layer filtering feature set.
Preferably, after each regular expression is split to obtain its precise character strings and fuzzy character strings, the method further includes:
Determinizing the fuzzy character strings and merging them with the adjacent precise character string fragments.
Preferably, selecting one precise character string for each regular expression from the candidate string set according to the priority order of the precise character strings, and combining the selected strings into the first-layer filtering feature set, includes:
Setting the priority order of the precise character strings of each regular expression in the candidate string set according to string length, and adjusting the priority order during use according to the results of the first-layer and second-layer filtering; selecting the highest-priority precise character string from the precise character strings of each regular expression, and combining the selected strings into the first-layer filtering feature set.
Preferably, the regular expression superset is generated as follows:
The regular expression is split to obtain precise character strings and fuzzy character strings, each fuzzy character string is replaced with a logical relation symbol, and the regular expression superset is generated from the precise character strings and the logical relation symbols; a logical relation symbol characterizes the logical relation between a fuzzy character string and the precise character strings adjacent to it.
In another aspect, the present invention provides a multi-pattern regular expression matching device, which includes:
A first-layer filtering unit, configured to filter the data to be matched according to a pre-established first-layer filtering feature set to obtain the data fragments of the first-layer filtering and the precise character strings that were hit; the first-layer filtering feature set includes one precise character string, extracted from each regular expression, whose length is greater than a preset threshold;
A second-layer filtering unit, configured to look up the corresponding regular expression superset according to each hit precise character string, and to perform second-layer filtering on the data fragments of the first-layer filtering according to the regular expression superset to obtain the data fragments of the second-layer filtering and the regular expression supersets that were hit; a regular expression superset is an expression composed according to the logical relations between the precise character strings and the fuzzy character strings of a regular expression;
A matching unit, configured to determine the corresponding regular expression according to each hit regular expression superset, and to match the data fragments of the second-layer filtering with that regular expression.
Preferably, the device further includes:
A first-layer filtering feature set generation unit, configured to generate the first-layer filtering feature set;
The first-layer filtering feature set generation unit includes:
A string splitting subunit, configured to split each regular expression to obtain its precise character strings and fuzzy character strings;
A candidate string set generation subunit, configured to select, from the precise character strings of each regular expression, those whose length is greater than the preset threshold, and to combine the selected strings into a candidate string set;
A first-layer filtering feature set generation subunit, configured to select one precise character string from the candidate string set for each regular expression according to the priority order of the precise character strings, and to combine the selected strings into the first-layer filtering feature set.
Preferably, the first-layer filtering feature set generation unit further includes:
A determinization subunit, configured to determinize the fuzzy character strings and merge them with the adjacent precise character string fragments.
Preferably, the first-layer filtering feature set generation subunit is specifically configured to:
Set the priority order of the precise character strings of each regular expression in the candidate string set according to string length, and adjust the priority order during use according to the results of the first-layer and second-layer filtering; select the highest-priority precise character string from the precise character strings of each regular expression, and combine the selected strings into the first-layer filtering feature set.
Preferably, the device further includes:
A regular expression superset generation unit, configured to split the regular expression to obtain precise character strings and fuzzy character strings, replace each fuzzy character string with a logical relation symbol, and generate the regular expression superset from the precise character strings and the logical relation symbols; a logical relation symbol characterizes the logical relation between a fuzzy character string and the precise character strings adjacent to it.
Compared with the prior art, the present invention has the following advantages:
The present invention improves the filtering effect by means of two layers of filtering. Specifically, the data to be matched is filtered according to a pre-established first-layer filtering feature set to obtain the data fragments of the first-layer filtering and the precise character strings that were hit; the first-layer filtering feature set includes one precise character string, extracted from each regular expression, whose length is greater than a preset threshold. The first-layer filtering feature set is chosen on the principle that the fewer the strings and the longer each string, the better, so only one string is selected from each regular expression. This differs from the exact strings of the prior art and lets the first-layer filtering reduce the pass rate of clean data; its main role is to maximize filtering speed.
When the first-layer filtering is complete, the second-layer filtering is performed. Specifically, the corresponding regular expression superset is looked up according to each hit precise character string, and second-layer filtering is performed on the data fragments of the first-layer filtering according to the regular expression superset to obtain the data fragments of the second-layer filtering and the regular expression supersets that were hit; a regular expression superset is an expression composed according to the logical relations between the precise character strings and the fuzzy character strings of a regular expression. Because the second-layer filtering uses regular expression supersets, whose descriptive power is already very close to that of the original regular expressions, it can exclude unmatched data as far as possible and filter out the data fragments closest to the original regular expressions, which prevents a large amount of muddy data from reaching the final regular expression matching stage. The second-layer filtering mainly serves to maximize the filtering effect and so indirectly improves matching efficiency. It also decomposes multi-pattern matching into single-pattern matching in different scenarios (different data fragments), which avoids the problems of DFA size explosion and slow NFA matching. Finally, each data fragment obtained by the second-layer filtering is matched individually against its corresponding regular expression. The technical solution of the present invention therefore improves filtering speed and filtering effect through two layers of filtering and keeps matching performance stable: attack data is guaranteed to pass the filters, while clean data is kept from passing as far as possible.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of an embodiment of the multi-pattern regular expression matching method of the present invention;
Fig. 2 is a flow chart of the method for generating the first-layer filtering feature set provided by the present invention;
Fig. 3 is a structural diagram of an embodiment of the multi-pattern regular expression matching device of the present invention.
Specific embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The present application can be used in numerous general-purpose or special-purpose computing device environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor devices, and distributed computing environments that include any of the above devices.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
Referring to Fig. 1, which is a flow chart of an embodiment of the multi-pattern regular expression matching method of the present invention, the method may include:
S101: filter the data to be matched according to the pre-established first-layer filtering feature set to obtain the data fragments of the first-layer filtering and the precise character strings that were hit; the first-layer filtering feature set includes one precise character string, extracted from each regular expression, whose length is greater than a preset threshold.
The first-layer filtering feature set should be generated in advance, before the matching process runs. It contains exactly one precise character string for each regular expression, and that string must be longer than the preset threshold.
As for how to generate the first-layer filtering feature set, the present invention also provides a concrete implementation. Referring to Fig. 2, which is a flow chart of the method for generating the first-layer filtering feature set provided by the present invention, the method includes:
S201: split each regular expression to obtain its precise character strings and fuzzy character strings.
It should be noted that a regular expression may take different forms depending on its grammar and generally contains elements such as letters, digits, escape sequences, and special syntax symbols. Under a given splitting condition, a regular expression can be divided into two types of character strings, referred to as precise character strings and fuzzy character strings; the number of strings of each type obtained by splitting is not fixed.
During implementation, the inventors observed that the longer the precise character strings, the more efficient the first-layer filtering, and therefore further propose an implementation that makes the precise character strings as long as possible without reducing the semantic coverage. Specifically, the following step is added after S201:
Determinize the fuzzy character strings and merge them with the adjacent precise character string fragments. The subsequent steps are then carried out on the merged precise character strings.
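As a rough illustration of the determinization step, the Python sketch below treats a fuzzy character string as determinizable when it can take only one value: an escaped punctuation character, a single-character class, or a fixed repetition of one literal character. These three cases, the (kind, text) pair representation, and the names determinize and merge_parts are assumptions made for illustration; the source does not specify which fuzzy character strings are determinized or how.

```python
import re


def determinize(fuzzy):
    """Return the only text a fuzzy string can match, or None if it is truly fuzzy."""
    m = re.fullmatch(r"\\([^A-Za-z0-9])", fuzzy)  # escaped punctuation, e.g. \.
    if m:
        return m.group(1)
    m = re.fullmatch(r"\[([^\]\\^-])\]", fuzzy)  # single-character class, e.g. [x]
    if m:
        return m.group(1)
    # fixed repetition of one literal character, e.g. a{3}
    m = re.fullmatch(r"([^\\\[\](){}.^$|+?*])\{(\d+)\}", fuzzy)
    if m:
        return m.group(1) * int(m.group(2))
    return None


def merge_parts(parts):
    """Determinize fuzzy parts where possible and merge adjacent precise parts."""
    merged = []
    for kind, text in parts:
        if kind == "fuzzy":
            literal = determinize(text)
            if literal is not None:
                kind, text = "precise", literal
        if kind == "precise" and merged and merged[-1][0] == "precise":
            # Extend the previous precise string instead of starting a new one.
            merged[-1] = ("precise", merged[-1][1] + text)
        else:
            merged.append((kind, text))
    return merged
```

With this sketch, the parts "cmd", "\.", "exe" collapse into the single precise character string "cmd.exe", which is longer than either original fragment and therefore a better first-layer candidate.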
After S201 is completed, S202 and S203 are carried out.
S202: from the precise character strings of each regular expression, select those whose length is greater than the preset threshold, and combine the selected strings into the candidate string set.
S203: according to the priority order of the precise character strings, select one precise character string from the candidate string set for each regular expression, and combine the selected strings into the first-layer filtering feature set.
In a specific implementation, S203 can rank priority from high to low by descending precise character string length, and then select for each regular expression its highest-priority precise character string (the longest one) to form the first-layer filtering feature set.
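Steps S202 and S203 can be sketched in Python as follows. The dictionary layout, the name build_first_layer_set, and the choice to leave out a regular expression that has no string above the threshold are assumptions for illustration; the source does not say how such a regular expression is handled.

```python
def build_first_layer_set(precise_by_regex, min_length):
    """Build the candidate string set (S202) and the first-layer feature set (S203).

    precise_by_regex maps a regular expression id to its precise character strings.
    """
    candidates = {}
    for regex_id, strings in precise_by_regex.items():
        # S202: keep only strings strictly longer than the threshold.
        kept = [s for s in strings if len(s) > min_length]
        # Priority order: longest first (the sort is stable for equal lengths).
        kept.sort(key=len, reverse=True)
        if kept:
            candidates[regex_id] = kept
    # S203: one string per regular expression, the highest-priority one.
    feature_set = {regex_id: kept[0] for regex_id, kept in candidates.items()}
    return candidates, feature_set


rules = {
    "r1": ["select", "from", "users"],
    "r2": ["/etc/passwd", "cat"],
    "r3": ["ab"],
}
candidates, feature_set = build_first_layer_set(rules, min_length=3)
```

Here r1 contributes "select" and r2 contributes "/etc/passwd"; r3 has no string longer than three characters and so has no entry in this sketch.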
To further safeguard the first-layer filtering effect, the inventors also propose adjusting the first-layer filtering feature set adaptively according to the actual filtering results. The specific implementation is:
Set the priority order of the precise character strings of each regular expression in the candidate string set according to string length, and adjust the priority order during use according to the results of the first-layer and second-layer filtering; select the highest-priority precise character string from the precise character strings of each regular expression, and combine the selected strings into the first-layer filtering feature set.
This implementation can be understood as follows. In the initial stage, the first-layer filtering feature set directly takes the highest-priority (longest) precise character string of each regular expression. Later, the priority order is adjusted according to the results of the first-layer and second-layer filtering: if a precise character string is hit during first-layer filtering but the data is not hit during second-layer filtering, the priority of that precise character string is set to the lowest, and the precise character string next to it in order in the candidate string set is promoted to the highest priority. Rotating in this way throughout the matching process lets the more distinctive precise character strings remain elements of the first-layer filtering feature set for as long as possible, so that the actual effect of the first-layer filtering stays optimal.
In a specific implementation, the priority order of the precise character strings in the candidate string set can be adjusted dynamically in the manner described above, so that the first-layer filtering feature set changes dynamically with the actual filtering results and meets the needs of the first-layer filtering.
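The rotation described above can be sketched in Python with one double-ended queue per regular expression, where the head of the queue is the string currently used by the first-layer filtering. The deque representation and the name demote_on_false_hit are assumptions for illustration, not details given in the source.

```python
from collections import deque


def demote_on_false_hit(candidates, regex_id):
    """Demote the current first-layer string of regex_id and promote the next one.

    Call this when the string hit in the first-layer filtering but the
    second-layer filtering did not hit. Returns the new first-layer string.
    """
    queue = candidates[regex_id]
    queue.rotate(-1)  # the head (highest priority) moves to the tail (lowest)
    return queue[0]


candidates = {"r1": deque(["select", "users", "from"])}
new_feature = demote_on_false_hit(candidates, "r1")
```

After one false hit on "select", the feature for r1 becomes "users" and "select" waits at the end of the queue.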
To aid understanding of step S101, it is explained further below using a concrete scenario.
In attack detection applications, the regular expressions in use generally have clearly distinctive features. The present invention exploits the difference between the data described by the regular expressions and ordinary normal data to filter well and greatly improve detection efficiency.
In essence, the first-layer filtering matches the most distinctive precise character string of each regular expression, so that "clean" data need not enter the subsequent in-depth detection and the later steps receive more accurate and smaller volumes of "muddy" data. Normally most network data is "clean", so after the first-layer filtering this data is filtered out and does not enter the subsequent second-layer filtering and matching, which greatly simplifies later processing. Here "clean" data means non-suspicious network data and "muddy" data means suspicious network data.
The first-layer filtering may yield one data fragment and one hit precise character string, or several data fragments and several hit precise character strings, or no data fragment and no hit precise character string at all. If no data fragment is obtained, the data to be matched does not satisfy the matching rules and can be regarded as clean data; the subsequent steps are skipped and this matching process ends directly.
When the first-layer filtering yields data fragments, the process continues with S102 and S103.
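A minimal Python sketch of the first-layer filtering follows. It scans naively with str.find; a practical system would more likely use a dedicated multi-pattern string matcher. The definition of a data fragment as a fixed window around each hit, the window parameter, and the name first_layer_filter are assumptions for illustration, since the source does not define how fragments are cut.

```python
def first_layer_filter(data, feature_set, window=64):
    """Return (regex_id, hit_string, fragment) for every feature string found in data.

    feature_set maps a regular expression id to its single precise character string.
    An empty result means the data is treated as clean and matching stops here.
    """
    hits = []
    for regex_id, needle in feature_set.items():
        pos = data.find(needle)
        while pos != -1:
            # Assumed fragment shape: the hit plus `window` characters on each side.
            start = max(0, pos - window)
            end = min(len(data), pos + len(needle) + window)
            hits.append((regex_id, needle, data[start:end]))
            pos = data.find(needle, pos + 1)
    return hits


feature_set = {"r1": "select", "r2": "/etc/passwd"}
request = "GET /index.html?id=1 union select password from users HTTP/1.1"
hits = first_layer_filter(request, feature_set, window=10)
```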
Steps S102 and S103 of this embodiment are described next.
S102: look up the corresponding regular expression superset according to each hit precise character string, and perform second-layer filtering on the data fragments of the first-layer filtering according to the regular expression superset to obtain the data fragments of the second-layer filtering and the regular expression supersets that were hit; a regular expression superset is an expression composed according to the logical relations between the precise character strings and the fuzzy character strings of a regular expression.
The regular expression superset should be generated in advance, before the second-layer matching runs; each regular expression has one corresponding regular expression superset. As for how to generate the regular expression superset, the present invention also provides a concrete implementation, which includes:
Split the regular expression to obtain precise character strings and fuzzy character strings, replace each fuzzy character string with a logical relation symbol, and generate the regular expression superset from the precise character strings and the logical relation symbols; a logical relation symbol characterizes the logical relation between a fuzzy character string and the precise character strings adjacent to it.
It should be noted that "superset" is understood here in the sense of data sets: the superset of a data set necessarily contains all the elements of that data set and the logical relations between them. The range of elements covered by a regular expression superset is much larger than the range covered by the regular expression itself. As the generation method shows, fuzzy character strings are replaced with logical relation symbols; because a logical relation symbol covers a much larger range than the fuzzy character string it replaces, the resulting superset covers a much larger range than the regular expression. Because the character strings of a regular expression have a certain order and a certain data length, the logical relation symbols used to build a superset of suitable range can take forms such as ordered AND, AND, OR, and length check symbols.
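The generation step can be sketched in Python for a deliberately narrow case: a regular expression whose only alternation is at the top level and whose precise character strings are all mandatory. Under those assumptions, a fuzzy character string that is exactly "|" becomes an OR and every other fuzzy character string becomes an ordered AND. The (kind, text) input format, the nested-list output, and the name build_superset are illustrative assumptions; length check symbols are not modeled.

```python
def build_superset(parts):
    """Replace fuzzy strings with logical relations.

    parts is a list of ("precise", text) or ("fuzzy", text) pairs in pattern order.
    Returns a list of alternatives (OR); each alternative is a list of precise
    strings that must appear in that order (ordered AND).
    """
    alternatives = [[]]
    for kind, text in parts:
        if kind == "precise":
            alternatives[-1].append(text)
        elif text == "|":
            alternatives.append([])  # OR: start a new alternative
        # Any other fuzzy string is dropped; the list order keeps the ordered AND.
    # An empty alternative is kept on purpose: it places no constraint,
    # so the result stays a superset of the original expression.
    return alternatives


# Parts of the pattern select.+from|union
parts = [
    ("precise", "select"), ("fuzzy", ".+"), ("precise", "from"),
    ("fuzzy", "|"), ("precise", "union"),
]
superset = build_superset(parts)
```

For this input the superset reads as ("select" then "from") OR "union", which accepts everything the original expression accepts and more.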
In essence, the second-layer filtering is a word matching process followed by a semantic matching process. Word matching means matching against the precise character strings in the regular expression supersets. Semantic matching means determining from the word matching results which superset syntax trees to match against and traversing each such tree; when a syntax tree is hit, the data fragment passes the second-layer filtering and the regular expression superset is regarded as hit. A syntax tree is essentially another representation of a regular expression superset: the precise character strings of the superset are its terminal nodes and the logical relation symbols of the superset are its nonterminal nodes.
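The syntax tree traversal can be sketched in Python as below. Terminal nodes hold precise character strings and nonterminal nodes hold a logical relation; only ordered AND ("SEQ") and OR are modeled. The tuple encoding of nodes and the name earliest_end are assumptions for illustration. The function returns the earliest position at which the node can finish matching, which is enough to decide an ordered AND greedily.

```python
def earliest_end(node, fragment, start=0):
    """Return the earliest end offset at which node matches at or after start, else None."""
    kind = node[0]
    if kind == "STR":  # terminal node: a precise character string
        pos = fragment.find(node[1], start)
        return None if pos == -1 else pos + len(node[1])
    if kind == "SEQ":  # ordered AND: children must appear one after another
        for child in node[1]:
            start = earliest_end(child, fragment, start)
            if start is None:
                return None
        return start
    if kind == "OR":  # at least one child must appear
        ends = [earliest_end(child, fragment, start) for child in node[1]]
        ends = [end for end in ends if end is not None]
        return min(ends) if ends else None
    raise ValueError(f"unknown node kind: {kind}")


tree = ("SEQ", [
    ("STR", "union"),
    ("OR", [("STR", "select"), ("STR", "insert")]),
    ("STR", "from"),
])
passes = earliest_end(tree, "id=1 union select a from t") is not None
```

A fragment passes the second-layer filtering in this sketch when the call on the root returns a position rather than None.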
To aid understanding of step S102, it is explained further below using a concrete scenario.
In attack detection applications, the second-layer filtering is performed after the first-layer filtering mainly to prevent the large amounts of "muddy" data that arise in complex network environments from clogging the final regular expression matching stage. The second-layer filtering therefore mainly screens out data that cannot hit and diverts it away directly, reducing the volume of data that reaches regular expression matching and avoiding DFA size explosion or slow NFA matching.
When the second-layer filtering is complete and has yielded some data fragments, the process enters S103 to perform final match confirmation on those fragments.
S103: determine the corresponding regular expression according to each hit regular expression superset, and match the data fragments of the second-layer filtering with that regular expression.
Because most of the data to be matched is "clean", most of it is diverted away during the first-layer and second-layer filtering, and very little data reaches the final stage. Moreover, the second-layer filtering has already determined which regular expression each data fragment is most likely to match, so multi-pattern matching has been decomposed into single-pattern matching in different scenarios (one single-pattern match per data fragment). On this basis, S103 can be implemented with either a nondeterministic finite automaton (NFA) or a deterministic finite automaton (DFA).
In addition, building on the two layers of filtering, the inventors propose that step S103 can complete the final match by constructing an automaton for only the fuzzy character string portion of the regular expression. The reason is that the first-layer and second-layer filtering have already established that the data fragment matches the precise character strings, so the last stage needs to match only the fuzzy character string portion.
Based on the foregoing, S103 can be implemented in either of the following two ways:
Mode (1): Construct a nondeterministic finite automaton or a deterministic finite automaton for the fuzzy strings in the regular expression corresponding to the hit regular expression superset, and use the constructed automaton to match the data fragments produced by second-layer filtering.
Mode (2): Construct a nondeterministic finite automaton or a deterministic finite automaton for the whole regular expression corresponding to the hit regular expression superset, and use the constructed automaton to match the data fragments produced by second-layer filtering.
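As a rough illustration of mode (1), the sketch below checks only the fuzzy part that lies between two precise character strings, on the assumption that the filtering layers have already confirmed those precise strings. The function name `match_fuzzy_only`, its parameters, and the single left/fuzzy/right layout are assumptions made for illustration and do not come from the patent; Python's `re` module (a backtracking engine) stands in for the NFA or DFA that the text describes.

```python
import re


def match_fuzzy_only(fragment, left, fuzzy, right):
    """Return True if `fragment` contains left + <fuzzy match> + right.

    Only the fuzzy part is handed to the regex engine; the precise
    strings `left` and `right` are located with plain substring search.
    """
    fuzzy_re = re.compile(fuzzy)
    start = fragment.find(left)
    while start != -1:
        gap_start = start + len(left)
        end = fragment.find(right, gap_start)
        while end != -1:
            # The text between the two precise strings must match the
            # fuzzy part completely.
            if fuzzy_re.fullmatch(fragment, gap_start, end):
                return True
            end = fragment.find(right, end + 1)
        start = fragment.find(left, start + 1)
    return False


print(match_fuzzy_only("xxabc123defyy", "abc", "[0-9]+", "def"))
```

Under the same assumptions, mode (2) would amount to compiling the whole regular expression (here `abc[0-9]+def`) and searching the fragment with it.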
The generation of the first-layer filtering feature set was described above with reference to Fig. 2. The implementation of S201 is explained further below through specific examples, mainly using PCRE regular expression syntax.
A regular expression is split mainly by using fuzzy factors as separators, dividing it into precise character strings and fuzzy strings. A fuzzy factor of a regular expression is a symbol, or combination of symbols, that does not definitely represent a single ASCII character.
According to the syntactic features of PCRE (a regular expression syntax), fuzzy factors can be defined as follows:
(1) Metacharacters: ".", "^", "$".
(2) Branch symbol: "|".
(3) Group separators: (..).
(4) Suffix symbols or phrases: "+", "?", "*", {m,n}, {m,}, {,n}, {m}.
(5) Character classes, such as [a-z], [^0-9], [[:alpha:]].
(6) Escape (reference) symbols: a backslash followed by a string described by the regular expression ([DdSsWwBbeZAabtnvfr0-9])|([Pp](({[a-zA-Z_]*})|([a-zA-Z]))); thus \d, \f, \n, \pF and \p{Name} are escape symbols.
(7) Advanced syntax for lazy matching and for modifying the match mode, such as "*?", "+?", "??", "{m,n}?", "\Q...\E", "(?i)", "(?-i)", "(?s)", "(?-s)", "(?m)", "(?-m)".
In a specific implementation, a fuzzy factor can also be a combination of any of the above forms.
Next, the fuzzy judgement conditions are defined:
(a) A suffix symbol or phrase is merged with its syntactic prefix into one fuzzy string; for example, the regular expression abc+d is split to yield the fuzzy string c+.
(b) A branch symbol, if used together with a group separator, causes all branches within the group to be treated as one fuzzy string; otherwise no split is made.
(c) Each remaining fuzzy factor acts as a separator and is itself a fuzzy string.
(d) Every string delimited by fuzzy substrings is a precise character string.
Finally, the regular expression is scanned according to the defined fuzzy judgement conditions and divided into an ordered sequence of precise character strings and fuzzy strings, and the positional relationship between the strings is retained.
Examples follow:
Example 1: The regular expression "20[01][\x09-\x0d -~]*(AUTHINFO USER|news)finger" can be split into three strings: "20", "[01][\x09-\x0d -~]*(AUTHINFO USER|news)" and "finger". The first string, "20", and the last, "finger", are precise character strings; the middle string, "[01][\x09-\x0d -~]*(AUTHINFO USER|news)", is a fuzzy string.
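The splitting rules (a) to (d) can be sketched for a deliberately small subset of the syntax. The following is a minimal illustration, assuming patterns without backslash escapes, nested groups, or top-level branches; the names `TOKEN` and `split_regex` are hypothetical and are not taken from the patent.

```python
import re

# One token = an atom (character class, non-nested group, or single char)
# optionally followed by a suffix symbol (+, *, ?, {m,n}, {m,}, {m}).
TOKEN = re.compile(r"(\[[^\]]*\]|\([^()]*\)|.)([+*?]|\{\d*,?\d*\})?", re.S)


def split_regex(pattern):
    """Split `pattern` into an ordered list of (kind, text) pairs,
    where kind is "precise" or "fuzzy"."""
    parts = []
    for atom, quant in TOKEN.findall(pattern):
        # Rule (a): a suffix makes the atom it follows fuzzy.
        # Rules (b)/(c): groups, classes and metacharacters are fuzzy.
        fuzzy = bool(quant) or atom[0] in "[(" or atom in ".^$"
        kind = "fuzzy" if fuzzy else "precise"
        text = atom + quant
        if parts and parts[-1][0] == kind:
            # Adjacent pieces of the same kind form one string.
            parts[-1] = (kind, parts[-1][1] + text)
        else:
            parts.append((kind, text))
    return parts


print(split_regex("abc+d"))
print(split_regex("20[01][ -~]*(AUTHINFO USER|news)finger"))
```

The second call uses a simplified form of Example 1 (without the `\x09-\x0d` range, since escapes are outside this sketch) and yields one precise string, one fuzzy string, and one precise string, in that order.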
After splitting, the fuzzy strings can be further determinized. The determinization process is explained below by example.
Determinization makes certain specifically structured fuzzy strings more definite. Its guiding principles are: the operation is simple, the semantic scope is not reduced, adjacent precise character strings are merged, and the number of split strings remains controllable. It can, for example, be carried out in the following steps:
(1) A fuzzy string containing a suffix symbol or phrase is determinized into a string that carries precise characters. For example, "a{m,n}", "a{m}" and "a{m,}" are determinized into "a...a{0,n-m}", "a...a{0}" and "a...a*" respectively, where the leading "a..." stands for a repeated m times. The precise characters produced by determinization are merged with the adjacent precise character strings obtained by splitting.
Example 2: The regular expression "abc{3,10}de" is split into "ab, c{3,10}, de". The fuzzy string c{3,10} is determinized into ccc followed by c{0,7}, and the precise part is then merged with the adjacent precise character string "ab" on its left, giving "abccc, c{0,7}, de".
Similarly, "a+" is determinized into "aa*", and the resulting precise character is merged with the adjacent precise character string obtained by splitting.
Example 3: The regular expression "abc+de" is split into "ab, c+, de". The fuzzy string "c+" is determinized into "cc*" and then, following the positional relationship of the split strings, merged with the adjacent precise character string "ab", giving "abc, c*, de".
Example 4: The regular expression "ab(cd)+de" is split into "ab, (cd)+, de". The fuzzy string "(cd)+" is determinized into (cd)(cd)*, and merging finally gives "ab, (cd), (cd)*, de".
(2) For a fuzzy string containing a group separator, if it contains no branch symbol "|" and no suffix symbol or phrase internally, the group separator can be deleted directly and the content merged as far as possible with the adjacent precise character strings. In this case, Example 4 above can be further determinized into "abcd, (cd)*, de".
(3) For a fuzzy string containing a branch symbol, try to extract the common prefix and common suffix strings and merge them as far as possible with the adjacent precise character strings.
Example 5: The regular expression "x(abcde|abcfe)y" is split into "x, (abcde|abcfe), y" and then determinized into "xabc, (d|f), ey".
If there is no common prefix or suffix string, but the number of branches is small and each branch has a fairly long precise character string, the fuzzy string can be expanded by branch.
For example, after branch expansion Example 1 above is determinized into "20, [01][\x09-\x0d -~]*, AUTHINFO USER, finger" or "20, [01][\x09-\x0d -~]*, news, finger".
With the determinization steps above, some fuzzy strings can be made more definite, which extends the length of the precise character strings.
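The determinization steps can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the split result is a list of (kind, text) pairs, that a quantified atom is either a single word character or a group of word characters, and that branches contain only literal text. The names `QUANTIFIED`, `determinize_quantified`, `determinize` and `factor_branches` are hypothetical.

```python
import os
import re

# A literal char or a group of literals, followed by +, {m}, {m,} or {m,n}.
QUANTIFIED = re.compile(r"(\((\w+)\)|\w)(\+|\{(\d+)(?:(,)(\d*))?\})")


def determinize_quantified(fuzzy):
    """Step (1): return (precise_prefix, remaining_fuzzy) for one fuzzy string."""
    m = QUANTIFIED.fullmatch(fuzzy)
    if m is None:
        return "", fuzzy                      # not determinizable here
    atom = m.group(1)
    literal = m.group(2) or atom              # step (2): drop the group separator
    if m.group(3) == "+":
        return literal, atom + "*"            # a+ -> a, a*
    low = int(m.group(4))
    if m.group(5) is None:
        return literal * low, ""              # a{m} -> m copies, nothing left
    if m.group(6) == "":
        return literal * low, atom + "*"      # a{m,} -> m copies, a*
    return literal * low, "%s{0,%d}" % (atom, int(m.group(6)) - low)


def determinize(parts):
    """Apply step (1) to every fuzzy part and merge adjacent precise strings."""
    out = []
    for kind, text in parts:
        pieces = [(kind, text)]
        if kind == "fuzzy":
            prefix, rest = determinize_quantified(text)
            pieces = [("precise", prefix), ("fuzzy", rest)]
        for k, t in pieces:
            if not t:
                continue
            if out and out[-1][0] == k == "precise":
                out[-1] = ("precise", out[-1][1] + t)
            else:
                out.append((k, t))
    return out


def factor_branches(group):
    """Step (3): '(abcde|abcfe)' -> ('abc', '(d|f)', 'e')."""
    branches = group[1:-1].split("|")
    prefix = os.path.commonprefix(branches)
    rest = [b[len(prefix):] for b in branches]
    suffix = os.path.commonprefix([r[::-1] for r in rest])[::-1]
    middle = [r[:len(r) - len(suffix)] for r in rest]
    return prefix, "(" + "|".join(middle) + ")", suffix


print(determinize([("precise", "ab"), ("fuzzy", "c{3,10}"), ("precise", "de")]))
print(factor_branches("(abcde|abcfe)"))
```

The first call reproduces Example 2 and the second reproduces the branch factoring used in Example 5; merging the returned prefix and suffix with the neighbouring "x" and "y" is left to the caller.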
In addition, regarding the generation of the regular expression superset mentioned above, the following describes how a regular expression superset is generated, taking regular expressions in PCRE syntax as an example.
The generation method of the regular expression superset described above includes: splitting a regular expression to obtain precise character strings and fuzzy strings, representing the fuzzy strings with logical relation symbols, and generating the regular expression superset from the precise character strings and the logical relation symbols. A logical relation symbol characterizes the logical relationship between a fuzzy string and its adjacent precise character strings.
For the splitting part of this method, refer to the description of S201 above. After the precise character strings and fuzzy strings are obtained, the fuzzy strings can be further determinized; fuzzy strings that cannot be determinized are deleted, and the data length range of those fuzzy strings is recorded. The strings are then recombined into a new expression according to their original positional and logical relationships. The data set described by this new expression is a superset of that of the original regular expression and is called the regular expression superset. In the present invention, one regular expression corresponds to one superset.
The regular expression superset admits a broader set of successfully matching data, and it eliminates the fuzzy-factor parts of the original regular expression, whose space complexity and time complexity in multi-pattern matching are both relatively high. It therefore matches more efficiently and can improve the performance of second-layer filtering.
In practical applications, a corresponding superset can be generated for regular expressions of different syntaxes according to the characteristics of the regular expression, and corresponding logical symbols or judgement symbols can be defined for common forms of fuzzy string. In general, these should include at least the following logical relation symbols: ordered-AND, AND, OR, and length judgement symbols.
An example is given below:
(1) The "ordered-AND" relation between strings is represented by "..". Example 6: The expression superset of the regular expression abc.*def is "abc".."def", which expresses matching the string "abc" first and then matching "def".
(2) The "OR" relation between strings is represented by "|".
For example, the expression superset of Example 1 above is "20".."AUTHINFO USER".."finger" or "20".."news".."finger"; after the OR symbol is introduced, it is further expressed as ("20".."AUTHINFO USER".."finger")|("20".."news".."finger").
(3) The "AND" relation between strings is represented by "&".
Preferably, ("A".."B")|("B".."A") can be optimized to "A"&"B". Example 7: The expression superset of the regular expression "abc.*def|def.*abc" is "abc"&"def".
(4) As in ordinary expressions, "()" raises the priority of strings. Where no priority needs to be raised, it can also be used for explicit grouping of strings, for example in combination with length judgement symbols. The length judgement symbols include: >, >=, <, <=, ==.
Example 8: The expression superset of the regular expression "abc{3}def" is ("abc".."def")==8, which expresses matching "abc" first, then matching "def", with a matched region of size 8.
The examples above are summarized in a table.
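One way to evaluate such superset expressions against a piece of data is sketched below. It is an illustration under stated assumptions rather than the patent's implementation: a superset is encoded as a nested tuple whose first element is one of the symbols "..", "|", "&" or a length judgement symbol, and evaluation tracks every possible matched region as a (start, end) pair so that the length judgement never rejects data the original regular expression would accept. The encoding and the names `occurrences`, `seq`, `spans` and `superset_hits` are hypothetical.

```python
import operator

LENGTH_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
              "<=": operator.le, "==": operator.eq}


def occurrences(needle, data):
    """All (start, end) regions where the precise string occurs."""
    found, i = set(), data.find(needle)
    while i != -1:
        found.add((i, i + len(needle)))
        i = data.find(needle, i + 1)
    return found


def seq(a, b):
    """Ordered-AND: a region from `a` followed by a later region from `b`."""
    return {(s1, e2) for s1, e1 in a for s2, e2 in b if s2 >= e1}


def spans(node, data):
    """Return every region of `data` that satisfies the superset `node`."""
    if isinstance(node, str):
        return occurrences(node, data)
    op, *args = node
    if op == "..":
        result = spans(args[0], data)
        for nxt in args[1:]:
            result = seq(result, spans(nxt, data))
        return result
    if op == "|":
        return set().union(*(spans(a, data) for a in args))
    if op == "&":
        a, b = (spans(x, data) for x in args)
        return seq(a, b) | seq(b, a)          # "A"&"B" = (A..B)|(B..A)
    if op in LENGTH_OPS:
        expr, size = args
        return {(s, e) for s, e in spans(expr, data)
                if LENGTH_OPS[op](e - s, size)}
    raise ValueError("unknown symbol: %r" % (op,))


def superset_hits(node, data):
    return bool(spans(node, data))


# Example 8: ("abc".."def")==8
print(superset_hits(("==", ("..", "abc", "def"), 8), "abcXabcccdef"))
```

Enumerating all regions is quadratic in the number of occurrences; a production filter would likely bound or prune this, which the sketch does not attempt.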
The multi-pattern regular expression matching method provided by the invention has been described above. The multi-pattern regular expression matching device provided by the invention is explained below.
Referring to Fig. 3, which is a structural diagram of an embodiment of the multi-pattern regular expression matching device of the invention, the device may include:
First-layer filter unit 301, configured to filter the data to be matched according to a pre-established first-layer filtering feature set to obtain first-layer-filtered data fragments and the hit precise character strings; the first-layer filtering feature set includes one precise character string, extracted from each regular expression, whose length exceeds a preset threshold;
Second-layer filter unit 302, configured to look up the corresponding regular expression superset according to the hit precise character strings, and to perform second-layer filtering on the first-layer-filtered data fragments according to the regular expression superset to obtain second-layer-filtered data fragments and the hit regular expression superset; the regular expression superset is an expression composed according to the logical relationship between the precise character strings and the fuzzy strings of a regular expression;
Matching unit 303, configured to determine the corresponding regular expression according to the hit regular expression superset, and to match the second-layer-filtered data fragments using that regular expression.
Preferably, the device further includes:
A first-layer filtering feature set generation unit, configured to generate the first-layer filtering feature set;
The first-layer filtering feature set generation unit includes:
A string splitting subunit, configured to split each regular expression to obtain the corresponding precise character strings and fuzzy strings;
A candidate string set generation subunit, configured to select, from the precise character strings corresponding to each regular expression, those whose length exceeds the preset threshold, and to combine the selected precise character strings into a candidate string set;
A first-layer filtering feature set generation subunit, configured to select from the candidate string set, according to the priority order of the precise character strings, one precise character string for each regular expression, and to combine the selected strings into the first-layer filtering feature set.
Preferably, the first-layer filtering feature set generation unit further includes:
A determinization subunit, configured to determinize the fuzzy strings and merge the result with adjacent precise character string fragments.
Preferably, the first-layer filtering feature set generation subunit is specifically configured to:
Set the priority order of the precise character strings corresponding to each regular expression in the candidate string set according to string length, the priority order being adjusted during use according to the results of first-layer and second-layer filtering; and select, from the precise character strings corresponding to each regular expression, the one with the highest priority, combining the selected strings into the first-layer filtering feature set.
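A minimal sketch of this selection follows. The text does not say how the priority is adjusted from filtering results, so the `penalty` mapping below is purely an assumed mechanism, and the function name `select_first_layer_features` and the dictionary layout are likewise hypothetical.

```python
def select_first_layer_features(precise_by_regex, min_len, penalty=None):
    """Pick one precise string per regular expression.

    precise_by_regex: {regex_id: [precise strings from splitting]}
    min_len: preset threshold; only strings longer than it are candidates
    penalty: assumed feedback from filtering results, lowering the
             priority of strings that proved to be poor filters
    """
    penalty = penalty or {}
    features = {}
    for regex_id, strings in precise_by_regex.items():
        candidates = [s for s in strings if len(s) > min_len]
        if candidates:
            # Priority defaults to string length: longer is better.
            features[regex_id] = max(
                candidates, key=lambda s: len(s) - penalty.get(s, 0))
    return features


rules = {"r1": ["20", "news", "finger"], "r2": ["abccc", "de"]}
print(select_first_layer_features(rules, min_len=2))
```

A regular expression with no sufficiently long precise string simply gets no entry here; how such expressions are handled is outside this sketch.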
Preferably, the device further includes:
A regular expression superset generation unit, configured to split a regular expression to obtain precise character strings and fuzzy strings, represent the fuzzy strings with logical relation symbols, and generate the regular expression superset from the precise character strings and the logical relation symbols; a logical relation symbol characterizes the logical relationship between a fuzzy string and its adjacent precise character strings.
The present invention improves the filtering effect by means of two layers of filtering. Specifically, the data to be matched is filtered according to a pre-established first-layer filtering feature set to obtain first-layer-filtered data fragments and the hit precise character strings. The first-layer filtering feature set includes one precise character string, extracted from each regular expression, whose length exceeds a preset threshold. Because the precise character strings in the first-layer filtering feature set are extracted according to length, unlike the exact strings used in the prior art, first-layer filtering can reduce the pass rate of clean data. And because only one precise character string per regular expression participates in the first-layer filtering feature set, the filtering rate can be guaranteed.
When first-layer filtering completes, second-layer filtering is carried out. Specifically, the corresponding regular expression superset is looked up according to the hit precise character strings, and second-layer filtering is performed on the first-layer-filtered data fragments according to that superset to obtain second-layer-filtered data fragments and the hit regular expression superset. The regular expression superset is an expression composed according to the logical relationship between the precise character strings and the fuzzy strings of a regular expression. Because second-layer filtering uses the regular expression superset, whose descriptive power is already very close to that of the original regular expression, it can exclude non-matching data as far as possible and filter out the data fragments closest to the original regular expression. This keeps large amounts of "turbid" data from entering the final regular expression matching stage and thereby indirectly improves matching efficiency. At the same time, multi-pattern matching is decomposed into single-pattern matching in different scenarios (different data fragments), which avoids DFA size explosion and excessively slow NFA matching. Finally, each data fragment obtained by second-layer filtering is matched individually against its corresponding regular expression. The technical solution of the invention therefore uses two layers of filtering to improve the filtering rate and filtering effect, and in turn to keep matching performance stable: while attack data is guaranteed to pass the filters, clean data is kept from passing as far as possible.
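The overall flow can be sketched end to end as below. This is a simplified illustration under several assumptions: the rule table `RULES`, its two sample rules, and the function `match_all` are invented for the example; the first layer uses plain substring tests where a real system would likely scan once with a multi-string algorithm; the second layer supports only the ordered-AND relation; and Python's `re` module stands in for the final NFA or DFA matcher.

```python
import re

# rule id -> (first-layer precise string,
#             superset as an ordered list of precise strings,
#             full regular expression for the final match)
RULES = {
    "r1": ("finger", ["abc", "finger"], re.compile(r"abc\d+finger")),
    "r2": ("select", ["union", "select"], re.compile(r"union\s+select")),
}


def match_all(data):
    """Return the ids of the rules whose regular expression matches `data`."""
    hits = []
    for rule_id, (feature, ordered, regex) in RULES.items():
        # Layer 1: one precise string per rule; most clean data stops here.
        if feature not in data:
            continue
        # Layer 2: ordered-AND superset; each string must follow the last.
        pos = 0
        for s in ordered:
            pos = data.find(s, pos)
            if pos == -1:
                break
            pos += len(s)
        else:
            # Final stage: single-pattern match with the rule's own regex.
            if regex.search(data):
                hits.append(rule_id)
    return hits


print(match_all("abc42finger"))
print(match_all("finger then abc"))
```

In the second call the first layer is hit but the superset is not, so the full regular expression is never run on that data.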
It should be noted that the embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be understood by reference to one another. Because the device embodiments are basically similar to the method embodiments, they are described more briefly; for related details, refer to the corresponding parts of the method embodiments.
Finally, it should also be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes that element.
The multi-pattern regular expression matching method and device provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is intended only to help in understanding the method of this application and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.
Claims (10)
- 1. A multi-pattern regular expression matching method, characterized in that the method comprises: filtering data to be matched according to a pre-established first-layer filtering feature set to obtain first-layer-filtered data fragments and hit precise character strings, wherein the first-layer filtering feature set includes one precise character string, extracted from each regular expression, whose length exceeds a preset threshold; looking up the corresponding regular expression superset according to the hit precise character strings, and performing second-layer filtering on the first-layer-filtered data fragments according to the regular expression superset to obtain second-layer-filtered data fragments and the hit regular expression superset, wherein the regular expression superset is an expression composed according to the logical relationship between the precise character strings and the fuzzy strings of a regular expression; and determining the corresponding regular expression according to the hit regular expression superset, and matching the second-layer-filtered data fragments using that regular expression.
- 2. The method according to claim 1, characterized in that the first-layer filtering feature set is established as follows: splitting each regular expression to obtain the corresponding precise character strings and fuzzy strings; selecting, from the precise character strings corresponding to each regular expression, those whose length exceeds the preset threshold, and combining the selected precise character strings into a candidate string set; and selecting from the candidate string set, according to the priority order of the precise character strings, one precise character string for each regular expression, and combining the selected strings into the first-layer filtering feature set.
- 3. The method according to claim 2, characterized in that, after each regular expression is split to obtain the corresponding precise character strings and fuzzy strings, the method further comprises: determinizing the fuzzy strings and merging the result with adjacent precise character string fragments.
- 4. The method according to claim 2 or 3, characterized in that selecting from the candidate string set, according to the priority order of the precise character strings, one precise character string for each regular expression and combining the selected strings into the first-layer filtering feature set comprises: setting the priority order of the precise character strings corresponding to each regular expression in the candidate string set according to string length, the priority order being adjusted during use according to the results of first-layer and second-layer filtering; and selecting, from the precise character strings corresponding to each regular expression, the one with the highest priority, and combining the selected strings into the first-layer filtering feature set.
- 5. The method according to claim 1, characterized in that the regular expression superset is generated as follows: splitting a regular expression to obtain precise character strings and fuzzy strings, representing the fuzzy strings with logical relation symbols, and generating the regular expression superset from the precise character strings and the logical relation symbols, wherein a logical relation symbol characterizes the logical relationship between a fuzzy string and its adjacent precise character strings.
- 6. A multi-pattern regular expression matching device, characterized in that the device comprises: a first-layer filter unit, configured to filter data to be matched according to a pre-established first-layer filtering feature set to obtain first-layer-filtered data fragments and hit precise character strings, wherein the first-layer filtering feature set includes one precise character string, extracted from each regular expression, whose length exceeds a preset threshold; a second-layer filter unit, configured to look up the corresponding regular expression superset according to the hit precise character strings, and to perform second-layer filtering on the first-layer-filtered data fragments according to the regular expression superset to obtain second-layer-filtered data fragments and the hit regular expression superset, wherein the regular expression superset is an expression composed according to the logical relationship between the precise character strings and the fuzzy strings of a regular expression; and a matching unit, configured to determine the corresponding regular expression according to the hit regular expression superset, and to match the second-layer-filtered data fragments using that regular expression.
- 7. The device according to claim 6, characterized in that the device further comprises a first-layer filtering feature set generation unit configured to generate the first-layer filtering feature set, the first-layer filtering feature set generation unit comprising: a string splitting subunit, configured to split each regular expression to obtain the corresponding precise character strings and fuzzy strings; a candidate string set generation subunit, configured to select, from the precise character strings corresponding to each regular expression, those whose length exceeds the preset threshold, and to combine the selected precise character strings into a candidate string set; and a first-layer filtering feature set generation subunit, configured to select from the candidate string set, according to the priority order of the precise character strings, one precise character string for each regular expression, and to combine the selected strings into the first-layer filtering feature set.
- 8. The device according to claim 7, characterized in that the first-layer filtering feature set generation unit further comprises: a determinization subunit, configured to determinize the fuzzy strings and merge the result with adjacent precise character string fragments.
- 9. The device according to claim 7 or 8, characterized in that the first-layer filtering feature set generation subunit is specifically configured to: set the priority order of the precise character strings corresponding to each regular expression in the candidate string set according to string length, the priority order being adjusted during use according to the results of first-layer and second-layer filtering; and select, from the precise character strings corresponding to each regular expression, the one with the highest priority, and combine the selected strings into the first-layer filtering feature set.
- 10. The device according to claim 6, characterized in that the device further comprises: a regular expression superset generation unit, configured to split a regular expression to obtain precise character strings and fuzzy strings, represent the fuzzy strings with logical relation symbols, and generate the regular expression superset from the precise character strings and the logical relation symbols, wherein a logical relation symbol characterizes the logical relationship between a fuzzy string and its adjacent precise character strings.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510262867.9A CN104899264B (en) | 2015-05-21 | 2015-05-21 | A kind of multi-mode matching regular expressions method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510262867.9A CN104899264B (en) | 2015-05-21 | 2015-05-21 | A kind of multi-mode matching regular expressions method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104899264A CN104899264A (en) | 2015-09-09 |
CN104899264B true CN104899264B (en) | 2018-05-29 |
Family
ID=54031927
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510262867.9A Active CN104899264B (en) | 2015-05-21 | 2015-05-21 | A kind of multi-mode matching regular expressions method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104899264B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106911647A (en) * | 2015-12-23 | 2017-06-30 | 北京奇虎科技有限公司 | A kind of method and apparatus for detecting network attack |
CN106911649A (en) * | 2015-12-23 | 2017-06-30 | 北京奇虎科技有限公司 | A kind of method and apparatus for detecting network attack |
CN106202004B (en) * | 2016-07-13 | 2019-10-11 | 上海轻维软件有限公司 | Combined data cutting method based on regular expressions and separator |
CN108062295B (en) * | 2016-11-09 | 2021-11-05 | 北京国双科技有限公司 | Content processing method and device |
CN107633074B (en) * | 2017-09-22 | 2020-06-09 | 咪咕文化科技有限公司 | Information extraction method and device and storage medium |
CN107992481B (en) * | 2017-12-25 | 2021-05-04 | 鼎富智能科技有限公司 | Regular expression matching method, device and system based on multi-way tree |
CN108920463A (en) * | 2018-06-29 | 2018-11-30 | 北京奇虎科技有限公司 | A kind of segmenting method and system based on network attack |
CN109871502B (en) * | 2019-01-18 | 2020-10-30 | 北京赛思信安技术股份有限公司 | Stream data regular matching method based on Storm |
CN110096626A (en) * | 2019-03-18 | 2019-08-06 | 平安普惠企业管理有限公司 | Processing method, device, equipment and the storage medium of contract text data |
CN111125693A (en) * | 2019-12-18 | 2020-05-08 | 杭州安恒信息技术股份有限公司 | Equipment safety protection method, device and equipment |
CN111556014B (en) * | 2020-03-24 | 2022-07-15 | 华东电力试验研究院有限公司 | Network attack intrusion detection method adopting full-text index |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100485684C (en) * | 2006-10-08 | 2009-05-06 | 中国科学院软件研究所 | Text content filtering method and system |
CN101685502A (en) * | 2008-09-24 | 2010-03-31 | 华为技术有限公司 | Mode matching method and device |
CN102521357A (en) * | 2011-12-13 | 2012-06-27 | 曙光信息产业(北京)有限公司 | System and method for achieving accurate matching of texts by automaton |
CN102523219B (en) * | 2011-12-16 | 2015-01-14 | 清华大学 | Regular expression matching system and regular expression matching method |
CN102857493B (en) * | 2012-06-30 | 2015-07-08 | 华为技术有限公司 | Content filtering method and device |
-
2015
- 2015-05-21 CN CN201510262867.9A patent/CN104899264B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN104899264A (en) | 2015-09-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||