CN104899264B - Multi-pattern regular expression matching method and device - Google Patents

Multi-pattern regular expression matching method and device

Info

Publication number
CN104899264B
Authority
CN
China
Prior art keywords
character string
regular expression
precise character
layer
superset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510262867.9A
Other languages
Chinese (zh)
Other versions
CN104899264A (en
Inventor
侯智瀚
邹荣珠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201510262867.9A
Publication of CN104899264A
Application granted
Publication of CN104899264B
Active (legal status)
Anticipated expiration (legal status)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a multi-pattern regular expression matching method and device. The method includes: filtering the data to be matched against a pre-established first-layer filtering feature set to obtain first-layer filtered data fragments and the precise character strings that were hit; looking up the regular expression supersets corresponding to the hit precise character strings, and applying second-layer filtering with those supersets to the first-layer filtered data fragments to obtain second-layer filtered data fragments and the regular expression supersets that were hit; and determining the regular expressions corresponding to the hit supersets and matching the second-layer filtered data fragments against those regular expressions. Through two layers of filtering, the technical solution improves both the filtering rate and the filtering effect and thereby keeps matching performance stable: while ensuring that attack data passes the filters, it prevents clean data from passing as far as possible.

Description

Multi-pattern regular expression matching method and device
Technical field
This application relates to the technical field of network security, and in particular to a multi-pattern regular expression matching method and device.
Background
A regular expression is a form of expression for describing character strings. Because it is both flexible and precise, it is widely used in the field of network security, often to describe network data that carries attack intent. An intrusion detection system usually contains a set of regular expressions describing a large number of attack signatures. During detection, multi-pattern regular expression matching is used to match this set against the network data stream in order to find attacks. As the internet develops, network services multiply, network environments grow more complex, and data stream bandwidth keeps increasing, so multi-pattern regular expression matching is required to accommodate more numerous and more complex regular expressions while matching faster and using less storage.
Traditional multi-pattern regular expression matching methods fall into three classes. The first class is nondeterministic finite automaton (NFA) matching, which needs little storage but has an unpredictable number of active states, so matching is usually slow. The second class is deterministic finite automaton (DFA) matching, which is fast, but large rule sets or complex regular expressions written in particular ways may cause state explosion, so that automaton construction takes too long or even exhausts memory and cannot complete. The third class first performs pre-filter matching using exact-string multi-pattern matching or extended automaton matching; only when the pre-filter hits, indicating that a successful match may exist nearby, is an NFA or DFA used for confirmation. Compared with the first two classes, the third is easier to implement and scales better. The third class, also called pre-filter matching, is therefore the one commonly used at present. It specifically includes:
String-level filtering is applied to the data stream to be matched: when the keywords in the data stream share at least one feature with the preset feature words, the data stream passes the string-level filter. Regular expression matching is then performed on the data streams that pass. Because the strings in this method are extracted directly from the regular expressions, neither their length nor their number can guarantee filtering quality. For example, when the strings extracted from one or more of the regular expressions are short or lack distinguishing power, the filtering effect is poor, a huge volume of data reaches regular expression matching, and overall matching performance suffers badly.
Summary of the invention
The technical problem to be solved by the present invention is to provide a multi-pattern regular expression matching method that improves the filtering rate and filtering effect through two layers of filtering, thereby keeping matching performance stable.
The present invention also provides a multi-pattern regular expression matching device to ensure that the above method can be implemented and applied in practice.
In one aspect, the present invention provides a multi-pattern regular expression matching method, which includes:
filtering the data to be matched against a pre-established first-layer filtering feature set to obtain first-layer filtered data fragments and the precise character strings that were hit, where the first-layer filtering feature set includes one precise character string, longer than a preset threshold, extracted from each regular expression;
looking up the regular expression superset corresponding to each hit precise character string, and applying second-layer filtering with that superset to the first-layer filtered data fragments to obtain second-layer filtered data fragments and the regular expression supersets that were hit, where a regular expression superset is an expression built from the logical relations between the precise character strings and the fuzzy strings of a regular expression;
determining the regular expression corresponding to each hit regular expression superset, and matching the second-layer filtered data fragments against that regular expression.
Preferably, the first-layer filtering feature set is established as follows:
splitting each regular expression to obtain its precise character strings and fuzzy strings;
selecting, from the precise character strings of each regular expression, those longer than the preset threshold, and combining the selected strings into a candidate string set;
selecting from the candidate string set, in priority order, one precise character string for each regular expression, and combining the selected strings into the first-layer filtering feature set.
Preferably, after each regular expression is split to obtain its precise character strings and fuzzy strings, the method further includes:
determinizing the fuzzy strings and merging the resulting fragments with the adjacent precise character strings.
Preferably, selecting from the candidate string set, in priority order, one precise character string for each regular expression to form the first-layer filtering feature set includes:
setting the priority order of the precise character strings of each regular expression in the candidate string set according to string length, and adjusting that priority order during use according to the results of first-layer and second-layer filtering; and selecting, from the precise character strings of each regular expression, the one with the highest priority, and combining the selected strings into the first-layer filtering feature set.
Preferably, the regular expression superset is generated as follows:
splitting the regular expression to obtain precise character strings and fuzzy strings, replacing the fuzzy strings with logical relation symbols, and generating the regular expression superset from the precise character strings and the logical relation symbols, where a logical relation symbol characterizes the logical relation between a fuzzy string and its adjacent precise character strings.
In another aspect, the present invention provides a multi-pattern regular expression matching device, which includes:
a first-layer filtering unit, configured to filter the data to be matched against a pre-established first-layer filtering feature set to obtain first-layer filtered data fragments and the precise character strings that were hit, where the first-layer filtering feature set includes one precise character string, longer than a preset threshold, extracted from each regular expression;
a second-layer filtering unit, configured to look up the regular expression superset corresponding to each hit precise character string, and to apply second-layer filtering with that superset to the first-layer filtered data fragments to obtain second-layer filtered data fragments and the regular expression supersets that were hit, where a regular expression superset is an expression built from the logical relations between the precise character strings and the fuzzy strings of a regular expression;
a matching unit, configured to determine the regular expression corresponding to each hit regular expression superset and to match the second-layer filtered data fragments against that regular expression.
Preferably, the device further includes:
a first-layer filtering feature set generation unit, configured to generate the first-layer filtering feature set.
The first-layer filtering feature set generation unit includes:
a string splitting subunit, configured to split each regular expression to obtain its precise character strings and fuzzy strings;
a candidate string set generation subunit, configured to select, from the precise character strings of each regular expression, those longer than the preset threshold, and to combine the selected strings into a candidate string set;
a first-layer filtering feature set generation subunit, configured to select from the candidate string set, in priority order, one precise character string for each regular expression, and to combine the selected strings into the first-layer filtering feature set.
Preferably, the first-layer filtering feature set generation unit further includes:
a determinization subunit, configured to determinize the fuzzy strings and merge the resulting fragments with the adjacent precise character strings.
Preferably, the first-layer filtering feature set generation subunit is specifically configured to:
set the priority order of the precise character strings of each regular expression in the candidate string set according to string length, and adjust that priority order during use according to the results of first-layer and second-layer filtering; and select, from the precise character strings of each regular expression, the one with the highest priority, and combine the selected strings into the first-layer filtering feature set.
Preferably, the device further includes:
a regular expression superset generation unit, configured to split the regular expression to obtain precise character strings and fuzzy strings, replace the fuzzy strings with logical relation symbols, and generate the regular expression superset from the precise character strings and the logical relation symbols, where a logical relation symbol characterizes the logical relation between a fuzzy string and its adjacent precise character strings.
Compared with the prior art, the present invention has the following advantages:
The present invention improves the filtering effect through two layers of filtering. Specifically, the data to be matched is filtered against a pre-established first-layer filtering feature set to obtain first-layer filtered data fragments and the hit precise character strings, where the feature set includes one precise character string, longer than a preset threshold, extracted from each regular expression. The first-layer feature set is chosen on the principle that fewer strings and longer strings are better, and only one string is selected from each regular expression. This differs from the exact strings used in the prior art, so first-layer filtering can lower the pass rate of clean data and mainly serves to maximize filtering speed.
When first-layer filtering is complete, second-layer filtering follows. Specifically, the regular expression superset corresponding to each hit precise character string is looked up, and second-layer filtering is applied with that superset to the first-layer filtered data fragments to obtain second-layer filtered data fragments and the regular expression supersets that were hit, where a regular expression superset is an expression built from the logical relations between the precise character strings and the fuzzy strings of a regular expression. Because second-layer filtering uses regular expression supersets, whose descriptive power is very close to that of the original regular expressions, it can exclude non-matching data as far as possible and keep only the data fragments closest to the original regular expressions, preventing large amounts of "muddy" (suspicious) data from reaching the final regular expression matching stage. Second-layer filtering mainly serves to maximize the filtering effect and thereby indirectly improves matching efficiency. It also decomposes multi-pattern matching into single-pattern matching in different scenarios (different data fragments), which avoids DFA size explosion and slow NFA matching. Finally, each data fragment that passes second-layer filtering is matched individually against its corresponding regular expression. Through two layers of filtering, the technical solution therefore improves both the filtering rate and the filtering effect and keeps matching performance stable: while ensuring that attack data passes the filters, it prevents clean data from passing as far as possible.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an embodiment of the multi-pattern regular expression matching method of the present invention;
Fig. 2 is a flowchart of the generation method of the first-layer filtering feature set provided by the invention;
Fig. 3 is a structural diagram of an embodiment of the multi-pattern regular expression matching device of the present invention.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments in this application without creative effort fall within the scope of protection of this application.
The application can be used in numerous general-purpose or special-purpose computing device environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multi-processor devices, and distributed computing environments that include any of the above devices or equipment.
The application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media, including storage devices.
Referring to Fig. 1, which is a flowchart of an embodiment of the multi-pattern regular expression matching method of the present invention, the method may include:
S101: filter the data to be matched against a pre-established first-layer filtering feature set to obtain first-layer filtered data fragments and the precise character strings that were hit. The first-layer filtering feature set includes one precise character string, longer than a preset threshold, extracted from each regular expression.
The first-layer filtering feature set should be generated before the matching process is carried out. It contains exactly one precise character string for each regular expression, and that string must be longer than the preset threshold.
The invention also provides a concrete method for generating the first-layer filtering feature set. Referring to Fig. 2, which is a flowchart of the generation method of the first-layer filtering feature set provided by the invention, the method includes:
S201: split each regular expression to obtain its precise character strings and fuzzy strings.
It should be noted that regular expressions take different forms depending on their syntax, but generally contain elements such as letters, digits, quotation marks, and special syntax symbols. Under a given splitting condition, a regular expression can be divided into two types of strings, called precise character strings and fuzzy strings. The number of strings of each type obtained by splitting is not fixed.
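As a rough illustration of S201, the Python sketch below splits a pattern into precise character strings and fuzzy strings. It is not the patent's splitting procedure: the function name `split_regex` and the token table are hypothetical, and it assumes a very small subset of regular expression syntax (ordinary characters, `.`, `^`, `$`, bracketed character classes, the escapes `\d \w \s`, and the quantifiers `* + ? {m,n}`). Groups, alternation, and escaped literals are rejected rather than handled.

```python
import re

# Hypothetical, simplified token table. Anything that does not denote exactly
# one fixed character is treated as part of a fuzzy string.
TOKEN = re.compile(r"""
    (?P<cls>\[[^\]]*\]|\\[dDwWsS]|\.)     # matches one of several characters
  | (?P<quant>[*+?]|\{\d+(?:,\d*)?\})     # repeats the preceding item
  | (?P<anchor>[\^$])                     # position assertions
  | (?P<lit>[^\\\[\](){}|*+?.^$])         # one ordinary character
""", re.VERBOSE)


def split_regex(pattern):
    """Return (precise_strings, fuzzy_strings) in left-to-right order."""
    precise, fuzzy = [], []
    lit, fz = "", ""          # literal run and fuzzy run being accumulated
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at offset {pos}")
        kind, text = m.lastgroup, m.group()
        if kind == "lit":
            if fz:                      # a literal ends the current fuzzy run
                fuzzy.append(fz)
                fz = ""
            lit += text
        else:
            if kind == "quant" and lit:
                # The quantifier binds to the last literal character, so that
                # character is no longer guaranteed to appear exactly once.
                text = lit[-1] + text
                lit = lit[:-1]
            if lit:                     # a fuzzy token ends the literal run
                precise.append(lit)
                lit = ""
            fz += text
        pos = m.end()
    if lit:
        precise.append(lit)
    if fz:
        fuzzy.append(fz)
    return precise, fuzzy
```

For example, `split_regex(r"select.*from\s+users[0-9]{1,3}")` yields the precise character strings `select`, `from`, `users` and the fuzzy strings `.*`, `\s+`, `[0-9]{1,3}`. Moving a quantified literal into the fuzzy string is a conservative choice made for this sketch.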
Considering that the longer the precise character strings are, the more efficient first-layer filtering becomes, the inventors further propose an implementation that maximizes precise character string length as far as possible without narrowing the semantic coverage. Specifically, the following step is added after S201:
Determinize the fuzzy strings and merge the resulting fragments with the adjacent precise character strings. The subsequent steps are then carried out on the merged precise character strings.
After S201 is complete, S202 and S203 are carried out.
S202: select, from the precise character strings of each regular expression, those longer than the preset threshold, and combine the selected strings into a candidate string set.
S203: select from the candidate string set, in priority order, one precise character string for each regular expression, and combine the selected strings into the first-layer filtering feature set.
In a specific implementation, S203 can set priority from high to low in descending order of precise character string length, then select for each regular expression the precise character string with the highest priority (the longest one), and combine the selected strings to generate the first-layer filtering feature set.
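Steps S202 and S203 can be sketched in a few lines of Python. The function name `build_first_layer`, the rule identifiers, and the default threshold value are assumptions made for illustration; the patent does not fix a threshold in this section, nor does it say here what happens when a regular expression has no sufficiently long precise character string (this sketch simply leaves that expression without a feature).

```python
def build_first_layer(precise_by_regex, min_len=3):
    """Build the candidate string set (S202) and the feature set (S203).

    precise_by_regex maps a hypothetical rule id to the precise character
    strings obtained by splitting that rule's regular expression.
    """
    candidates = {}   # rule id -> candidates, highest priority first
    features = {}     # rule id -> the single string used by layer one
    for rule_id, strings in precise_by_regex.items():
        # S202: keep only strings longer than the preset threshold.
        long_enough = [s for s in strings if len(s) > min_len]
        # S203: priority follows descending length (sorted() is stable,
        # so equal-length strings keep their original order).
        ranked = sorted(long_enough, key=len, reverse=True)
        candidates[rule_id] = ranked
        if ranked:
            features[rule_id] = ranked[0]
    return candidates, features
```

With the input `{"r1": ["select", "from", "users"], "r2": ["cmd", "/bin/sh"]}` and `min_len=3`, the feature set becomes `{"r1": "select", "r2": "/bin/sh"}`; `cmd` is dropped because its length does not exceed the threshold.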
To further guarantee the first-layer filtering effect, the inventors also propose adjusting the first-layer filtering feature set adaptively according to actual filtering results. Specifically:
Set the priority order of the precise character strings of each regular expression in the candidate string set according to string length, and adjust that priority order during use according to the results of first-layer and second-layer filtering; then select, from the precise character strings of each regular expression, the one with the highest priority, and combine the selected strings into the first-layer filtering feature set.
This implementation can be understood as follows. In the initial stage, the first-layer filtering feature set directly takes the highest-priority precise character string (the longest one) of each regular expression. In later stages, the priority order is adapted according to the results of first-layer and second-layer filtering: if a precise character string is hit during first-layer filtering but nothing is hit during second-layer filtering, the priority of that string is lowered to the minimum, and the string next to it in order in the candidate string set is promoted to the highest priority. Throughout the matching process, this round-robin approach lets precise character strings with higher distinguishing power remain in the first-layer filtering feature set for as long as possible, keeping the actual effect of first-layer filtering optimal.
In a specific implementation, the priority order of the precise character strings in the candidate string set can be adjusted dynamically in the manner described above, so that the first-layer filtering feature set changes dynamically with actual filtering conditions to meet first-layer filtering needs.
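The priority adjustment described above behaves like a rotating queue per regular expression. The Python sketch below is one possible reading, not the patent's implementation: the class name `CandidateQueue` and its `report` method are hypothetical, and a `collections.deque` is used so that demoting the current string to lowest priority also promotes its successor.

```python
from collections import deque


class CandidateQueue:
    """Candidate precise character strings of one regular expression."""

    def __init__(self, strings):
        # Initial priority: longest string first.
        self._queue = deque(sorted(strings, key=len, reverse=True))

    @property
    def active(self):
        """The string currently placed in the first-layer feature set."""
        return self._queue[0]

    def report(self, layer1_hit, layer2_hit):
        """Record one filtering outcome and return the active string."""
        if layer1_hit and not layer2_hit:
            # A first-layer hit that the second layer rejected: move the
            # active string to the back and promote the next one.
            self._queue.rotate(-1)
        return self.active
```

Starting from `["from", "select", "users"]`, the active string is `select`; a first-layer hit followed by a second-layer miss switches it to `users`, while a hit confirmed by the second layer leaves it unchanged.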
For a better understanding of step S101, it is explained further below using a concrete scenario.
In attack detection applications, the regular expressions used generally have clearly distinguishing features. The present invention makes full use of the difference between the data described by the regular expressions and ordinary normal data to achieve good filtering and greatly improve detection efficiency.
In essence, first-layer filtering matches the most distinctive precise character string of each regular expression, so that "clean" data does not need to enter the subsequent in-depth detection process, and so that later steps receive more accurate and smaller volumes of "muddy" data. In general, most network data is "clean"; after first-layer filtering, this data is filtered out and does not proceed to second-layer filtering or matching, which greatly simplifies subsequent processing. Here "clean" data means non-suspicious network data and "muddy" data means suspicious network data.
First-layer filtering may yield one data fragment and one hit precise character string, or several data fragments and several hit precise character strings, or no data fragment and no hit precise character string at all. If no data fragment is obtained, the data to be matched does not satisfy any matching rule and is considered clean, so the subsequent steps are skipped and the matching process ends.
When first-layer filtering yields data fragments, the process continues with S102 and S103.
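The three possible outcomes of first-layer filtering can be sketched as follows. This section of the patent does not say how a data fragment is delimited, so the sketch assumes, purely for illustration, that a fragment is a fixed-size window around each hit; the function name `first_layer_filter` and the `window` parameter are hypothetical. A production system would likely use a multi-string matcher such as Aho-Corasick instead of repeated `str.find` calls.

```python
def first_layer_filter(data, features, window=64):
    """Return a list of (rule_id, hit_string, fragment) tuples.

    features maps a hypothetical rule id to its single precise character
    string. An empty result means the data is treated as clean.
    """
    hits = []
    for rule_id, needle in features.items():
        start = data.find(needle)
        while start != -1:
            # Assumed fragment definition: the hit plus `window` characters
            # of context on each side, clipped to the data boundaries.
            lo = max(0, start - window)
            hi = min(len(data), start + len(needle) + window)
            hits.append((rule_id, needle, data[lo:hi]))
            start = data.find(needle, start + 1)
    return hits
```

An empty list corresponds to the case where matching ends immediately; one or more tuples correspond to the cases where the process continues with the second layer.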
Steps S102 and S103 of this embodiment are described next.
S102: look up the regular expression superset corresponding to each hit precise character string, and apply second-layer filtering with that superset to the first-layer filtered data fragments to obtain second-layer filtered data fragments and the regular expression supersets that were hit. A regular expression superset is an expression built from the logical relations between the precise character strings and the fuzzy strings of a regular expression.
The regular expression supersets should be generated before second-layer matching is carried out; each regular expression has one corresponding regular expression superset. The invention also provides a concrete method for generating a regular expression superset, which includes:
Split the regular expression to obtain precise character strings and fuzzy strings, replace the fuzzy strings with logical relation symbols, and generate the regular expression superset from the precise character strings and the logical relation symbols. A logical relation symbol characterizes the logical relation between a fuzzy string and its adjacent precise character strings.
It should be noted that "superset" is to be understood in the set-theoretic sense: the superset of a data set necessarily contains all elements of that set and the logical relations among them. The range of elements covered by a regular expression superset is much larger than the range covered by the regular expression itself. As the generation method shows, the fuzzy strings are replaced by logical relation symbols; because the range covered by a logical relation symbol is much larger than the range covered by the fuzzy string, the range of the resulting superset is much larger than that of the regular expression. Because the strings in a regular expression have a certain order and a certain data length, the logical relation symbols can take forms such as ordered AND, AND, OR, or a length check, in order to build a superset of appropriate range.
Second-layer filtering is in essence a word matching process followed by a semantic matching process. Word matching means matching against the precise character strings in the regular expression supersets. Semantic matching means determining from the word matching results which superset syntax tree to match against, then traversing that tree; when the syntax tree is hit, the data fragment passes second-layer filtering and the regular expression superset is regarded as hit. The syntax tree is essentially another representation of the regular expression superset: the precise character strings in the superset are its terminal nodes and the logical relation symbols are its nonterminal nodes.
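To make the syntax tree idea concrete, the Python sketch below models a superset as a small tree whose leaves are precise character strings and whose inner nodes are logical relation symbols. Only two of the relations named in the text are modelled, ordered AND (`Seq`) and OR (`Alt`); unordered AND and the length check are omitted. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    """Terminal node: one precise character string."""
    text: str


@dataclass(frozen=True)
class Seq:
    """Nonterminal node: ordered AND (parts must appear in this order)."""
    parts: tuple


@dataclass(frozen=True)
class Alt:
    """Nonterminal node: OR (any one option suffices)."""
    options: tuple


def match_from(node, data, pos=0):
    """Return the smallest end offset of a match starting at or after pos,
    or None if the node cannot be satisfied."""
    if isinstance(node, Lit):
        i = data.find(node.text, pos)
        return None if i == -1 else i + len(node.text)
    if isinstance(node, Seq):
        for part in node.parts:
            pos = match_from(part, data, pos)
            if pos is None:
                return None
        return pos
    if isinstance(node, Alt):
        ends = [match_from(opt, data, pos) for opt in node.options]
        ends = [e for e in ends if e is not None]
        return min(ends) if ends else None
    raise TypeError(f"unknown node type: {type(node).__name__}")
```

For the tree `Seq((Lit("select"), Alt((Lit("from"), Lit("into"))), Lit("users")))`, the fragment `select name from users` hits, whereas `users from select` does not, because the strings appear in the wrong order. Taking the earliest possible match at each step leaves the most room for later parts, so the traversal does not reject a fragment that some other choice of occurrences would accept.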
In order to be better understood from above-mentioned steps S102, explanation is further explained to it by taking concrete scene as an example below.
In attack detecting application field, second layer filtering is carried out again primarily to keeping away after the filtering of above-mentioned first layer Exempt to occur " muddiness " data in large amount of complex network environment, block matching regular expressions link to the end, therefore, second The data that cannot hit mainly are screened in layer filtering, directly by these data distributions to reduce the data of matching regular expressions Amount, to avoid there is the problem of volume expansions problem of DFA or slow NFA matching speeds.
When completing second layer filtering, and being filled into some data fragmentations, S103 is put into be carried out most to data fragmentation Matching confirms eventually.
S103: Determine the corresponding regular expression according to the hit regular expression superset, and match the data fragments that passed second-layer filtering using that regular expression.
Most of the data to be matched is "clean" data, so most of it is diverted during first-layer and second-layer filtering, and very little data reaches the final stage. Moreover, second-layer filtering already determines which regular expression each data fragment is most likely to match, so multi-pattern matching is decomposed into single-pattern matching in different scenarios (each data fragment corresponds to one single-pattern match). On this basis, S103 may be implemented with a nondeterministic finite automaton (NFA) or with a deterministic finite automaton (DFA).
In addition, building on the two layers of filtering described above, the inventors also propose that step S103 may construct an automaton only for the fuzzy character string parts of the regular expression in order to complete the final match. This is because first-layer and second-layer filtering have already established that the precise character strings match the data fragment, so in the last stage only the fuzzy character string parts need to be matched.
Based on the above, S103 may be implemented in either of the following two ways:
(1) First way: construct a nondeterministic finite automaton or a deterministic finite automaton for the fuzzy character strings in the regular expression corresponding to the hit regular expression superset, and use the constructed automaton to match the data fragments that passed second-layer filtering.
(2) Second way: construct a nondeterministic finite automaton or a deterministic finite automaton for the regular expression corresponding to the hit regular expression superset, and use the constructed automaton to match the data fragments that passed second-layer filtering.
The generation method of the first-layer filtering feature set illustrated in Fig. 2 was described above. The implementation of S201 is explained further below through specific examples, mainly using PCRE regular expression syntax.
Splitting a regular expression mainly means using fuzzy factors as separators to divide the regular expression into precise character strings and fuzzy character strings. A fuzzy factor of a regular expression is a symbol, or a combination of symbols, that does not necessarily represent a single ASCII character.
According to the characteristics of PCRE syntax (a regular expression syntax), fuzzy factors can be defined as follows:
(1) Metacharacters: ".", "^", "$".
(2) Branch symbol: "|".
(3) Group delimiters: "(...)".
(4) Suffix symbols or phrases: "+", "?", "*", "{m,n}", "{m,}", "{,n}", "{m}".
(5) Character classes, such as [a-z], [^0-9], [[:alpha:]].
(6) Reference (escape) symbols, that is, a backslash followed by a token described by the regular expression ([DdSsWwBbeZAabtnvfr0-9])|([Pp](({[a-zA-Z_]*})|([a-zA-Z]))); for example, \d, \f, \n, \pF, and \p{Name} are reference symbols.
(7) Advanced syntax symbols for lazy matching and for modifying the match mode, such as "*?", "+?", "??", "{m,n}?", "\Q...\E", "(?i)", "(?-i)", "(?s)", "(?-s)", "(?m)", "(?-m)".
In a specific implementation, a fuzzy factor may also be any combination of the above forms.
Next, the fuzzy decision conditions are defined:
(a) A suffix symbol or phrase is merged with its grammatical prefix into one fuzzy character string; for example, the regular expression abc+d can be split to obtain the fuzzy character string c+;
(b) For the branch symbol, if it is used together with group delimiters, all branches within the group are taken as one fuzzy character string; otherwise no split is made;
(c) Each remaining fuzzy factor serves as a separator and is itself a fuzzy character string;
(d) All character strings separated by fuzzy character strings are precise character strings.
Finally, the regular expression is scanned according to the fuzzy decision conditions defined above and divided into an ordered sequence of precise character strings and fuzzy character strings, and the positional relations between the character strings are retained.
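A simplified scanner following conditions (a) to (d) might look like the sketch below. It is an illustration only: `split_regex` and its helpers are hypothetical names, every backslash escape is treated as a fuzzy factor, POSIX classes such as [[:alpha:]] and the inline options of item (7) are not handled, a top-level branch symbol is interpreted as "return the whole pattern as one fuzzy character string", and malformed patterns are not validated.

```python
import re

# Suffix symbols: + * ? {m,n} {m,} {,n} {m}, optionally followed by a lazy "?".
_SUFFIX = re.compile(r"(?:[+*?]|\{(?:\d+(?:,\d*)?|,\d+)\})\??")


def _class_end(p, i):
    """Index just past the "]" closing the character class opened at p[i]."""
    j = i + 1
    if p[j:j + 1] == "^":
        j += 1
    if p[j:j + 1] == "]":            # a leading "]" is a literal
        j += 1
    while p[j] != "]":
        j += 2 if p[j] == "\\" else 1
    return j + 1


def _group_end(p, i):
    """Index just past the ")" closing the group opened at p[i]."""
    depth, j = 0, i
    while True:
        if p[j] == "\\":
            j += 2
            continue
        if p[j] == "[":
            j = _class_end(p, j)
            continue
        if p[j] == "(":
            depth += 1
        elif p[j] == ")":
            depth -= 1
            if depth == 0:
                return j + 1
        j += 1


def split_regex(p):
    """Split pattern p into ordered ("exact" | "fuzzy", text) segments."""
    atoms, i = [], 0
    while i < len(p):
        c = p[i]
        if c == "|":                  # condition (b): branch outside a group
            return [("fuzzy", p)]
        if c == "\\":
            kind, j = "fuzzy", i + 2
        elif c == "[":
            kind, j = "fuzzy", _class_end(p, i)
        elif c == "(":
            kind, j = "fuzzy", _group_end(p, i)
        elif c in ".^$":
            kind, j = "fuzzy", i + 1
        else:
            kind, j = "exact", i + 1
        suffix = _SUFFIX.match(p, j)
        if suffix:                    # condition (a): suffix joins its prefix
            kind, j = "fuzzy", suffix.end()
        atoms.append((kind, p[i:j]))
        i = j
    segments = []
    for kind, text in atoms:          # merge neighbours of the same kind
        if segments and segments[-1][0] == kind:
            segments[-1] = (kind, segments[-1][1] + text)
        else:
            segments.append((kind, text))
    return segments
```

For instance, `split_regex("abc+d")` yields the segments `ab` (precise), `c+` (fuzzy), and `d` (precise), in line with condition (a).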
Examples follow:
Example 1: The regular expression "20[01][\x09-\x0d-~]*(AUTHINFO USER|news)finger" can be split into three character strings: "20", "[01][\x09-\x0d-~]*(AUTHINFO USER|news)", and "finger". The first string "20" and the last string "finger" are precise character strings, and the middle string "[01][\x09-\x0d-~]*(AUTHINFO USER|news)" is a fuzzy character string.
After splitting, the fuzzy character strings may be further determinized. The determinization of fuzzy character strings is explained below by way of example.
Determinization makes more definite those fuzzy character strings that have certain specific compositions. Its guiding principles are: keep the operation simple, do not narrow the semantic coverage, merge adjacent precise character strings, and keep the number of split character strings manageable. For example, it can be carried out in steps:
(1) A fuzzy character string containing a suffix symbol or phrase is determinized into a character string that carries precise characters. For example, "a{m,n}", "a{m}", and "a{m,}" are determinized into "a…a a{0,n-m}", "a…a a{0}", and "a…a a*" respectively, where "a…a" denotes a repeated m times. The precise characters obtained by determinization are merged with the adjacent precise character string obtained from splitting.
Example 2: The regular expression "abc{3,10}de" is split into "ab, c{3,10}, de". The fuzzy character string c{3,10} is determinized into cccc{0,7}, and the resulting precise characters are merged with the adjacent precise character string "ab" on the left to obtain "abccc, c{0,7}, de".
As another example, "a+" is determinized into "aa*", and the precise character obtained by determinization is merged with the adjacent precise character string obtained from splitting.
Example 3: The regular expression "abc+de" is split into "ab, c+, de". The fuzzy character string "c+" is determinized into "cc*" and then, according to the positional relations of the split character strings, merged with the adjacent precise character string "ab" to obtain "abc, c*, de".
Example 4: The regular expression "ab(cd)+de" is split into "ab, (cd)+, de". The fuzzy character string "(cd)+" is determinized into (cd)(cd)*, and merging finally gives "ab, (cd), (cd)*, de".
(2) For a fuzzy character string containing group delimiters, if it contains neither the branch symbol "|" nor a suffix symbol or phrase internally, the group delimiters can be deleted directly and the contents merged with the adjacent precise character strings as far as possible. In this case, Example 4 above can be further determinized into "abcd, (cd)*, de".
(3) For a fuzzy character string containing the branch symbol, try to extract the common prefix and common suffix strings, and merge them with the adjacent precise character strings as far as possible.
Example 5: The regular expression "x(abcde|abcfe)y" is split into "x, (abcde|abcfe), y" and then determinized into "xabc, (d|f), ey".
If there is no common prefix or suffix string, but the number of branches is small and each branch contains a relatively long precise character string, the fuzzy character string can be expanded by branch.
For example, after branch expansion, Example 1 above is determinized into "20, [01][\x09-\x0d-~]*, AUTHINFO USER, finger" or "20, [01][\x09-\x0d-~]*, news, finger".
Through the determinization approaches above, some fuzzy character strings can be further determinized so as to extend the length of the precise character strings.
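Steps (1) and (3) can be sketched as two small helpers. The names `determinize_repeat` and `factor_branches` are hypothetical, the repeated element is passed in already separated from its suffix, and the branch helper assumes every branch is a plain literal string.

```python
import re


def determinize_repeat(atom, quant):
    """Step (1): split atom+quant into (precise prefix, residual fuzzy string)."""
    if quant == "+":
        return atom, atom + "*"                  # a+  ->  a, a*
    m = re.fullmatch(r"\{(\d+)(,(\d*))?\}", quant)
    if not m:
        return "", atom + quant                  # "*", "?": nothing to extract
    low = int(m.group(1))
    if m.group(2) is None:                       # a{m}   ->  a...a
        return atom * low, ""
    if m.group(3) == "":                         # a{m,}  ->  a...a, a*
        return atom * low, atom + "*"
    high = int(m.group(3))                       # a{m,n} ->  a...a, a{0,n-m}
    return atom * low, f"{atom}{{0,{high - low}}}"


def _common_prefix(strings):
    out = []
    for chars in zip(*strings):
        if len(set(chars)) != 1:
            break
        out.append(chars[0])
    return "".join(out)


def factor_branches(branches):
    """Step (3): pull the common prefix and suffix out of literal branches."""
    prefix = _common_prefix(branches)
    rest = [b[len(prefix):] for b in branches]
    suffix = _common_prefix([r[::-1] for r in rest])[::-1]
    middle = [r[:len(r) - len(suffix)] for r in rest]
    return prefix, "(" + "|".join(middle) + ")", suffix
```

Applied to Example 2, `determinize_repeat("c", "{3,10}")` returns `("ccc", "c{0,7}")`, and prepending the left neighbour `"ab"` gives the precise character string `abccc`. Applied to Example 5, `factor_branches(["abcde", "abcfe"])` returns `("abc", "(d|f)", "e")`, which merges with `x` and `y` into `xabc` and `ey`.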
Next, the generation method of the regular expression superset mentioned above is described, taking regular expressions in PCRE syntax as an example.
The generation method of the regular expression superset includes: splitting the regular expression to obtain precise character strings and fuzzy character strings; representing the fuzzy character strings with logical relation symbols; and generating the regular expression superset from the precise character strings and the logical relation symbols. A logical relation symbol characterizes the logical relation between a fuzzy character string and its adjacent precise character strings.
For the splitting involved in this method, refer to the description of S201 above. After the precise and fuzzy character strings are obtained, the fuzzy character strings may be further determinized; fuzzy character strings that cannot be determinized are deleted, and the data length range of each deleted fuzzy character string is recorded. The character strings are then recombined into a new expression according to their original positional and logical relations. The data set described by this new expression is a superset of the data set described by the original regular expression, and the new expression is called the regular expression superset. In the present invention, one regular expression corresponds to one superset.
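As a minimal sketch of the recombination step, assume the splitting step has produced ordered `(kind, text)` segments and that each deleted fuzzy character string simply becomes an ordered-AND relation, written `..` as in the examples of this description. The function name `build_superset` is hypothetical, and recording the length range of the deleted fuzzy character strings is omitted.

```python
def build_superset(segments):
    """Drop the fuzzy segments and join the precise ones with ordered AND."""
    exact = [text for kind, text in segments if kind == "exact"]
    return "..".join(f'"{s}"' for s in exact)
```

For the segments of abc.*def this produces `"abc".."def"`. Any data matching the original expression necessarily contains `abc` followed later by `def`, which is why the result describes a superset.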
The regular expression superset accepts a broader range of data than the original regular expression, and it eliminates the fuzzy factors of the original regular expression, which are the part of multi-pattern matching with relatively high space and time complexity. It therefore matches more efficiently, which improves the performance of second-layer filtering.
In practice, supersets can be generated for regular expressions of different syntaxes according to the characteristics of each syntax, and logical symbols or decision symbols can be defined for the common forms of fuzzy character strings. In general, these include at least: ordered AND, AND, OR, and length decision symbols.
An example is given below:
(1) The "ordered AND" relation between character strings is represented by "..". Example 6: the expression superset of the regular expression abc.*def is "abc".."def", which expresses that the character string "abc" is matched first and "def" is matched afterwards.
(2) The "OR" relation between character strings is represented by "|".
For example, the expression superset of Example 1 above is "20".."AUTHINFO USER".."finger" or "20".."news".."finger"; after the OR symbol is introduced, it can be written as: ("20".."AUTHINFO USER".."finger")|("20".."news".."finger").
(3) The "AND" relation between character strings is represented by "&".
Preferably, ("A".."B")|("B".."A") can be optimized to "A"&"B". Example 7: the expression superset of the regular expression "abc.*def|def.*abc" is "abc"&"def".
(4) As in ordinary expressions, parentheses "()" raise the priority of character strings. Where no priority needs to be raised, they can be used to group character strings explicitly, for example in combination with a length decision symbol. The length decision symbols include: >, >=, <, <=, ==.
Example 8: The expression superset of the regular expression "abc{3}def" is ("abc".."def")==8, which expresses that "abc" is matched first, "def" is matched afterwards, and the size of the matched region is 8.
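The four kinds of symbols can be evaluated against a data fragment with a short recursive sketch. The nested-tuple encoding (`"seq"` for "..", `"or"` for "|", `"and"` for "&", `"len"` for a length decision) and the function names are assumptions made for illustration; the description does not prescribe a data structure.

```python
import operator

_COMPARE = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
            "<=": operator.le, "==": operator.eq}


def spans(expr, data):
    """Yield (start, end) regions of data that satisfy a superset expression."""
    if isinstance(expr, str):                     # precise character string
        i = data.find(expr)
        while i != -1:
            yield i, i + len(expr)
            i = data.find(expr, i + 1)
        return
    op, *args = expr
    if op == "or":                                # "|"
        for arg in args:
            yield from spans(arg, data)
    elif op == "seq":                             # "..": ordered AND
        first, *rest = args
        if not rest:
            yield from spans(first, data)
            return
        for s1, e1 in spans(first, data):
            for s2, e2 in spans(("seq", *rest), data):
                if s2 >= e1:                      # later string starts after earlier one ends
                    yield s1, e2
    elif op == "and":                             # "&": all present, any order
        parts = [list(spans(arg, data)) for arg in args]
        if all(parts):
            yield (min(s for p in parts for s, _ in p),
                   max(e for p in parts for _, e in p))
    elif op == "len":                             # length decision on the region
        inner, symbol, size = args
        for s, e in spans(inner, data):
            if _COMPARE[symbol](e - s, size):
                yield s, e


def superset_hit(expr, data):
    return next(spans(expr, data), None) is not None
```

Example 8 corresponds to `("len", ("seq", "abc", "def"), "==", 8)`: the fragment `abcccdef` is a hit because the region from `abc` to `def` spans 8 characters, while `abcdef` is not.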
The examples above are summarized in the following table:
The multi-pattern regular expression matching method provided by the present invention has been described above. The multi-pattern regular expression matching device provided by the present invention is explained below.
Referring to Fig. 3, which is a structural diagram of an embodiment of the multi-pattern regular expression matching device of the present invention, the device may include:
A first-layer filtering unit 301, configured to filter the data to be matched according to a pre-established first-layer filtering feature set, obtaining the data fragments that pass first-layer filtering and the precise character strings that are hit; the first-layer filtering feature set includes one precise character string, extracted from each regular expression, whose length is greater than a preset threshold;
A second-layer filtering unit 302, configured to look up the corresponding regular expression supersets according to the hit precise character strings, and to perform second-layer filtering on the data fragments that passed first-layer filtering according to those regular expression supersets, obtaining the data fragments that pass second-layer filtering and the regular expression supersets that are hit; a regular expression superset is an expression formed according to the logical relations between the precise character strings and the fuzzy character strings of a regular expression;
A matching unit 303, configured to determine the corresponding regular expression according to the hit regular expression superset, and to match the data fragments that passed second-layer filtering using that regular expression.
Preferably, the device further includes:
A first-layer filtering feature set generation unit, configured to generate the first-layer filtering feature set;
The first-layer filtering feature set generation unit includes:
A character string splitting subunit, configured to split each regular expression to obtain the corresponding precise character strings and fuzzy character strings;
A candidate character string set generation subunit, configured to select, from the precise character strings corresponding to each regular expression, those whose length is greater than the preset threshold, and to combine the selected precise character strings into a candidate character string set;
A first-layer filtering feature set generation subunit, configured to select, according to the priority order of the precise character strings, one precise character string for each regular expression from the candidate character string set, and to combine the selected strings into the first-layer filtering feature set.
Preferably, the first-layer filtering feature set generation unit further includes:
A determinization subunit, configured to determinize the fuzzy character strings and merge the results with the adjacent precise character string fragments.
Preferably, the first-layer filtering feature set generation subunit is specifically configured to:
Set the priority order of the precise character strings corresponding to each regular expression in the candidate character string set according to string length, the priority order being adjusted during use according to the results of first-layer and second-layer filtering; and select, from the precise character strings corresponding to each regular expression, the one with the highest priority, and combine the selected strings into the first-layer filtering feature set.
Preferably, the device further includes:
A regular expression superset generation unit, configured to split a regular expression to obtain precise character strings and fuzzy character strings, represent the fuzzy character strings with logical relation symbols, and generate the regular expression superset from the precise character strings and the logical relation symbols; a logical relation symbol characterizes the logical relation between a fuzzy character string and its adjacent precise character strings.
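The work of the first-layer filtering feature set generation unit described above could be sketched as follows. The function name, the `penalty` table, and the scoring rule (length minus penalty) are assumptions: the description states only that priority follows string length and is adjusted according to the filtering results, without saying how.

```python
def build_first_layer_features(exact_by_regex, threshold, penalty=None):
    """Pick one precise character string per regular expression.

    exact_by_regex: {regex_id: [precise character strings]}
    threshold:      only strings longer than this are candidates
    penalty:        hypothetical per-string adjustment learned during use,
                    e.g. raised when a string often hits in layer 1 but the
                    fragment is then rejected in layer 2
    """
    penalty = penalty or {}
    features = {}
    for regex_id, strings in exact_by_regex.items():
        candidates = [s for s in strings if len(s) > threshold]
        if candidates:
            # Longer strings have higher priority unless penalized.
            features[regex_id] = max(
                candidates, key=lambda s: len(s) - penalty.get(s, 0))
    return features
```

With `{"r1": ["ab", "abccc", "de"]}` and a threshold of 2, the unit would select `abccc` for `r1`. A regular expression with no string above the threshold gets no entry in this sketch and would need separate handling.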
The present invention improves the filtering effect by means of two layers of filtering. Specifically, the data to be matched is filtered according to a pre-established first-layer filtering feature set to obtain the data fragments that pass first-layer filtering and the precise character strings that are hit. The first-layer filtering feature set includes one precise character string, extracted from each regular expression, whose length is greater than a preset threshold. The precise character strings in the first-layer filtering feature set are extracted according to length, unlike the precise strings used in the prior art, so first-layer filtering can reduce the pass rate of clean data. Because each regular expression contributes only one precise character string to the first-layer filtering feature set, the filtering rate can be guaranteed.
When first-layer filtering is complete, second-layer filtering is performed. Specifically, the corresponding regular expression supersets are looked up according to the hit precise character strings, and second-layer filtering is performed on the data fragments that passed first-layer filtering according to those supersets, obtaining the data fragments that pass second-layer filtering and the regular expression supersets that are hit. A regular expression superset is an expression formed according to the logical relations between the precise character strings and the fuzzy character strings of a regular expression. Because second-layer filtering uses regular expression supersets, whose descriptive power is already very close to that of the original regular expressions, it can exclude unmatched data as far as possible and retain the data fragments closest to the original regular expressions. This keeps large amounts of "turbid" data from entering the final regular expression matching stage and so indirectly improves matching efficiency. At the same time, multi-pattern matching is decomposed into single-pattern matching in different scenarios (different data fragments), which avoids the problems of DFA size explosion and excessively slow NFA matching. Finally, each data fragment obtained from second-layer filtering is matched individually against its corresponding regular expression. The technical solution of the present invention therefore uses two layers of filtering to improve the filtering rate and filtering effect and thereby keep matching performance stable: while ensuring that attack data passes the filters, it prevents clean data from passing as far as possible.
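The overall flow summarized above can be condensed into a short end-to-end sketch. It is illustrative only: `Rule`, `ordered_hit`, and `match_fragment` are hypothetical names, substring search stands in for a real multi-string matcher in layer 1, each superset is reduced to an ordered list of precise character strings, and Python's backtracking `re` engine stands in for the NFA or DFA of the final stage.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    regex: str        # original regular expression
    feature: str      # the one precise character string used in layer 1
    required: tuple   # precise character strings of the superset, in order


def ordered_hit(required, data):
    """Layer 2 check: the strings occur in the given order without overlap."""
    pos = 0
    for s in required:
        pos = data.find(s, pos)
        if pos == -1:
            return False
        pos += len(s)
    return True


def match_fragment(rules, data):
    # Layer 1: keep rules whose feature string occurs in the fragment.
    candidates = [r for r in rules if r.feature in data]
    # Layer 2: keep rules whose superset is hit.
    candidates = [r for r in candidates if ordered_hit(r.required, data)]
    # Final stage: single-pattern match against each surviving rule.
    return [r.regex for r in candidates if re.search(r.regex, data)]
```

A clean fragment that contains none of the feature strings is diverted in layer 1 and never reaches the regular expression engine, which is the source of the claimed efficiency.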
It should be noted that the embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the identical or similar parts of the embodiments may be referred to one another. Since the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for related details, refer to the description of the method embodiments.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relation or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.
The multi-pattern regular expression matching method and device provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementation of this application, and the description of the above embodiments is intended only to aid understanding of the method of this application and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of this application, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting this application.

Claims (10)

  1. A multi-pattern regular expression matching method, characterized in that the method includes:
    filtering the data to be matched according to a pre-established first-layer filtering feature set to obtain the data fragments that pass first-layer filtering and the precise character strings that are hit; the first-layer filtering feature set includes one precise character string, extracted from each regular expression, whose length is greater than a preset threshold;
    looking up the corresponding regular expression supersets according to the hit precise character strings, and performing second-layer filtering on the data fragments that passed first-layer filtering according to those regular expression supersets to obtain the data fragments that pass second-layer filtering and the regular expression supersets that are hit; a regular expression superset is an expression formed according to the logical relations between the precise character strings and the fuzzy character strings of a regular expression;
    determining the corresponding regular expression according to the hit regular expression superset, and matching the data fragments that passed second-layer filtering using that regular expression.
  2. The method according to claim 1, characterized in that the first-layer filtering feature set is established as follows:
    splitting each regular expression to obtain the corresponding precise character strings and fuzzy character strings;
    selecting, from the precise character strings corresponding to each regular expression, those whose length is greater than the preset threshold, and combining the selected precise character strings into a candidate character string set;
    selecting, according to the priority order of the precise character strings, one precise character string for each regular expression from the candidate character string set, and combining the selected strings into the first-layer filtering feature set.
  3. The method according to claim 2, characterized in that, after each regular expression is split to obtain the corresponding precise character strings and fuzzy character strings, the method further includes:
    determinizing the fuzzy character strings and merging the results with the adjacent precise character string fragments.
  4. The method according to claim 2 or 3, characterized in that selecting, according to the priority order of the precise character strings, one precise character string for each regular expression from the candidate character string set and combining the selected strings into the first-layer filtering feature set includes:
    setting the priority order of the precise character strings corresponding to each regular expression in the candidate character string set according to string length, the priority order being adjusted during use according to the results of first-layer and second-layer filtering; and selecting, from the precise character strings corresponding to each regular expression, the one with the highest priority, and combining the selected strings into the first-layer filtering feature set.
  5. The method according to claim 1, characterized in that the regular expression superset is generated as follows:
    splitting the regular expression to obtain precise character strings and fuzzy character strings, representing the fuzzy character strings with logical relation symbols, and generating the regular expression superset from the precise character strings and the logical relation symbols; a logical relation symbol characterizes the logical relation between a fuzzy character string and its adjacent precise character strings.
  6. A multi-pattern regular expression matching device, characterized in that the device includes:
    a first-layer filtering unit, configured to filter the data to be matched according to a pre-established first-layer filtering feature set to obtain the data fragments that pass first-layer filtering and the precise character strings that are hit; the first-layer filtering feature set includes one precise character string, extracted from each regular expression, whose length is greater than a preset threshold;
    a second-layer filtering unit, configured to look up the corresponding regular expression supersets according to the hit precise character strings, and to perform second-layer filtering on the data fragments that passed first-layer filtering according to those regular expression supersets to obtain the data fragments that pass second-layer filtering and the regular expression supersets that are hit; a regular expression superset is an expression formed according to the logical relations between the precise character strings and the fuzzy character strings of a regular expression;
    a matching unit, configured to determine the corresponding regular expression according to the hit regular expression superset, and to match the data fragments that passed second-layer filtering using that regular expression.
  7. The device according to claim 6, characterized in that the device further includes:
    a first-layer filtering feature set generation unit, configured to generate the first-layer filtering feature set;
    the first-layer filtering feature set generation unit includes:
    a character string splitting subunit, configured to split each regular expression to obtain the corresponding precise character strings and fuzzy character strings;
    a candidate character string set generation subunit, configured to select, from the precise character strings corresponding to each regular expression, those whose length is greater than the preset threshold, and to combine the selected precise character strings into a candidate character string set;
    a first-layer filtering feature set generation subunit, configured to select, according to the priority order of the precise character strings, one precise character string for each regular expression from the candidate character string set, and to combine the selected strings into the first-layer filtering feature set.
  8. The device according to claim 7, characterized in that the first-layer filtering feature set generation unit further includes:
    a determinization subunit, configured to determinize the fuzzy character strings and merge the results with the adjacent precise character string fragments.
  9. The device according to claim 7 or 8, characterized in that the first-layer filtering feature set generation subunit is specifically configured to:
    set the priority order of the precise character strings corresponding to each regular expression in the candidate character string set according to string length, the priority order being adjusted during use according to the results of first-layer and second-layer filtering; and select, from the precise character strings corresponding to each regular expression, the one with the highest priority, and combine the selected strings into the first-layer filtering feature set.
  10. The device according to claim 6, characterized in that the device further includes:
    a regular expression superset generation unit, configured to split a regular expression to obtain precise character strings and fuzzy character strings, represent the fuzzy character strings with logical relation symbols, and generate the regular expression superset from the precise character strings and the logical relation symbols; a logical relation symbol characterizes the logical relation between a fuzzy character string and its adjacent precise character strings.
CN201510262867.9A 2015-05-21 2015-05-21 A kind of multi-mode matching regular expressions method and device Active CN104899264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510262867.9A CN104899264B (en) 2015-05-21 2015-05-21 A kind of multi-mode matching regular expressions method and device

Publications (2)

Publication Number Publication Date
CN104899264A CN104899264A (en) 2015-09-09
CN104899264B true CN104899264B (en) 2018-05-29

Family

ID=54031927

Country Status (1)

Country Link
CN (1) CN104899264B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant