Detailed Description
In the prior art, when text is matched using regular expressions, each regular expression must be compiled into a corresponding finite automaton, and all texts are then filtered by each finite automaton in turn. This specification instead combines the regular expressions into a small number of combined regular expressions and filters the text with the combined expressions, which can improve matching efficiency.
To enable those skilled in the art to better understand the technical solutions in one or more embodiments of this specification, the technical solutions are described clearly and completely below with reference to the accompanying drawings of one or more embodiments of this specification. Obviously, the described embodiments are only some, rather than all, of the embodiments of this specification. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this specification without creative effort shall fall within the protection scope of this specification.
Fig. 1 is a schematic diagram of the matching process using regular expressions provided by this specification, which specifically includes the following steps:
S100: Determine each first regular expression.
In this specification, regular expressions can be processed by a regular expression engine; for example, the engine can compile regular expressions and use them to match text.
The first regular expressions described in this specification are the original regular expressions that are used to match text and have not undergone any processing. In practical applications there may be more than ten, or even dozens, of these first regular expressions, for example ab(c|d), ab(e|f), ab(g|h), and so on.
S102: Combine the first regular expressions to obtain at least one second regular expression, where the number of second regular expressions is less than the number of first regular expressions.
In this specification, the first regular expressions may be combined by the regular expression engine, or by other hardware or software.
During combination, any syntax supported by the regular expression engine may be used to combine the first regular expressions, as long as the number of second regular expressions obtained is less than the number of first regular expressions. Specifically, the first regular expressions may be concatenated in any order, with every two adjacent first regular expressions joined by a specified connecting character, to obtain a second regular expression. The specified connecting character may specifically be "|".
Continuing the example above, assume the first regular expressions are ab(c|d), ab(e|f) and ab(g|h). These three first regular expressions can be combined into (ab(c|d))|(ab(e|f)) and ab(g|h), which gives two second regular expressions. Of course, the three first regular expressions can also be combined into a single second regular expression, (ab(c|d))|(ab(e|f))|(ab(g|h)).
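As a minimal sketch of the combination step, the following Python function wraps each first regular expression in parentheses and joins adjacent expressions with "|". The function name combine and the group_size parameter are illustrative assumptions and do not come from this specification.

```python
def combine(first_patterns, group_size=None):
    """Combine first regular expressions into fewer second regular expressions.

    Each first pattern is wrapped in parentheses so that it remains a
    self-contained subexpression, and adjacent patterns are joined with "|".
    group_size (hypothetical) limits how many patterns go into one result;
    by default all patterns are combined into a single expression.
    """
    patterns = list(first_patterns)
    if not patterns:
        return []
    size = group_size or len(patterns)
    wrapped = [f"({p})" for p in patterns]
    return ["|".join(wrapped[i:i + size]) for i in range(0, len(wrapped), size)]


first = ["ab(c|d)", "ab(e|f)", "ab(g|h)"]
one = combine(first)       # a single second regular expression
two = combine(first, 2)    # two second regular expressions
```

With group_size=2 the last expression is emitted as (ab(g|h)), which matches the same strings as ab(g|h).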
S104: Compile each second regular expression into a finite automaton.
In the embodiments of this specification, each second regular expression may be compiled into a finite automaton by the regular expression engine using a specified algorithm. Specifically, Thompson's construction may be used to compile each second regular expression into a corresponding nondeterministic finite automaton (NFA), and the repeated nodes in the NFA are merged. Alternatively, the powerset construction may be used to compile each second regular expression into a corresponding deterministic finite automaton (DFA), and the repeated nodes in the DFA are merged. Of course, other compilation algorithms may also be used, as long as the second regular expression can be compiled into a finite automaton.
Whether the second regular expression is compiled into an NFA or a DFA, the step of merging repeated nodes is performed by the regular expression engine, through the corresponding algorithm, when it compiles the second regular expression. The purpose is to reuse repeated nodes as far as possible. The NFA case is used as an example below, as shown in Fig. 2 and Fig. 3.
Fig. 2 is a schematic diagram of compiling each first regular expression into a separate NFA. In Fig. 2, if the three first regular expressions ab(c|d), ab(e|f) and ab(g|h) are not combined, each of them must be compiled separately to obtain three corresponding NFAs, and the text is then matched against each of the three NFAs in turn. This is the prior-art method.
Fig. 3 is a schematic diagram of compiling the second regular expression, obtained by combining the first regular expressions, into an NFA. In Fig. 3, the three first regular expressions ab(c|d), ab(e|f) and ab(g|h) have been combined into one second regular expression (ab(c|d))|(ab(e|f))|(ab(g|h)). When this second regular expression is compiled using Thompson's construction, the repeated nodes "a" and "b" are found in the NFA, so the regular expression engine merges these repeated nodes according to the algorithm and obtains the NFA shown in Fig. 3. The text is then matched using the NFA shown in Fig. 3. Each text needs to be matched only once rather than three times, and because the repeated nodes in the NFA have been merged, matching efficiency can be improved further.
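The effect of merging the repeated "a" and "b" nodes can be illustrated with a small Python sketch. It hand-builds a simplified NFA with one branch per first regular expression (an assumption made for brevity, not the exact output of Thompson's construction) and then applies the subset (powerset) construction, which collapses the three copies of "a" and "b" into shared states. All function names are hypothetical.

```python
from collections import deque

EPS = None  # label for epsilon (empty) transitions


def build_union_nfa(branches):
    """Hand-build an NFA for (ab(x|y))|(ab(z|w))|... with one branch per entry."""
    trans = {}          # (state, symbol) -> set of next states
    accepting = set()
    next_id = 1         # state 0 is the common start state
    for last_chars in branches:
        s0, s1, s2, s3 = range(next_id, next_id + 4)
        next_id += 4
        trans.setdefault((0, EPS), set()).add(s0)
        trans[(s0, "a")] = {s1}
        trans[(s1, "b")] = {s2}
        for ch in last_chars:
            trans[(s2, ch)] = {s3}
        accepting.add(s3)
    return trans, accepting, next_id  # next_id equals the number of NFA states


def eps_closure(states, trans):
    """Return all states reachable from `states` through epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in trans.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def subset_construction(trans, accepting, alphabet):
    """Convert the NFA into a DFA whose states are sets of NFA states."""
    start = eps_closure({0}, trans)
    dfa, queue = {}, deque([start])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for ch in alphabet:
            moved = set()
            for s in current:
                moved |= trans.get((s, ch), set())
            if moved:
                target = eps_closure(moved, trans)
                dfa[current][ch] = target
                queue.append(target)
    dfa_accepting = {state for state in dfa if state & accepting}
    return start, dfa, dfa_accepting


def dfa_accepts(start, dfa, dfa_accepting, text):
    """Run the DFA over the whole text and report whether it ends in an accepting state."""
    state = start
    for ch in text:
        state = dfa[state].get(ch)
        if state is None:
            return False
    return state in dfa_accepting


trans, accepting, nfa_state_count = build_union_nfa(["cd", "ef", "gh"])
start, dfa, dfa_accepting = subset_construction(trans, accepting, "abcdefgh")
```

In this sketch the three branches each carry their own "a" and "b" transitions in the NFA, while the resulting DFA has a single path for the shared prefix "ab" followed by one accepting state per original expression.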
S106: Match the text to be matched using the finite automaton to obtain a result.
With the above method, because the first regular expressions are combined into a small number of second regular expressions, the number of finite automata ultimately used by the method provided in this specification is smaller than the number used in the prior art. The number of times each text is matched is therefore reduced, which improves the efficiency of matching text with regular expressions. Moreover, when a second regular expression is compiled into a finite automaton, the compilation algorithm automatically merges repeated nodes, which can further improve matching efficiency.
In addition, in practical application scenarios, a corresponding processing strategy may be set for each first regular expression. When a text is matched, if a result is matched by a certain first regular expression, the processing strategy corresponding to that first regular expression can be used to process the text and/or the result. So that the text and/or the matched result can still be processed in this way after the text is matched using the method of combining first regular expressions shown in Fig. 1, the first regular expressions may be combined in this specification by taking each first regular expression as a subexpression of the second regular expression. That is, after combination, each first regular expression is a subexpression of the second regular expression. For example, the three subexpressions ab(c|d), ab(e|f) and ab(g|h) of the second regular expression (ab(c|d))|(ab(e|f))|(ab(g|h)) in the example above are exactly the three first regular expressions before combination.
Further, before each second regular expression is compiled into a finite automaton, a one-to-one identifier may be set for each subexpression in that second regular expression. This is equivalent to setting a one-to-one identifier for each first regular expression contained in the second regular expression.
Specifically, the identifier that corresponds one-to-one with each subexpression in the second regular expression may be a capturing group. A capturing group essentially saves the content of a subexpression in the second regular expression in a numbered or explicitly named group (usually kept in memory) so that it can be referenced later.
After a corresponding capturing group has been set for each subexpression in the second regular expression, when the second regular expression is compiled, each final accepting state of the compiled finite automaton corresponds to one of the capturing groups, as shown in Fig. 4.
Fig. 4 is a state transition diagram that uses the finite automaton corresponding to the regular expression ab(c|d) as an example. In Fig. 4, states 0, 1, 2, 3 and 4 are all accepting states, but only states 3 and 4 are final accepting states (also called terminals). The state transition table is as follows:
        | a            | b            | c                  | d
State 0 | State 1      | Not accepted | Not accepted       | Not accepted
State 1 | Not accepted | State 2      | Not accepted       | Not accepted
State 2 | Not accepted | Not accepted | State 3 (terminal) | State 4 (terminal)
State 3 | Not accepted | Not accepted | Not accepted       | Not accepted
State 4 | Not accepted | Not accepted | Not accepted       | Not accepted
Table 1
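Table 1 can be read as a table-driven matcher. The following Python sketch encodes the table as a dictionary and walks it one input character at a time; the names TRANSITIONS, FINAL_STATES and run are illustrative assumptions.

```python
# Transition table from Table 1: TRANSITIONS[state][symbol] -> next state.
# A missing entry means the input is not accepted in that state.
TRANSITIONS = {
    0: {"a": 1},
    1: {"b": 2},
    2: {"c": 3, "d": 4},
    3: {},
    4: {},
}
FINAL_STATES = {3, 4}  # the final accepting states (terminals)


def run(text):
    """Return the final accepting state reached by text, or None if it is rejected."""
    state = 0
    for symbol in text:
        state = TRANSITIONS[state].get(symbol)
        if state is None:
            return None
    return state if state in FINAL_STATES else None
```

Because the two alternatives c and d end in different final accepting states, the state returned by run also tells which alternative produced the match.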
Similarly, if capturing groups are set for the subexpressions of the second regular expression (ab(c|d))|(ab(e|f))|(ab(g|h)), then after the second regular expression is compiled into a finite automaton, each final accepting state of the finite automaton also corresponds to at least one capturing group. This identifies that the final accepting state was reached through the subexpression corresponding to that capturing group, and it can thereby be determined that the result matched at that final accepting state was obtained using the subexpression corresponding to that capturing group.
Finally, after a result has been matched using the finite automaton, the first regular expression that matched the result can be determined from the capturing group corresponding to the final accepting state reached when the result was matched. In other words, the subexpression corresponding to that capturing group is exactly the first regular expression that matched the result. The text and/or the result can then be processed according to the processing strategy set in advance for that first regular expression.
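The mapping from a capturing group back to a first regular expression and its processing strategy can be sketched in Python as follows. This is only an illustration under stated assumptions: Python's re module is a backtracking engine rather than the finite automaton described here, and the group names r0, r1, ..., the function names and the example strategies are all hypothetical. The sketch also assumes the first regular expressions contain no named groups or backreferences of their own.

```python
import re


def combine_with_groups(first_patterns):
    """Wrap each first pattern in a named capturing group r0, r1, ... and join with '|'."""
    parts = [f"(?P<r{i}>{p})" for i, p in enumerate(first_patterns)]
    return re.compile("|".join(parts))


def match_and_handle(compiled, first_patterns, strategies, text):
    """Match text once, find which first pattern produced the result, apply its strategy."""
    m = compiled.search(text)
    if m is None:
        return None
    for i, pattern in enumerate(first_patterns):
        if m.group(f"r{i}") is not None:  # this group took part in the match
            return strategies[pattern](text, m.group(0))
    return None


first = ["ab(c|d)", "ab(e|f)", "ab(g|h)"]
# Hypothetical processing strategies, one per first regular expression.
strategies = {
    "ab(c|d)": lambda text, result: ("log", result),
    "ab(e|f)": lambda text, result: ("block", result),
    "ab(g|h)": lambda text, result: ("alert", result),
}
second = combine_with_groups(first)
outcome = match_and_handle(second, first, strategies, "xxabfyy")
```

Only one of the top-level named groups participates in any given match, so checking which group is not None is enough to recover the first regular expression.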
The above is the matching method using regular expressions provided by one or more embodiments of this specification. Based on the same idea, this specification also provides a corresponding matching apparatus using regular expressions, as shown in Fig. 5. The apparatus includes:
a determination module 501, configured to determine each first regular expression;
a combination module 502, configured to combine the first regular expressions to obtain at least one second regular expression, where the number of second regular expressions is less than the number of first regular expressions;
a compilation module 503, configured to compile each second regular expression into a finite automaton;
a matching module 504, configured to match the text to be matched using the finite automaton to obtain a result.
The combination module 502 is specifically configured to concatenate the first regular expressions in any order, with every two adjacent first regular expressions joined by a specified connecting character, to obtain a second regular expression.
Each first regular expression is a subexpression of the second regular expression.
The apparatus further includes:
a setting module 505, configured to set, for each second regular expression, a one-to-one identifier for each subexpression in that second regular expression.
The identifier includes a capturing group.
Each final accepting state of the finite automaton corresponds to at least one capturing group.
The matching module 504 is further configured to determine the first regular expression that matched the result according to the capturing group corresponding to the final accepting state reached when the finite automaton matched the result.
The compilation module 503 is specifically configured to compile the second regular expression into a nondeterministic finite automaton (NFA) using Thompson's construction and merge the repeated nodes in the NFA; or to compile the second regular expression into a deterministic finite automaton (DFA) using the powerset construction and merge the repeated nodes in the DFA.
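One possible way to arrange the modules in software is sketched below in Python. The class name RegexMatchingApparatus and the group_size parameter are hypothetical, and Python's re.compile stands in for the compilation module; it produces a backtracking matcher, not the NFA or DFA described above.

```python
import re


class RegexMatchingApparatus:
    """Illustrative arrangement of modules 501 to 505 as one class."""

    def __init__(self, first_patterns, group_size):
        # Determination module 501: the first regular expressions.
        self.first_patterns = list(first_patterns)
        # Combination module 502 and setting module 505: join with "|" and
        # tag each subexpression with a named capturing group r0, r1, ...
        self.second_patterns = self._combine(group_size)
        # Compilation module 503 (stand-in: Python's own regex compiler).
        self.compiled = [re.compile(p) for p in self.second_patterns]

    def _combine(self, group_size):
        tagged = [f"(?P<r{i}>{p})" for i, p in enumerate(self.first_patterns)]
        return ["|".join(tagged[i:i + group_size])
                for i in range(0, len(tagged), group_size)]

    def match(self, text):
        """Matching module 504: return (first pattern, matched text) or None."""
        for compiled in self.compiled:
            m = compiled.search(text)
            if m:
                index = next(int(name[1:])
                             for name, value in m.groupdict().items()
                             if value is not None)
                return self.first_patterns[index], m.group(0)
        return None


apparatus = RegexMatchingApparatus(["ab(c|d)", "ab(e|f)", "ab(g|h)"], group_size=2)
```

With group_size=2 the three first regular expressions become two second regular expressions, so a text is scanned at most twice instead of three times.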
This specification also correspondingly provides a matching device using regular expressions, as shown in Fig. 6. An application is installed on the device. The device includes one or more memories and one or more processors; the memory stores a program that is configured to be executed by the one or more processors to perform the following steps:
determining each first regular expression;
combining the first regular expressions to obtain at least one second regular expression, where the number of second regular expressions is less than the number of first regular expressions;
compiling each second regular expression into a finite automaton;
matching the text to be matched using the finite automaton to obtain a result.
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor or switch) or a software improvement (an improvement to a method flow). With the development of technology, however, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. Designers program a digital system onto a single PLD themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly carried out with "logic compiler" software, which is similar to the software compiler used in program development. The source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing a controller purely as computer-readable program code, the method steps can be logically programmed so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component. The means for implementing various functions can even be regarded both as software modules implementing a method and as structures within the hardware component.
The systems, apparatuses, modules or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, e-mail device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described with its functions divided into various units. Of course, when this specification is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that the embodiments of this specification may be provided as a method, a system or a computer program product. Accordingly, this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.
This specification is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to one or more embodiments of this specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory.
Memory may include non-permanent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, commodity or device that includes the element.
This specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. One or more embodiments of this specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner. For the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, its description is relatively brief; for the relevant parts, refer to the description of the method embodiment.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The above are merely one or more embodiments of this specification and are not intended to limit this specification. For those skilled in the art, various modifications and variations can be made to one or more embodiments of this specification. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of one or more embodiments of this specification shall fall within the scope of the claims of this specification.