Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments.
The technical terms used in this application are first explained as follows:
Domain name (Domain Name): the name of a computer or group of computers on the Internet, consisting of a string of labels separated by dots. It identifies the electronic location of the computer during data transmission (and sometimes its geographic location; a geographic domain name refers to a local region with administrative autonomy). RFC 1034 and RFC 1035 define domain names and domain name resolution.
URL (Uniform Resource Locator): a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the address of a standard resource on the Internet. Every file on the Internet has a unique URL, which indicates where the file is located and how a browser should handle it. RFC 1738 defines the addressing standard for URLs.
AC algorithm (Aho-Corasick): an automaton-based algorithm for multi-pattern string matching.
UTF-8 (8-bit Unicode Transformation Format, RFC 3629): a variable-length character encoding for Unicode. As defined in RFC 3629, UTF-8 encodes each Unicode character in 1 to 4 bytes (earlier definitions allowed up to 6). It allows Simplified Chinese, Traditional Chinese and other languages (such as English, Japanese and Korean) to be displayed on the same web page.
GB18030: the national standard GB 18030-2005, "Information technology - Chinese coded character set", is the most important Chinese character encoding standard in China after GB 2312-1980 and GB 13000.1-1993, and is one of the basic standards that computer systems in China must comply with. GB18030 has two versions: GB18030-2000 and GB18030-2005. GB18030 supersedes GBK; it is backward compatible with the simplified Chinese characters in GB2312 and with the Chinese minority-script characters included in GBK, and it adds an extended set of CJK characters (Chinese, Japanese and Korean unified ideographs).
ASCII (American Standard Code for Information Interchange): a computer character encoding based on the Latin alphabet, mainly used to represent modern English and other Western European languages.
DFA (Deterministic Finite Automaton): a finite state machine that accepts or rejects finite strings of symbols and that produces a unique computation (run) of the automaton for each input string.
The embodiments of the present invention may use either a centralized or a distributed device deployment scheme. As shown in Fig. 1, when the distributed scheme is used, the application scenario provided by the embodiments includes a main control device 11 and a forwarding device 12. The input of the main control device 11 is the filtering rules configured by the network administrator (for example, the at least one character string to be filtered in this application) together with the extended AC algorithm; its output is the DFA state machine generated by compiling with the extended AC algorithm. The input of the forwarding device 12 is the DFA state machine and a message to be matched, for example an HTTP message, or the Host and URL substrings delimited and parsed out of an HTTP message; its output is the filter action for the message to be matched, for example: pass (when nothing matches), block, redirect or log. With distributed deployment, the main control device 11 and the forwarding device 12 are deployed on different boards. With centralized deployment, the main control device 11 and the forwarding device 12 may be deployed in the same centralized device, with their functions implemented in separate processes.
Based on the above implementation scenario, an embodiment of this application provides a data filtering method which, as shown in Fig. 2, includes the following steps:
101. The main control device obtains at least one character string to be filtered, where the character string to be filtered includes at least a domain name or a URL keyword.
The character string to be filtered may be entered by a network administrator. Because network administrators are familiar with different languages, the domain names or URL keywords to be filtered are usually written in different languages as well. To support multilingual input, after the character string to be filtered has been entered, the main control device converts it into characters that conform to ASCII rules and performs step 102 on those characters. For example, the main control device first converts the character string entered by the network administrator into its ASCII form. If the character string is in an Indo-European language, for example when filtering "sports", only the single keyword sports is added to the AC tree corresponding to the DFA state machine. If the character string is Chinese, for example when filtering "专利" ("patent"), the keyword added to the AC tree must be converted into both encodings in which it may appear in a URL, UTF-8 and GB18030 (UTF-8: %E4%B8%93%E5%88%A9; GB18030: %D7%A8%C0%FB). In this way, if "专利" appears in the URL of a message to be matched, it can be matched successfully without decoding the URL.
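The conversion described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `url_encoded_variants` and its parameters are assumptions made for the example and do not appear in the embodiment; `urllib.parse.quote` is the standard-library percent-encoding routine.

```python
from urllib.parse import quote

def url_encoded_variants(keyword, encodings=("utf-8", "gb18030")):
    """Return the distinct percent-encoded forms of a keyword, one per encoding.

    Hypothetical helper: the embodiment only states that a non-ASCII keyword
    is added to the AC tree once per possible URL encoding.
    """
    variants = []
    for enc in encodings:
        form = quote(keyword, safe="", encoding=enc)
        if form not in variants:  # pure-ASCII keywords encode identically
            variants.append(form)
    return variants

print(url_encoded_variants("sports"))  # a single ASCII keyword
print(url_encoded_variants("专利"))     # one form per encoding
```

An ASCII keyword yields one entry, while the Chinese keyword yields the two percent-encoded forms quoted in the text, each of which would be added to the AC tree as an ordinary ASCII pattern.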
102. The main control device compiles the at least one character string to be filtered into a DFA state machine using the AC algorithm and sends the DFA state machine to the forwarding device. The AC tree corresponding to the DFA state machine contains a wildcard; the wildcard is placed between any two adjacent state nodes in the AC tree and matches any character.
The AC algorithm in step 102 is an extended AC algorithm. It builds the tree (the AC tree) in a way similar to the prior art, except that during construction the wildcard "*" is treated as a special input character and is placed between any two adjacent state nodes in the AC tree. Referring to Fig. 3, the wildcard "*" is placed between state nodes 0 and 2. In addition, the longest-prefix reuse principle is still applied.
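As a rough illustration of this construction, the following Python sketch builds a tree in which "*" is inserted like any other input character and common prefixes are reused. The names `StateNode` and `build_ac_tree` are assumptions made for the example, and the state numbering it produces is simply insertion order; it is not claimed to reproduce the numbering in Fig. 3.

```python
WILDCARD = "*"

class StateNode:
    """One state of the tree; edges are labelled by a character or WILDCARD."""
    def __init__(self, state_id):
        self.state_id = state_id
        self.children = {}  # input character (or WILDCARD) -> StateNode

def build_ac_tree(patterns):
    """Insert every pattern into one tree, reusing the longest existing prefix."""
    root = StateNode(0)
    count = 1  # number of states created so far, also the next free id
    for pattern in patterns:
        node = root
        for ch in pattern:  # "*" is handled as a special input character
            if ch not in node.children:
                node.children[ch] = StateNode(count)
                count += 1
            node = node.children[ch]
    return root, count

root, total = build_ac_tree(["sport*", "*sina*", "*sohu*"])
print(total)                  # number of state nodes, including the root
print(sorted(root.children))  # edges leaving state 0
```

Because "*sina*" and "*sohu*" share the prefix "*s", the second pattern reuses those two states and only adds states for its remaining characters.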
In addition, the main control device uses the AC algorithm to configure "terminating final state nodes" in the AC tree according to the character strings to be filtered. A terminating final state node instructs the forwarding device to stop filtering the message to be matched. For example, if the filter action configured for a target string is "block", then once that target string has been matched in the message to be matched, any further inspection of the message is pointless and detection should exit as early as possible; such a target string is marked as a "terminating final state". As shown in Fig. 3, sport*, *sina* and *qq* require "block", so state nodes 16, 17 and 10 are configured as terminating final state nodes.
In another embodiment, the main control device uses the AC algorithm to configure absorbing state nodes in the AC tree according to the character strings to be filtered, where Link(s->t) = "*", t is the absorbing state node, s is the parent state node of the absorbing state node, and * is the wildcard. When the DFA state machine is matched, an absorbing state node can absorb any number of characters; state node 2 in Fig. 3 is an example. If an absorbing node is itself a final state, it is deleted from the tree and its parent state node becomes the final state.
103. The forwarding device receives the message to be matched and filters it according to the AC tree of the DFA state machine.
Specifically, with the DFA state machine configured by the main control device in step 102, when a "terminating final state node" is matched, step 103 is as follows: when any character matched in the message to be matched according to the AC tree of the DFA state machine lies immediately before or after a terminating final state node, the forwarding device stops filtering the message to be matched. As shown in Fig. 3, sport*, *sina* and *qq* require "block", so once the characters preceding state nodes 16, 17 and 10 have been matched, filtering of the message to be matched stops.
With the DFA state machine configured by the main control device in step 102, when an "absorbing state node" is matched, the forwarding device, following the AC tree of the DFA state machine, matches the characters after the absorbing state node once any characters in the message to be matched have been matched.
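The behaviour of terminating final states and absorbing states can be illustrated with the following Python sketch. It is a simplified stand-in under stated assumptions, not the compiled DFA of the embodiment: it tracks the set of currently active tree states directly, assumes that a pattern without a leading "*" is anchored at the start of the delimited string, assumes that "*" may also match zero characters, and assumes that "block" takes precedence over other actions. All names (`Node`, `build`, `expand`, `match`) are hypothetical.

```python
WILDCARD = "*"

class Node:
    def __init__(self, absorbing=False):
        self.children = {}          # character (or WILDCARD) -> Node
        self.absorbing = absorbing  # True if reached through a "*" edge
        self.actions = set()        # actions of the patterns that end here

def build(rules):
    """Build the tree from (pattern, action) pairs, sharing common prefixes."""
    root = Node()
    for pattern, action in rules:
        node = root
        for ch in pattern:
            if ch not in node.children:
                node.children[ch] = Node(absorbing=(ch == WILDCARD))
            node = node.children[ch]
        node.actions.add(action)
    return root

def expand(states):
    """Deduplicate states and add each state's "*" child (zero-length match)."""
    result, seen, work = [], set(), list(states)
    while work:
        state = work.pop()
        if id(state) in seen:
            continue
        seen.add(id(state))
        result.append(state)
        if WILDCARD in state.children:
            work.append(state.children[WILDCARD])
    return result

def match(root, text):
    """Return the set of actions triggered by text (empty set: no rule hit)."""
    states = expand([root])
    for ch in text:
        following = []
        for state in states:
            if state.absorbing:
                following.append(state)               # absorbs any character
            if ch != WILDCARD and ch in state.children:
                following.append(state.children[ch])  # ordinary edge
        states = expand(following)
        # Terminating final state: a "block" pattern ending in "*" has
        # matched, so the rest of the text cannot change the outcome.
        if any(s.absorbing and "block" in s.actions for s in states):
            return {"block"}
        if not states:
            break
    found = set()
    for state in states:
        found |= state.actions
    return found

rules = [("sport*", "block"), ("*sina*", "block"), ("*qq*", "block"),
         ("*sohu*", "log"), ("*inna*", "log")]
tree = build(rules)
print(match(tree, "www.sina.com.cn"))
print(match(tree, "news.sohu.com"))
print(match(tree, "eip.maipu.com"))
```

With the rule set used in the example of Fig. 3, a host containing "sina" returns the block action as soon as the keyword has been consumed, a host containing "sohu" yields only the log action after the whole string has been read, and a host that matches no rule yields an empty set.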
Furthermore, because the wildcard "*" has been introduced, if the network administrator configures "contains" matching semantics, "*" is automatically added before and after the configured character string. After any characters have been matched, the characters after the absorbing state node can then be matched directly, without setting up the failure-state mapping of the prior art.
In addition, when the DFA state machine is computed, the output for input character c in state node s is not unique but is a set (which may be empty or may contain several results). Taking as an example the case in which sport*, *sina* and *qq* require "block" and *sohu* and *inna* require logging, the compiled AC tree of the DFA state machine is shown in Fig. 3. When a message to be matched is matched, the matching process is as shown in the following table:
In addition, after receiving the message to be matched, the forwarding device may first delimit the content to be matched and then filter it. Specifically, the forwarding device obtains at least one character string to be matched from each line of the message to be matched according to the length of the character string to be filtered, and then filters the at least one character string to be matched according to the AC tree of the DFA state machine.
An example follows. The efficiency of the AC algorithm (in DFA mode) is independent of the number of search keywords and depends only on the length of the target string to be searched (the character string to be filtered). Therefore the message to be matched (which may be the incoming HTTP message) can first be delimited and parsed line by line at each \n (line feed), and only the resulting, much shorter character strings to be matched are passed to the DFA state machine of the AC algorithm, which greatly improves matching efficiency. Taking the following message to be matched as an example, after delimiting and parsing, the only character string to be matched (the target text to be searched) is "eip.maipu.com". Searching this string is far more efficient than searching the entire message to be matched.
GET / HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, */*
Accept-Language: zh-cn
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)
Accept-Encoding: gzip, deflate
Host: eip.maipu.com
Connection: Keep-Alive
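The delimiting step applied to a message like the one above can be sketched in Python as follows. This is an assumption-laden illustration rather than the forwarding device's actual parser: the function name `delimit_targets` and the returned dictionary keys are hypothetical, and only the request target and the Host header are extracted.

```python
def delimit_targets(raw_message):
    """Split an HTTP request on line feeds and keep only the strings to match.

    Returns a dict with the request target ("url") and the Host header value
    ("host") when they are present. Hypothetical helper for illustration.
    """
    lines = raw_message.replace("\r\n", "\n").split("\n")
    targets = {}
    request_line = lines[0].split()  # e.g. ["GET", "/", "HTTP/1.1"]
    if len(request_line) >= 2:
        targets["url"] = request_line[1]
    for line in lines[1:]:
        name, sep, value = line.partition(":")  # split at the first colon only
        if sep and name.strip().lower() == "host":
            targets["host"] = value.strip()
            break
    return targets

message = (
    "GET / HTTP/1.1\r\n"
    "Accept-Language: zh-cn\r\n"
    "Host: eip.maipu.com\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
)
print(delimit_targets(message))
```

Only the short strings returned here would then be passed to the state machine, instead of the full request text.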
In the above solution, the main control device obtains at least one character string to be filtered, where the character string to be filtered includes at least one of the following: a domain name, a URL keyword. The main control device compiles the at least one character string to be filtered into a DFA state machine using the AC algorithm and sends the DFA state machine to the forwarding device. The AC tree corresponding to the DFA state machine contains a wildcard, which is placed between any two adjacent state nodes in the AC tree and matches any character. The forwarding device receives the message to be matched and filters it according to the AC tree of the DFA state machine. In this way, wildcard filtering can be applied directly while data filtering performance is maintained.
Referring to Fig. 4, a main control device is provided, including:
an acquiring unit 41, configured to obtain at least one character string to be filtered, where the character string to be filtered includes at least a domain name or a URL keyword;
a processing unit 42, configured to compile the at least one character string to be filtered obtained by the acquiring unit 41 into a DFA state machine using the AC algorithm, where the AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is placed between any two adjacent state nodes in the AC tree, and the wildcard matches any character;
a sending unit 43, configured to send the DFA state machine to the forwarding device, so that the forwarding device filters the message to be matched according to the AC tree of the DFA state machine.
In one exemplary solution, the processing unit 42 is further configured to convert the characters to be filtered obtained by the acquiring unit 41 into characters conforming to ASCII rules, so that the DFA state machine is compiled from the ASCII characters using the AC algorithm.
In one exemplary solution, the processing unit 42 is specifically configured to use the AC algorithm to configure an absorbing state node in the AC tree according to the characters to be filtered, where Link(s->t) = "*", t is the absorbing state node, s is the parent state node of the absorbing state node, and * is the wildcard.
In one exemplary solution, the processing unit 42 is specifically configured to use the AC algorithm to configure a terminating final state node in the AC tree according to the characters to be filtered; the terminating final state node instructs the forwarding device to stop filtering the message to be matched.
Referring to Fig. 5, a forwarding device is provided, including:
a receiving unit 51, configured to receive a message to be matched and a DFA state machine, where the AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is placed between any two adjacent state nodes in the AC tree, and the wildcard matches any character;
a processing unit 52, configured to filter the message to be matched according to the AC tree of the DFA state machine received by the receiving unit.
In one exemplary solution, the processing unit 52 is further configured to obtain at least one character string to be matched from each line of the message to be matched according to the length of the character string to be filtered, so as to filter the at least one character string to be matched according to the AC tree of the DFA state machine.
In one exemplary solution, the processing unit 52 is specifically configured to match, according to the AC tree of the DFA state machine, the characters after the absorbing state node once any characters in the message to be matched have been matched, where Link(s->t) = "*", t is the absorbing state node, s is the parent state node of the absorbing state node, and * is the wildcard.
In one exemplary solution, the processing unit 52 is specifically configured to stop filtering the message to be matched when any character matched in the message according to the AC tree of the DFA state machine lies immediately before or after a terminating final state node; the terminating final state node instructs the forwarding device to stop filtering the message to be matched.
The above main control device and forwarding device are used to implement the data filtering method of the above embodiment, so the technical effects they produce are the same as those of the method embodiment and are not repeated here.
It should be noted that the acquiring unit and the processing unit may be processors set up separately in the main control device, or may be integrated into one of the processors of the controller. They may also be stored as program code in the memory of the controller, with one of the processors of the controller calling and executing the functions of the above units. The processor described here may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits configured to implement the embodiments of this application. The sending unit may be an interface circuit or a data sending apparatus. Similarly, the processing unit in the forwarding device may be a separately established processor, or may be stored as program code in the memory of the controller and called and executed by one of the processors of the controller. The receiving unit may be an interface circuit or a data receiving apparatus.
An embodiment of this application further provides a computer storage medium storing one or more programs, where the one or more programs include instructions that, when executed by a computer, cause the computer to perform the method shown in Fig. 2.
In addition, a computer program product is also provided, including the above computer-readable medium (or media).
It should be understood that, in the various embodiments of the present invention, the sequence numbers of the above processes do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present invention in any way.
Those of ordinary skill in the art will appreciate that the exemplary units and algorithm steps described with reference to the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
It is apparent to those skilled in the art that for convenience and simplicity of description, the system of foregoing description,
The specific work process of device and unit can refer to the corresponding process in preceding method embodiment, and details are not described herein.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.