CN108170812A - Data filtering method and device - Google Patents

Publication number
CN108170812A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711479497.XA
Other languages
Chinese (zh)
Other versions
CN108170812B (en)
Inventor
范浩
张勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Maipu Communication Technology Co Ltd
Original Assignee
Maipu Communication Technology Co Ltd
Application filed by Maipu Communication Technology Co Ltd
Priority to CN201711479497.XA
Publication of CN108170812A
Application granted
Publication of CN108170812B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Embodiments of the present invention provide a data filtering method and device in the field of network data processing that apply wildcard filtering directly while preserving data filtering performance. The method includes: a main control device obtains at least one character string to be filtered, where the character string to be filtered includes at least one of the following: a domain name, a URL keyword; the main control device compiles the at least one character string to be filtered into a DFA state machine by means of the AC algorithm and sends the DFA state machine to a forwarding unit, where the AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is placed between any two adjacent state nodes in the AC tree, and the wildcard matches any character; the forwarding unit receives a message to be matched and filters the message to be matched according to the AC tree of the DFA state machine.

Description

Data filtering method and device
Technical field
Embodiments of the present invention relate to the field of network data processing, and in particular to a data filtering method and device.
Background
Uniform resource locator (URL) filtering is a common requirement in the field of deep security inspection. The basic customer requirement is to identify a specific URL and then perform a specific action on the message, such as blocking, redirecting, or logging. For example, if a network administrator does not want intranet users to access www.taobao.com during working hours, domain name filtering is used. If the network administrator wants to prevent intranet users from ever accessing URLs containing "sports" or "sport", URL filtering is used. A network administrator's URL filtering configuration may require thousands of entries. A basic algorithm for meeting this requirement is the AC algorithm for multi-pattern matching.
In the prior art, because filtering keywords are so varied, users often cannot state exactly which domain names they want to filter and therefore usually need wildcards. However, the AC algorithm is not compatible with the wildcard "*". In practice, a rule containing "*" is therefore either matched serially or converted into regular expression semantics and handled by regular expression matching. Because the wildcard "*" cannot be used directly with the AC algorithm, data filtering performance suffers.
In addition, given the basic requirements of URL filtering, the configuration methods provided in the prior art are mostly based on fixed printable ASCII characters, because the encoding of a domain name or URL does restrict it to visible ASCII characters. As a result, data filtering cannot adapt to multilingual requirements.
Summary of the invention
Embodiments of the present invention provide a data filtering method and device that apply wildcard filtering directly while preserving data filtering performance.
In a first aspect, a data filtering method is provided, including:
A main control device obtains at least one character string to be filtered, where the character string to be filtered includes at least a domain name or a URL keyword;
The main control device compiles the at least one character string to be filtered into a DFA state machine by means of the AC algorithm, and sends the DFA state machine to a forwarding unit; the AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is placed between any two adjacent state nodes in the AC tree, and the wildcard matches any character;
The forwarding unit receives a message to be matched and filters the message to be matched according to the AC tree of the DFA state machine.
In a second aspect, a main control device is provided, including:
An acquiring unit, configured to obtain at least one character string to be filtered, where the character string to be filtered includes at least a domain name or a URL keyword;
A processing unit, configured to compile the at least one character string to be filtered obtained by the acquiring unit into a DFA state machine by means of the AC algorithm, where the AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is placed between any two adjacent state nodes in the AC tree, and the wildcard matches any character;
A transmitting unit, configured to send the DFA state machine to a forwarding unit, so that the forwarding unit filters the message to be matched according to the AC tree of the DFA state machine.
In a third aspect, a forwarding unit is provided, including:
A receiving unit, configured to receive a message to be matched and a DFA state machine, where the AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is placed between any two adjacent state nodes in the AC tree, and the wildcard matches any character;
A processing unit, configured to filter the message to be matched according to the AC tree of the DFA state machine received by the receiving unit.
In the above solution, the main control device obtains at least one character string to be filtered, where the character string to be filtered includes at least a domain name or a URL keyword; it compiles the at least one character string to be filtered into a DFA state machine by means of the AC algorithm and sends the DFA state machine to the forwarding unit. The AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is placed between any two adjacent state nodes in the AC tree, and the wildcard matches any character. The forwarding unit receives a message to be matched and filters it according to the AC tree of the DFA state machine. In this way, wildcard filtering can be applied directly while data filtering performance is preserved.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a structural diagram of an application scenario provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of a data filtering method provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of an AC tree provided by an embodiment of the present invention;
Fig. 4 is a structural diagram of a main control device provided by an embodiment of the present invention;
Fig. 5 is a structural diagram of a forwarding unit provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention.
The technical terms used in this application are first described as follows:
Domain name (Domain Name): the name of a computer or group of computers on the Internet, made up of a string of names separated by dots, used to identify the electronic location of the computer during data transmission (sometimes also a geographical location; a geographical domain name refers to a local area with administrative autonomy). RFC 1034 and RFC 1035 define domain names and domain name resolution.
URL (Uniform Resource Locator): a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the address of a standard resource on the Internet. Each file on the Internet has a unique URL, which contains information indicating where the file is located and how a browser should handle it. RFC 1738 defines the addressing standard for URLs.
AC algorithm (Aho-Corasick): an automaton-based algorithm for multi-pattern matching.
UTF-8 (8-bit Unicode Transformation Format, RFC 3629): a variable-length character encoding for Unicode. UTF-8 encodes each Unicode character in 1 to 4 bytes under RFC 3629 (earlier definitions allowed up to 6 bytes). It allows simplified Chinese, traditional Chinese, and other languages (such as English, Japanese, and Korean) to be displayed on the same web page.
GB18030: the national standard GB 18030-2005, "Information technology - Chinese coded character set", is the most important Chinese character encoding standard in China after GB2312-1980 and GB13000.1-1993, and is one of the basic standards that computer systems in China must comply with. GB18030 has two versions: GB18030-2000 and GB18030-2005. GB18030 supersedes GBK. It is backward compatible with the simplified Chinese characters in GB2312 and with the characters included in GBK, covers the scripts of China's ethnic minorities, and adds further CJK characters (China-Japan-Korea unified ideographs).
ASCII (American Standard Code for Information Interchange): a computer encoding system based on the Latin alphabet, mainly used to represent modern English and other Western European languages.
DFA (Deterministic Finite Automaton): a finite state machine that accepts or rejects finite strings of symbols and that has exactly one computation (run) for each input string.
Embodiments of the present invention may use a centralized or a distributed device deployment scheme. As shown in Fig. 1, with a distributed deployment the application scenario provided by the embodiments includes a main control device 11 and a forwarding unit 12. The inputs of the main control device 11 are the filtering rules configured by the network administrator (in this application, the at least one character string to be filtered) and the extended AC algorithm; its output is the DFA state machine compiled by the extended AC algorithm. The inputs of the forwarding unit 12 are the DFA state machine and the message to be matched; for example, the message to be matched may be an HTTP message, or the Host and URL substrings delimited by parsing an HTTP message. Its output is the filter action applied to the message to be matched, for example: pass when nothing matches, block, redirect, or log. With a distributed deployment, the main control device 11 and the forwarding unit 12 are deployed on different boards. With a centralized deployment, the main control device 11 and the forwarding unit 12 may be deployed on one centralized device, with their functions implemented in separate processes.
Based on the above implementation scenario, an embodiment of the present application provides a data filtering method which, referring to Fig. 2, includes the following steps:
101. The main control device obtains at least one character string to be filtered, where the character string to be filtered includes at least a domain name or a URL keyword.
The character strings to be filtered can be entered by the network administrator. Network administrators are familiar with different languages, and the domain names or URL keywords to be filtered are usually in different languages as well. To accommodate multilingual entry, after a character string to be filtered is entered, the main control device converts its characters into characters that conform to ASCII rules, and then performs step 102 on those characters. For example, the main control device first converts the character string entered by the network administrator into ASCII form. If the character string to be filtered is in an Indo-European language, for example "sports", only the single keyword sports is added to the AC tree corresponding to the DFA state machine. If the character string to be filtered is Chinese, for example "patent" (专利), the characters added to the AC tree corresponding to the DFA state machine must be converted into the encodings that may appear in a URL under the two character codes UTF-8 and GB18030 (UTF-8: %E4%B8%93%E5%88%A9; GB18030: %D7%A8%C0%FB). In this way, if "patent" appears in the URL of a message to be matched, the match succeeds without decoding.
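The conversion in step 101 can be illustrated with a short Python sketch. It assumes that "characters of ASCII rules" means URL percent-encoding; the function name `url_encoded_variants` is hypothetical and does not appear in the patent.

```python
from urllib.parse import quote


def url_encoded_variants(keyword: str) -> list[str]:
    """Return the distinct percent-encoded forms of a keyword.

    Assumption: each keyword is expanded under UTF-8 and GB18030, the two
    encodings named in the text, and every resulting ASCII string would be
    added to the AC tree as a separate pattern.
    """
    variants = []
    for encoding in ("utf-8", "gb18030"):
        encoded = quote(keyword, safe="", encoding=encoding)
        if encoded not in variants:  # pure-ASCII keywords yield one form only
            variants.append(encoded)
    return variants


print(url_encoded_variants("sports"))
print(url_encoded_variants("专利"))
```

An ASCII keyword produces a single pattern, while a Chinese keyword produces one pattern per encoding, so the forwarding side can match raw URL bytes without decoding them first.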
102. The main control device compiles the at least one character string to be filtered into a DFA state machine by means of the AC algorithm, and sends the DFA state machine to the forwarding unit; the AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is placed between any two adjacent state nodes in the AC tree, and the wildcard matches any character.
The AC algorithm in step 102 is an extended AC algorithm. It builds the tree (the AC tree) in the same way as the prior art, except that during construction the wildcard "*" is treated as a special input character and placed between any two adjacent state nodes in the AC tree. Referring to Fig. 3, the wildcard "*" is configured between state nodes 0 and 2. In addition, the maximum-prefix reuse principle still applies.
In addition, the main control device uses the AC algorithm to configure a "terminating final-state node" in the AC tree according to the characters to be filtered. The terminating final-state node instructs the forwarding unit to stop filtering the message to be matched. For example, if the filter action configured for a target string is "block", then once that target string has been matched in the message to be matched, further inspection of the message is pointless and detection should exit as early as possible; such a target string is marked as a "terminating final state". As shown in Fig. 3, sport*, *sina* and *qq* require "block", so state nodes 16, 17 and 10 are configured as terminating final-state nodes.
In another embodiment, the main control device uses the AC algorithm to configure an absorbing state node in the AC tree according to the characters to be filtered, where Link(s->t) = "*", t is the absorbing state node, s is the parent state node of the absorbing state node, and * is the wildcard. When matching against the DFA state machine, an absorbing state node can absorb any number of characters, for example state node 2 in Fig. 3. If an absorbing node is itself a final state, it is deleted from the tree and its parent state node becomes the final state.
103. The forwarding unit receives a message to be matched and filters the message to be matched according to the AC tree of the DFA state machine.
Specifically, with the DFA state machine configured by the main control device in step 102, when a "terminating final-state node" is matched, step 103 is as follows: once the characters matched in the message to be matched according to the AC tree of the DFA state machine reach a terminating final-state node, the forwarding unit stops filtering the message to be matched. As shown in Fig. 3, sport*, *sina* and *qq* require "block", so when the characters leading to state node 16, 17 or 10 have been matched, filtering of the message stops.
With the DFA state machine configured by the main control device in step 102, when an "absorbing state node" is matched, the forwarding unit, according to the AC tree of the DFA state machine, matches the characters that follow the absorbing state node after any characters in the message to be matched have been matched.
Furthermore, because the wildcard "*" has been introduced, if the network administrator configures "contains" matching semantics, "*" is automatically added before and after the configured character string, so that after any characters have been matched, the characters following the absorbing state node are matched directly, without setting up the failure-state mapping of the prior art.
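The automatic wrapping for "contains" semantics might look like the following Python sketch. The helper name `as_contains_pattern` is hypothetical, and the sketch assumes that a "*" already present at either end is not duplicated.

```python
def as_contains_pattern(keyword: str) -> str:
    """Wrap a configured keyword in '*' on both sides for 'contains' matching.

    Assumption: an existing leading or trailing '*' is kept as-is rather
    than doubled.
    """
    pattern = keyword
    if not pattern.startswith("*"):
        pattern = "*" + pattern
    if not pattern.endswith("*"):
        pattern = pattern + "*"
    return pattern


print(as_contains_pattern("sina"))
```

With the leading "*" in place, the root's absorbing state stays active for every input character, which is the role that failure links play in the classic AC automaton.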
In addition, when the DFA state machine is computed, the output for input character c in state node s is not unique but is a set (the result may be the empty set or may contain several states). Taking sport*, *sina* and *qq* as requiring "block" and *sohu* and *inna* as requiring logging, the compiled AC tree of the DFA state machine is shown in Fig. 3. When a message to be matched is matched, the matching process is as shown in the following table:
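Independently of the table referred to above, the set-valued transition can be illustrated with a small Python sketch. This is not the compiled DFA of Fig. 3: it simulates a set of active positions per pattern, which shows the absorbing behaviour of "*" and the early exit on a "block" rule. The names `closure`, `step`, `filter_text` and `RULES` are hypothetical, and the sketch assumes that "*" absorbs zero or more characters and that each pattern is matched against the whole delimited string.

```python
def closure(pattern, positions):
    """Expand a set of positions: '*' may absorb zero characters, so a
    position sitting on '*' also activates the position after it."""
    result, stack = set(), list(positions)
    while stack:
        pos = stack.pop()
        if pos in result:
            continue
        result.add(pos)
        if pos < len(pattern) and pattern[pos] == "*":
            stack.append(pos + 1)
    return result


def step(pattern, positions, ch):
    """Consume one input character; the result is a set and may be empty."""
    moved = set()
    for pos in positions:
        if pos == len(pattern):
            continue  # pattern exhausted, extra input cannot be consumed
        if pattern[pos] == "*":
            moved.add(pos)  # absorbing state: stay on the wildcard
        elif pattern[pos] == ch:
            moved.add(pos + 1)
    return closure(pattern, moved)


def filter_text(rules, text):
    """Return {pattern: action} for the rules that match text."""
    active = {p: closure(p, {0}) for p in rules}
    for ch in text:
        for p, action in rules.items():
            active[p] = step(p, active[p], ch)
            # Early exit: a 'block' rule ending in '*' can no longer fail.
            if action == "block" and p.endswith("*") and len(p) in active[p]:
                return {p: action}
    return {p: a for p, a in rules.items() if len(p) in active[p]}


RULES = {"sport*": "block", "*sina*": "block", "*qq*": "block",
         "*sohu*": "log", "*inna*": "log"}
print(filter_text(RULES, "www.sina.com"))
```

Per-pattern simulation costs time proportional to the number of rules for each character. A compiled DFA, as the text describes, would merge these sets into single states so that the per-character cost no longer depends on the rule count.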
In addition, after receiving the message to be matched, the forwarding unit may delimit the content to be matched before filtering it. Specifically, the forwarding unit obtains at least one character string to be matched from the lines of the message to be matched according to the length of the character string to be filtered, and then filters the at least one character string to be matched according to the AC tree of the DFA state machine.
An example follows. The efficiency of the AC algorithm (in DFA mode) is independent of the number of search keywords and depends only on the length of the target string to be searched. Therefore, a message to be matched (for example, an incoming HTTP message) can first be delimited and parsed line by line at each \n (line feed), and only the resulting narrower character string to be matched is passed to the DFA state machine of the AC algorithm, which greatly improves matching efficiency. Taking the following message to be matched as an example, after delimiting and parsing, the character string to be matched (the target text) is only "eip.maipu.com". This is far more efficient than searching the entire message to be matched.
GET / HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, */*
Accept-Language: zh-cn
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)
Accept-Encoding: gzip, deflate
Host: eip.maipu.com
Connection: Keep-Alive
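The delimiting step can be sketched in Python as follows. The function name `delimit_targets` and the returned dictionary keys are hypothetical; the sketch only assumes that the request target and the Host header are the strings handed to the matcher.

```python
def delimit_targets(raw_request: str) -> dict:
    """Split an HTTP request into lines and keep only the strings worth
    matching: the request target and the Host header value."""
    lines = raw_request.replace("\r\n", "\n").split("\n")
    targets = {}
    request_line = lines[0].split() if lines else []
    if len(request_line) >= 2:
        targets["url"] = request_line[1]
    for line in lines[1:]:
        if line == "":
            break  # blank line ends the header section
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            targets["host"] = value.strip()
            break
    return targets


request = ("GET / HTTP/1.1\r\n"
           "Accept-Language: zh-cn\r\n"
           "Host: eip.maipu.com\r\n"
           "Connection: Keep-Alive\r\n\r\n")
print(delimit_targets(request))
```

Only the short delimited strings are then fed to the state machine, so long headers such as User-Agent never reach the matcher.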
In the above solution, the main control device obtains at least one character string to be filtered, where the character string to be filtered includes at least one of the following: a domain name, a URL keyword; it compiles the at least one character string to be filtered into a DFA state machine by means of the AC algorithm and sends the DFA state machine to the forwarding unit. The AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is placed between any two adjacent state nodes in the AC tree, and the wildcard matches any character. The forwarding unit receives a message to be matched and filters it according to the AC tree of the DFA state machine. In this way, wildcard filtering can be applied directly while data filtering performance is preserved.
Referring to Fig. 4, a main control device is provided, including:
An acquiring unit 41, configured to obtain at least one character string to be filtered, where the character string to be filtered includes at least a domain name or a URL keyword;
A processing unit 42, configured to compile the at least one character string to be filtered obtained by the acquiring unit 41 into a DFA state machine by means of the AC algorithm, where the AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is placed between any two adjacent state nodes in the AC tree, and the wildcard matches any character;
A transmitting unit 43, configured to send the DFA state machine to a forwarding unit, so that the forwarding unit filters the message to be matched according to the AC tree of the DFA state machine.
In an exemplary solution, the processing unit 42 is further configured to convert the characters to be filtered obtained by the acquiring unit 41 into characters that conform to ASCII rules, so that the DFA state machine is compiled from those characters by means of the AC algorithm.
In an exemplary solution, the processing unit 42 is specifically configured to use the AC algorithm to configure an absorbing state node in the AC tree according to the characters to be filtered, where Link(s->t) = "*", t is the absorbing state node, s is the parent state node of the absorbing state node, and * is the wildcard.
In an exemplary solution, the processing unit 42 is specifically configured to use the AC algorithm to configure a terminating final-state node in the AC tree according to the characters to be filtered; the terminating final-state node instructs the forwarding unit to stop filtering the message to be matched.
Referring to Fig. 5, a forwarding unit is provided, including:
A receiving unit 51, configured to receive a message to be matched and a DFA state machine, where the AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is placed between any two adjacent state nodes in the AC tree, and the wildcard matches any character;
A processing unit 52, configured to filter the message to be matched according to the AC tree of the DFA state machine received by the receiving unit.
In an exemplary solution, the processing unit 52 is further configured to obtain at least one character string to be matched from the lines of the message to be matched according to the length of the character string to be filtered, so that the at least one character string to be matched is filtered according to the AC tree of the DFA state machine.
In an exemplary solution, the processing unit 52 is specifically configured to match, according to the AC tree of the DFA state machine, the characters that follow the absorbing state node after any characters in the message to be matched have been matched, where Link(s->t) = "*", t is the absorbing state node, s is the parent state node of the absorbing state node, and * is the wildcard.
In an exemplary solution, the processing unit 52 is specifically configured to stop filtering the message to be matched once the characters matched in the message according to the AC tree of the DFA state machine reach a terminating final-state node; the terminating final-state node instructs the forwarding unit to stop filtering the message to be matched.
The main control device and the forwarding unit above are used to carry out the data filtering method of the above embodiment, so their technical effects are the same as those of the method embodiment and are not repeated here.
It should be noted that the acquiring unit and the processing unit may be processors set up separately in the main control device, or may be integrated into one of the controller's processors. They may also be stored as program code in the controller's memory, to be called by one of the controller's processors to perform the functions of the units above. The processor described here may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The transmitting unit may be an interface circuit or a data sending device. Similarly, the processing unit in the forwarding unit may be a separately set up processor, or may be stored as program code in the controller's memory and called by one of the controller's processors to perform the functions of the units above. The receiving unit may be an interface circuit or a data receiving device.
An embodiment of the present application also provides a computer storage medium storing one or more programs; the one or more programs include instructions that, when executed by a computer, cause the computer to perform the method of Fig. 2.
In addition, a computer program product is provided, including the above computer-readable medium (or media).
It should be understood that in various embodiments of the present invention, the size of the serial number of above-mentioned each process is not meant to perform suitable The priority of sequence, the execution sequence of each process should be determined with its function and internal logic, without the implementation of the reply embodiment of the present invention Process forms any restriction.
Those of ordinary skill in the art may realize that each exemplary lists described with reference to the embodiments described herein Member and algorithm steps can be realized with the combination of electronic hardware or computer software and electronic hardware.These functions are actually It is performed with hardware or software mode, specific application and design constraint depending on technical solution.Professional technician Described function can be realized using distinct methods to each specific application, but this realization is it is not considered that exceed The scope of the present invention.
It is apparent to those skilled in the art that for convenience and simplicity of description, the system of foregoing description, The specific work process of device and unit can refer to the corresponding process in preceding method embodiment, and details are not described herein.
In the several embodiments provided herein, it should be understood that the disclosed system, device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or of another form.
The unit illustrated as separating component may or may not be physically separate, be shown as unit The component shown may or may not be physical unit, you can be located at a place or can also be distributed to multiple In network element.Some or all of unit therein can be selected according to the actual needs to realize the mesh of this embodiment scheme 's.
In addition, each functional unit in each embodiment of the present invention can be integrated in a processing unit, it can also That each unit is individually physically present, can also two or more units integrate in a unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
The above description is merely a specific embodiment, but protection scope of the present invention is not limited thereto, any Those familiar with the art in the technical scope disclosed by the present invention, can readily occur in change or replacement, should all contain Lid is within protection scope of the present invention.Therefore, protection scope of the present invention should be based on the protection scope of the described claims.

Claims (13)

1. A data filtering method, characterized by comprising:
a main control device obtains at least one character string to be filtered, the character string to be filtered including at least a domain name or a URL keyword;
the main control device compiles the at least one character string to be filtered into a DFA state machine by means of the AC algorithm, and sends the DFA state machine to a forwarding device; wherein the AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is set between any two adjacent state nodes in the AC tree, and the wildcard matches any character;
the forwarding device receives a message to be matched, and filters the message to be matched according to the AC tree of the DFA state machine.
2. The filtering method according to claim 1, characterized in that, before the main control device compiles the characters to be filtered into a DFA state machine by means of the AC algorithm, the method further comprises:
the main control device converts the characters to be filtered into characters under the ASCII rule, so that the main control device compiles the characters under the ASCII rule into a DFA state machine by means of the AC algorithm.
3. The filtering method according to claim 1, characterized in that, after the forwarding device receives the message to be matched, the method further comprises:
the forwarding device obtains at least one character string to be matched from each line of the message to be matched according to the length of the character string to be filtered, so as to filter the at least one character string to be matched according to the AC tree of the DFA state machine.
4. The filtering method according to claim 1, characterized in that the main control device compiling the characters to be filtered into a DFA state machine by means of the AC algorithm comprises:
the main control device configures an absorbing state node in the AC tree by means of the AC algorithm according to the characters to be filtered, wherein Link(s->t) = "*", t is the absorbing state node, s is the parent state node of the absorbing state node, and * is the wildcard;
and the forwarding device filtering the message to be matched according to the AC tree of the DFA state machine comprises:
the forwarding device, according to the AC tree of the DFA state machine, matches the character following the absorbing state node after matching any character in the message to be matched.
5. The filtering method according to claim 1, characterized in that the main control device compiling the characters to be filtered into a DFA state machine by means of the AC algorithm comprises:
the main control device configures a termination final-state node in the AC tree by means of the AC algorithm according to the characters to be filtered; the termination final-state node is used to instruct the forwarding device to stop filtering the message to be matched;
and the forwarding device filtering the message to be matched according to the AC tree of the DFA state machine comprises:
upon matching, according to the AC tree of the DFA state machine, that any character in the message to be matched is located before or after the termination final-state node, stopping the filtering of the message to be matched.
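The method claims above can be illustrated with a minimal sketch. It is not the patented implementation: it builds a plain pattern tree with a wildcard edge, written Link(s->t) = "*", and simulates it as a set of active states instead of compiling a full AC automaton with failure links into a DFA. The names `Node`, `build_tree`, and `first_match` are hypothetical, and the sketch assumes that the wildcard consumes exactly one arbitrary character, which the claims do not state explicitly.

```python
from dataclasses import dataclass, field

WILDCARD = "*"  # assumption: "*" in a pattern stands for exactly one arbitrary character


@dataclass
class Node:
    edges: dict = field(default_factory=dict)  # literal character -> child Node
    wildcard: "Node | None" = None             # child t reached via Link(s->t) = "*"
    pattern: "str | None" = None               # set when a pattern ends at this node


def build_tree(patterns):
    """Insert every pattern into a tree; "*" creates a wildcard edge."""
    root = Node()
    for p in patterns:
        node = root
        for ch in p:
            if ch == WILDCARD:
                if node.wildcard is None:
                    node.wildcard = Node()
                node = node.wildcard
            else:
                node = node.edges.setdefault(ch, Node())
        node.pattern = p  # final node for this pattern
    return root


def first_match(root, text):
    """Return the first pattern found anywhere in text, or None."""
    active = []
    for ch in text:
        active.append(root)  # a match may start at every position
        nxt = []
        for node in active:
            child = node.edges.get(ch)
            if child is not None:
                nxt.append(child)           # literal edge matches ch
            if node.wildcard is not None:
                nxt.append(node.wildcard)   # wildcard edge matches any ch
        for node in nxt:
            if node.pattern is not None:
                return node.pattern         # stop scanning once a final node is reached
        active = nxt
    return None
```

With the patterns `"bad.example"` and `"ad*.test"`, the text `"GET http://ads.test/x"` is reported as a hit for `"ad*.test"`, while `"http://ad.test/"` is not, because under the single-character assumption the wildcard must consume one character between `ad` and `.test`. A production matcher would precompute failure links so that each input character is processed in constant time.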
6. A main control device, characterized by comprising:
an acquiring unit, configured to obtain at least one character string to be filtered, the character string to be filtered including at least a domain name or a URL keyword;
a processing unit, configured to compile the at least one character string to be filtered obtained by the acquiring unit into a DFA state machine by means of the AC algorithm, wherein the AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is set between any two adjacent state nodes in the AC tree, and the wildcard matches any character;
a sending unit, configured to send the DFA state machine to a forwarding device, so that the forwarding device filters a message to be matched according to the AC tree of the DFA state machine.
7. The main control device according to claim 6, characterized in that the processing unit is further configured to convert the characters to be filtered obtained by the acquiring unit into characters under the ASCII rule, so that the characters under the ASCII rule are compiled into a DFA state machine by means of the AC algorithm.
8. The main control device according to claim 6, characterized in that the processing unit is specifically configured to configure an absorbing state node in the AC tree by means of the AC algorithm according to the characters to be filtered, wherein Link(s->t) = "*", t is the absorbing state node, s is the parent state node of the absorbing state node, and * is the wildcard.
9. The main control device according to claim 6, characterized in that the processing unit is specifically configured to configure a termination final-state node in the AC tree by means of the AC algorithm according to the characters to be filtered; the termination final-state node is used to instruct the forwarding device to stop filtering the message to be matched.
10. A forwarding device, characterized by comprising:
a receiving unit, configured to receive a message to be matched and a DFA state machine; wherein the AC tree corresponding to the DFA state machine contains a wildcard, the wildcard is set between any two adjacent state nodes in the AC tree, and the wildcard matches any character;
a processing unit, configured to filter the message to be matched according to the AC tree of the DFA state machine received by the receiving unit.
11. The forwarding device according to claim 10, characterized in that
the processing unit is further configured to obtain at least one character string to be matched from each line of the message to be matched according to the length of the character string to be filtered, so as to filter the at least one character string to be matched according to the AC tree of the DFA state machine.
12. The forwarding device according to claim 10, characterized in that
the processing unit is specifically configured to, according to the AC tree of the DFA state machine, match the character following the absorbing state node after matching any character in the message to be matched, wherein Link(s->t) = "*", t is the absorbing state node, s is the parent state node of the absorbing state node, and * is the wildcard.
13. The forwarding device according to claim 10, characterized in that
the processing unit is specifically configured to, upon matching, according to the AC tree of the DFA state machine, that any character in the message to be matched is located before or after the termination final-state node, stop filtering the message to be matched; the termination final-state node is used to instruct the forwarding device to stop filtering the message to be matched.
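The step of obtaining character strings to be matched from each line of the message according to the length of the character string to be filtered can be read in more than one way. The following is a minimal sketch of one possible reading, in which a window of the given length slides over every line. The function name `candidate_strings` and the sliding-window interpretation are assumptions made for illustration and are not taken from the claims.

```python
def candidate_strings(message: str, length: int):
    """Yield every substring of `length` characters from each line of `message`.

    Assumption: `length` is the length of one character string to be filtered,
    and lines shorter than `length` yield no candidates.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    for line in message.splitlines():
        # range() is empty when the line is shorter than the window
        for start in range(len(line) - length + 1):
            yield line[start:start + length]
```

For example, `list(candidate_strings("abcd\nxy", 3))` yields `["abc", "bcd"]`: the first line contributes two windows and the second line is too short to contribute any. Each candidate would then be passed to the matching step.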
CN201711479497.XA 2017-12-29 2017-12-29 Data filtering method and equipment Active CN108170812B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711479497.XA CN108170812B (en) 2017-12-29 2017-12-29 Data filtering method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711479497.XA CN108170812B (en) 2017-12-29 2017-12-29 Data filtering method and equipment

Publications (2)

Publication Number Publication Date
CN108170812A true CN108170812A (en) 2018-06-15
CN108170812B CN108170812B (en) 2020-06-19

Family

ID=62516446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711479497.XA Active CN108170812B (en) 2017-12-29 2017-12-29 Data filtering method and equipment

Country Status (1)

Country Link
CN (1) CN108170812B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101154228A (en) * 2006-09-27 2008-04-02 西门子公司 Partitioned pattern matching method and device thereof
US20100266215A1 (en) * 2009-04-17 2010-10-21 Alcatel-Lucent Usa Inc. Variable-stride stream segmentation and multi-pattern matching
CN105045808A (en) * 2015-06-08 2015-11-11 北京天元特通科技有限公司 Composite rule set matching method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PO-CHING LIN et al.: "Using String Matching for Deep Packet Inspection", IEEE *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110896380A (en) * 2019-11-28 2020-03-20 迈普通信技术股份有限公司 Flow table screening method and device, electronic equipment and readable storage medium
CN113505585A (en) * 2021-07-15 2021-10-15 中南大学湘雅医院 High-speed character string feature matching method, device and equipment based on primitive state machine
CN113505585B (en) * 2021-07-15 2023-03-21 中南大学湘雅医院 High-speed character string feature matching method, device and equipment based on primitive state machine

Also Published As

Publication number Publication date
CN108170812B (en) 2020-06-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 610041 15-24 floor, 1 1 Tianfu street, Chengdu high tech Zone, Sichuan

Patentee after: MAIPU COMMUNICATION TECHNOLOGY Co.,Ltd.

Address before: 610041, 17 floor, maple building, 1 building, 288 Tianfu street, Chengdu, Sichuan.

Patentee before: MAIPU COMMUNICATION TECHNOLOGY Co.,Ltd.

CP02 Change in the address of a patent holder

Address after: 610041 nine Xing Xing Road 16, hi tech Zone, Sichuan, Chengdu

Patentee after: MAIPU COMMUNICATION TECHNOLOGY Co.,Ltd.

Address before: 610041 15-24 floor, 1 1 Tianfu street, Chengdu high tech Zone, Sichuan

Patentee before: MAIPU COMMUNICATION TECHNOLOGY Co.,Ltd.