WO2014000485A1 - Content filtration method and device - Google Patents

Content filtration method and device

Info

Publication number
WO2014000485A1
WO2014000485A1 PCT/CN2013/073462 CN2013073462W WO2014000485A1 WO 2014000485 A1 WO2014000485 A1 WO 2014000485A1 CN 2013073462 W CN2013073462 W CN 2013073462W WO 2014000485 A1 WO2014000485 A1 WO 2014000485A1
Authority
WO
WIPO (PCT)
Prior art keywords
rule
matching
content
filtering
condition
Prior art date
Application number
PCT/CN2013/073462
Other languages
French (fr)
Chinese (zh)
Inventor
尤里•哈桑
艾维•菲尔
莫默
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2014000485A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9536Search customisation based on social or collaborative filtering
    • G06F16/9535Search customisation based on user profiles and personalisation

Definitions

  • Embodiments of the present invention relate to data processing technologies, and in particular, to a content filtering method and apparatus.
  • Background
  • A filtering policy can be configured to filter webpages containing certain types of content, thereby restricting prohibited behaviors of internal users of an enterprise network, such as accessing undesirable websites or watching online movies.
  • The prior art typically classifies a target website by using the Uniform Resource Locator (URL) address in a Hypertext Transfer Protocol (HTTP) request message. If the webpage is found to be of a type that should be filtered, such as pornography or violence, the HTTP request is redirected to a prompt page, or the network connection is disconnected directly.
  • In existing content filtering technology, the user generally presets rule conditions and filter conditions. A pre-compiled filter matches the URL address of the requested webpage against the rule conditions, and a URL address that matches a rule condition is then matched against the filter conditions and either blocked or released.
  • The rule condition may be, for example, a single string matching condition such as "if URL contains sina" or "if URL equals www.abc.com". Each rule condition may be compiled into a DFA graph based on the Deterministic Finite-State Automaton (DFA) algorithm, and each webpage address is exactly matched against the DFA graph to determine whether it satisfies the rule condition.
  • The filter condition may be, for example, a policy of "releasing the webpage if the URL contains sina", or a policy of "blocking or redirecting the webpage if the URL equals www.abc.com". Therefore, a webpage address that matches a rule condition must be further matched against the filter conditions to determine which processing policy to execute.
  • In the prior art, rule condition matching for content filtering of URL addresses is performed by using DFA graphs.
  • When the number of rule conditions is too large, or when complex rule condition configurations must be supported, for example regular expressions that include wildcards such as ".*/abc.*/news" and ".*\.www\.domain.*\.com", DFA graphs consume a large amount of memory. This is the main drawback of the DFA algorithm.
  • the prior art can use a compressed DFA, such as the D2FA (Delayed DFA) algorithm instead of the standard DFA for matching, but the matching performance is low because the time efficiency of the D2FA algorithm is several times lower than the standard DFA.
  • Embodiments of the present invention provide a content filtering method and apparatus to reduce memory usage of content filtering and obtain a good matching effect.
  • An embodiment of the present invention provides a content filtering method, including:
  • the rule condition is accurately matched to the content to be filtered
  • An embodiment of the present invention further provides a content filtering apparatus, including a content obtaining module, a content filtering module, and a policy implementation module, where
  • the content obtaining module is configured to acquire content to be filtered;
  • the content filtering module includes:
  • a keyword extracting unit, configured to extract keywords from one or more input rule conditions respectively;
  • a group compiling unit, configured to divide the one or more rule conditions into one or more groups according to the extracted keywords, so that rule conditions in the same group have the same keyword, and to precompile a group matching data set for the extracted keywords;
  • a rule condition compiling unit, configured to precompile, for each of the extracted keywords, an exact matching data set for the rule conditions of the group corresponding to that keyword;
  • a group matching unit, configured to perform keyword matching on the content to be filtered by using the group matching data set, to obtain a matched keyword;
  • a rule condition matching unit, configured to perform exact matching of rule conditions on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword;
  • the policy implementation module is configured to perform a filtering policy corresponding to the matching result according to the matching result of the exact matching.
  • In the content filtering method and apparatus provided by the embodiments of the present invention, because group pre-filtering is performed on the rule conditions based on keywords, the number of rule conditions in each group is small, and the total memory occupied by the exact matching data sets constructed for the groups of rule conditions is less than the memory occupied by a single data set formed by precompiling all rule conditions together.
  • The technical solution of the embodiments of the present invention therefore optimizes matching performance while occupying less memory, and obtains a more accurate matching result.
  • FIG. 1 is a flowchart of a content filtering method according to Embodiment 1 of the present invention.
  • FIG. 2 is a flowchart of a content filtering method according to Embodiment 3 of the present invention.
  • FIG. 3 is a flowchart of a content filtering method according to Embodiment 4 of the present invention.
  • FIG. 5 is a flowchart of an applicable example of Embodiment 5 of the present invention.
  • FIG. 6 is a schematic structural diagram of a content filtering apparatus according to Embodiment 6 of the present invention.
  • FIG. 7 is a schematic structural diagram of a content filtering apparatus according to Embodiment 7 of the present invention.
  • FIG. 8 is a schematic structural diagram of a content filtering apparatus according to Embodiment 8 of the present invention.
  • FIG. 9 is a schematic diagram of a network architecture applicable to Embodiment 9 of the present invention.
  • FIG. 10 is a schematic diagram of a process for extracting a keyword in a content filtering method according to Embodiment 9 of the present invention.
  • FIG. 11 is a schematic diagram of a filtering process performed in a content filtering method according to Embodiment 9 of the present invention.
  • FIG. 12 is a schematic diagram showing a correspondence between a packet and an algorithm in a content filtering method according to an embodiment of the present invention.
  • FIG. 13 is a schematic structural diagram of a computer system according to an embodiment of the present invention.
  • FIG. 14 is a schematic structural diagram of a computer system according to another embodiment of the present invention.
  • Detailed Description
  • FIG. 1 is a flowchart of a content filtering method according to Embodiment 1 of the present invention.
  • the content filtering method in this embodiment may be applicable to various scenarios that need to filter text content, and may be implemented by software and/or hardware.
  • Web content filtering, typically performed based on a text application layer protocol, can be implemented by software integrated in the gateway.
  • the content filtering method mainly includes a pre-compilation process for the rule condition and a filtering process for the content to be filtered, and specifically includes the following steps:
  • Step 110: Extract keywords from one or more input rule conditions respectively;
  • Step 120: Divide the one or more rule conditions into one or more groups according to the extracted keywords, so that rule conditions in the same group have the same keyword, and precompile a group matching data set for the extracted keywords;
  • Step 130: For each of the extracted keywords, precompile an exact matching data set for the rule conditions of the group corresponding to that keyword;
  • Steps 110-130 above form the pre-compilation process, which compiles the rule conditions input by the user so that the content to be filtered can be matched quickly when the filtering process is executed.
  • Step 140: Obtain the content to be filtered.
  • Step 150: Perform keyword matching on the content to be filtered by using the group matching data set, to obtain a matched keyword;
  • Step 160: Perform exact matching of rule conditions on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword;
  • Step 170: Perform a filtering policy corresponding to the matching result according to the matching result of the exact matching.
  • Steps 140-170 above form the content filtering process, which matches the content to be filtered against the matching data sets constructed by the pre-compilation process.
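  • The two phases above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation described in this document: the function names, the sample rule conditions, and the keyword policy (longest literal run, cut to 4 characters) are assumptions, Python's `re` module stands in for the exact matching algorithm, and a plain substring test stands in for the group matching data set.

```python
import re
from collections import defaultdict


def extract_keyword(pattern, max_len=4):
    """Hypothetical policy: longest literal run in the pattern, cut to max_len characters.

    Assumes every pattern contains at least one literal run and uses only
    literal text, '.', '*', '/' and escaped dots.
    """
    runs = re.findall(r"[A-Za-z0-9]+", pattern)
    return max(runs, key=len)[:max_len]


def precompile(rule_conditions):
    """Steps 110-130: extract keywords, group the rule conditions, build both data sets."""
    groups = defaultdict(list)
    for cond_id, pattern in rule_conditions.items():
        groups[extract_keyword(pattern)].append((cond_id, re.compile(pattern)))
    group_matching_set = frozenset(groups)  # keywords only
    exact_matching_sets = dict(groups)      # keyword -> compiled rule conditions
    return group_matching_set, exact_matching_sets


def filter_content(content, group_matching_set, exact_matching_sets):
    """Steps 150-160: keyword pre-filter, then exact matching inside the hit groups only."""
    matched = set()
    for keyword in group_matching_set:
        if keyword in content:  # step 150: keyword matching
            for cond_id, regex in exact_matching_sets[keyword]:
                if regex.search(content):  # step 160: exact matching
                    matched.add(cond_id)
    return matched


# Hypothetical rule conditions, keyed by condition identifier.
RULES = {1: r"www\.huawei\.com", 2: r"huawei.*/news", 3: r"www\.sina\.com"}
GROUP_SET, EXACT_SETS = precompile(RULES)
```

  • Content that contains no keyword never reaches the exact matching stage, which is where the saving comes from; step 170 would then map the returned condition identifiers to a filtering policy.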
  • The matching data sets used for the rule conditions and filtering rules in content filtering technology may be referred to as a content filtering rule base. The rule conditions and filtering rules are generally configured dynamically by a user such as an administrator, rather than being updated manually or remotely by the device provider on a periodic basis. Therefore, how to automatically construct an efficient content filtering rule base from the rule conditions and filtering rules entered by the user is a key issue in implementing the content filtering method.
  • A rule condition generally specifies the content to be matched in one field of a text application protocol. If multiple fields need to be matched in the filtering process, for example a URL address, a Content-Type header field, and a User-Agent header field, the pre-compilation process is performed separately for the rule conditions corresponding to each field.
  • The pre-compilation process in this embodiment is described by taking one field as an example. If there are rule conditions for the contents of multiple fields, the technical solution of this embodiment may be executed repeatedly, once for each field.
  • The keywords are extracted from the rule conditions based on a preset policy. A keyword is a field that represents the core content of a rule condition with as few characters as possible.
  • The preset policy for extracting keywords that meet this requirement can be implemented in various ways, which are introduced in subsequent embodiments. Because the extracted keywords reflect the core content of the rule conditions, grouping the rule conditions that have the same keyword into one group places rule conditions with similar content in the same group. Having the same keyword is not strictly limited to having identical text; associated keywords can also be regarded as the same keyword based on the preset policy.
  • On the one hand, a group matching data set is pre-compiled for all keywords; on the other hand, an exact matching data set is pre-compiled for each group of rule conditions.
  • The so-called data set is data pre-compiled according to a content matching algorithm, so that string comparison can be completed quickly when matching is performed. Algorithms such as a pure string matching algorithm, the Nondeterministic Finite-State Automaton (NFA) algorithm, and the DFA algorithm can all be used to build a matching data set.
  • Both the group matching data set and the exact matching data set preferably employ a matching algorithm capable of exactly matching character strings.
  • The choice of algorithm can take the balance between performance and memory footprint into account. In general, a higher-performance algorithm consumes more memory, and vice versa. Most network data has to be processed by the group matching algorithm, while only a small amount of data hits a group and proceeds to exact matching. Therefore, the group matching algorithm for keywords can be biased toward performance, to ensure that keywords are matched quickly, whereas the exact matching algorithm for rule conditions can be biased toward a smaller memory footprint, so that memory usage does not grow excessively as the number of rule conditions increases.
  • The content to be filtered is first matched against the group matching data set to identify whether the content to be filtered contains a keyword, and which keyword it contains.
  • The content to be filtered is then exactly matched against the rule conditions by using the exact matching data set of the group corresponding to the matched keyword.
  • The matching result is either that a rule condition is matched or that no rule condition is matched.
  • This matching result can be used as the basis for subsequent filtering rule identification or for execution of the corresponding processing policy.
  • If the content to be filtered does not contain any keyword, it obviously does not match any rule condition, and the exact matching need not be performed.
  • This matching result may also be used as a basis for executing the subsequent filtering policy.
  • In the technical solution of this embodiment, because group pre-filtering is performed on the rule conditions based on keywords, the number of rule conditions in each group is small, and the total memory occupied by the constructed exact matching data sets is less than that occupied by a single data set compiled from all rule conditions. After the group pre-filtering, the content to be filtered is exactly compared with the rule conditions, so the matching accuracy is high. Therefore, the technical solution of this embodiment optimizes matching performance while occupying less memory, and obtains a more accurate matching result.
  • In the keyword extraction operation of step 110, it is still possible that no keyword can be extracted from a rule condition according to the preset policy.
  • A rule condition from which no keyword can be extracted may be discarded, but it is preferable to perform the following operations:
  • When it is recognized that no keyword can be extracted from an input rule condition, the rule condition is put into a to-be-prompted group, an exact matching data set is pre-compiled for the rule conditions of the to-be-prompted group, and a bad rule condition prompt is issued to the user.
  • The method further includes: when the content to be filtered does not match any keyword, using the exact matching data set corresponding to the rule conditions of the to-be-prompted group to perform exact matching of rule conditions on the content to be filtered that does not match any keyword.
  • If no keyword can be extracted, content to be filtered cannot be grouped by keyword before being exactly matched against such rule conditions, and only a complete exact match can be performed. Exactly matching all content to be filtered that contains no keyword further ensures the accuracy of the filtering, but it does not help reduce memory usage. In addition, exact matching against such rule conditions is usually slower than group matching and costs considerable time. Therefore, in this situation a bad rule condition prompt can be sent to the user, indicating that such rule conditions increase the time and space burden on the system and that setting them should be avoided.
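  • The handling of rule conditions without a keyword can be sketched as follows. This is a hedged illustration only: the names `split_rules` and `match`, the prompt text, and the keyword policy (a literal run of at least 3 characters, cut to 4) are assumptions made for the example, and Python's `re` module again stands in for the exact matching data set.

```python
import re


def extract_keyword(pattern):
    """Hypothetical policy: longest literal run of at least 3 characters, cut to 4; else None."""
    runs = [r for r in re.findall(r"[A-Za-z0-9]+", pattern) if len(r) >= 3]
    return max(runs, key=len)[:4] if runs else None


def split_rules(rule_conditions):
    """Group rule conditions by keyword; collect keyword-less ones in the to-be-prompted group."""
    grouped, to_be_prompted, prompts = {}, [], []
    for cond_id, pattern in rule_conditions.items():
        keyword = extract_keyword(pattern)
        compiled = (cond_id, re.compile(pattern))
        if keyword is None:
            to_be_prompted.append(compiled)
            prompts.append(f"bad rule condition {cond_id}: no keyword can be extracted")
        else:
            grouped.setdefault(keyword, []).append(compiled)
    return grouped, to_be_prompted, prompts


def match(content, grouped, to_be_prompted):
    """Use keyword groups when a keyword hits; otherwise fall back to the to-be-prompted group."""
    hit_keywords = [k for k in grouped if k in content]
    if hit_keywords:
        candidates = [c for k in hit_keywords for c in grouped[k]]
    else:
        candidates = to_be_prompted
    return {cond_id for cond_id, regex in candidates if regex.search(content)}


# Hypothetical rule conditions; condition 2 has no usable literal run.
RULES = {1: r"www\.abc\.com", 2: r".*\.[a-z]{2}"}
GROUPED, TO_BE_PROMPTED, PROMPTS = split_rules(RULES)
```

  • Every request that hits no keyword pays for a full exact match against the to-be-prompted group, which is why the text recommends warning the user about such rule conditions.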
  • The content to be filtered may be obtained by using deep packet inspection (DPI) technology to perform protocol identification on the received data packet, where the text protocol types subject to content filtering include HTTP, the Session Initiation Protocol (SIP), the Real Time Streaming Protocol (RTSP), and the like; and, based on the identified protocol, performing field parsing on the data packet to obtain at least one preset field.
  • the filtering rule is a combination of one or more rule conditions, and the filtering rule is formed by combining one or more rule conditions corresponding to one or more preset fields.
  • the preset field may include a request method of an HTTP message in an HTTP protocol packet, a request URL, and a content type.
  • the content filtering method provided by the second embodiment of the present invention may further improve the pre-compilation and filtering process of the filtering rule based on the foregoing embodiment.
  • Pre-compilation and matching of the filtering rules can be performed based on various technologies. For example, after the rule conditions are matched, the corresponding identifiers are recorded, the identifiers are then matched against each filtering rule in turn, and the corresponding filtering policy is executed. Alternatively, a tree structure is used to construct the filtering rules, and the matched rule conditions are matched in the tree structure.
  • This embodiment provides another preferred filtering rule matching scheme. At any time of the pre-compilation process, the following steps are performed:
  • Performing a filtering policy corresponding to the matching result according to the matching result of the exact matching includes:
  • The condition identifier of each rule condition to which the content to be filtered is exactly matched is used as a character, and these characters are matched against the filter matching data set to determine the filtering rule that is satisfied by the rule conditions matched by the content to be filtered.
  • A filtering rule is usually composed of one or more rule conditions. When all of its rule conditions are satisfied by the content to be filtered, the filtering rule is successfully matched and the corresponding filtering policy is executed. For example, the webpage is redirected to a prompt page informing the user that the request has been blocked; the webpage is directly discarded and the client connection is reset; or the webpage is released.
  • Because the condition identifier of a rule condition is used as a character, a filtering rule takes the form of a character string of condition identifiers, that is, the filtering rule is converted into a regular expression over condition identifiers. Multiple filtering rules can therefore be uniformly pre-compiled to realize multi-pattern matching, so that a single match determines which filtering rule the content to be filtered satisfies, without querying multiple times, which optimizes filtering performance.
  • During filter matching, the fields are processed in the order predefined by the filtering rules:
  • the first content to be filtered is the "Domain" field, and the condition identifier of the rule condition that it matches is recorded;
  • the second content to be filtered is the "User-Agent" field, and the condition identifier of the rule condition that it matches is recorded;
  • the third content to be filtered is the "Content-Type" field, and the condition identifier of the rule condition that it matches is recorded. Note that the last character of the regular expression is ".", indicating any condition identifier;
  • The filtering policy can then be determined from the matched filtering rule.
  • When the number of condition identifiers is greater than 255, that is, when a single byte cannot be used as a condition identifier, all rule conditions can be identified by double-byte condition identifiers.
  • For example, the third condition identifier below is 525, that is, hexadecimal 0x020d.
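  • The idea of treating condition identifiers as characters can be sketched with Python's `re` module. The rule table, the policy names, and the helper functions below are hypothetical. The sketch also swaps in one simplification: a Python string character is a Unicode code point, so an identifier such as 525 (0x020d) fits in one character here, whereas the text above uses a double-byte encoding.

```python
import re

# Hypothetical filtering rules: one condition identifier per field, in the order
# Domain, User-Agent, Content-Type. None stands for "any" (the "." wildcard).
FILTER_RULES = {
    "block": (1, 3, None),
    "redirect": (2, None, 525),
}


def compile_filter_rules(rules):
    """Precompile all filtering rules into one regular expression over identifier characters."""
    parts = []
    for policy, cond_ids in rules.items():
        body = "".join("." if c is None else re.escape(chr(c)) for c in cond_ids)
        parts.append(f"(?P<{policy}>{body})")
    # DOTALL so that "." also accepts identifier 10, which is the newline character.
    return re.compile("|".join(parts), re.DOTALL)


def lookup_policy(compiled, matched_ids):
    """One match over the recorded identifiers decides which filtering rule applies."""
    key = "".join(chr(c) for c in matched_ids)
    m = compiled.fullmatch(key)
    return m.lastgroup if m else None


FILTER_SET = compile_filter_rules(FILTER_RULES)
```

  • In this sketch identifier 0 can be recorded for a field that matched no rule condition; it only satisfies a rule whose position is the wildcard.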
  • FIG. 2 is a flowchart of a content filtering method according to Embodiment 3 of the present invention.
  • The foregoing embodiments describe the pre-compilation of the rule conditions and filtering rules input by the user at the initial stage. In practical applications, the user can add, delete, and change rule conditions and filtering rules at any time, where a change is equivalent to a deletion followed by an addition.
  • This embodiment optimizes the operation of adding a rule condition, and the content filtering method further performs the following operations:
  • Step 210: When a newly added rule condition is obtained, extract a keyword from the newly added rule condition;
  • Step 220: Search for or create the corresponding group according to the keyword extracted from the newly added rule condition, and recompile the group matching data set.
  • This step may first search the existing groups for the keyword. If the keyword is not found, a new group is created for the keyword and the group matching data set is recompiled; if the keyword is found, the group matching data set does not need to be recompiled.
  • Step 230: Precompile the exact matching data set of the rule conditions of the corresponding group according to the newly added rule condition.
  • This step handles both existing groups and new groups, and the exact matching data set is recompiled. Data sets implemented by different algorithms may be compiled in different ways. For example, if DFA is used to compile all rule conditions within a group into one state machine, the entire DFA state machine must be recompiled; if the group uses one-by-one single-pattern matching, only the new rule condition needs to be compiled and added to the matching chain.
  • Step 240: Assign a condition identifier to the newly added rule condition, and recompile the filter matching data set.
  • The technical solution of this embodiment enables the user to add new rule conditions flexibly.
  • A newly added rule condition only requires updating the group matching data set, the filter matching data set, and one exact matching data set. If the new rule condition does not produce a new keyword, the group matching data set does not need to be updated either. Unlike the prior art, it is not necessary to adjust all the pre-compiled data sets.
  • FIG. 3 is a flowchart of a content filtering method according to Embodiment 4 of the present invention. Based on the above embodiments, this embodiment further optimizes the operation of deleting a rule condition.
  • The content filtering method further includes the following steps:
  • Step 310: According to an input rule condition deletion instruction, determine the rule condition to be deleted or the condition identifier corresponding to the rule condition to be deleted, and extract the keyword from the rule condition to be deleted;
  • Step 320: Update the group matching data set according to the keyword extracted from the rule condition to be deleted.
  • Step 330: If the rule condition needs to be deleted, recompile the exact matching data set for the rule conditions of the group corresponding to the keyword extracted from the rule condition to be deleted, so as to delete that rule condition.
  • Step 340: If the condition identifier corresponding to the rule condition to be deleted needs to be deleted, recompile the filter matching data set to delete that condition identifier.
  • The technical solution of this embodiment can flexibly delete rule conditions without adjusting all the pre-compiled data sets.
  • Adding, deleting, and changing filtering rules is similar to doing so for rule conditions: the filter matching data set can be recompiled according to a newly added filtering rule or a filtering rule deletion instruction to add or delete the filtering rule.
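  • The incremental add and delete operations of Embodiments 3 and 4 can be sketched as a small class that rebuilds only what a change touches. The class name `RuleBase`, its `recompiled` log, and the injected `extract_keyword` policy are assumptions made for illustration; recompiling the filter matching data set (steps 240 and 340) is left out to keep the sketch short.

```python
import re


class RuleBase:
    """Keeps per-keyword groups and records which data sets each change rebuilds."""

    def __init__(self, extract_keyword):
        self.extract_keyword = extract_keyword  # hypothetical keyword policy
        self.groups = {}             # keyword -> {cond_id: pattern}
        self.exact = {}              # keyword -> compiled exact matching data set
        self.keywords = frozenset()  # group matching data set
        self.recompiled = []         # log of rebuilt data sets, for illustration

    def _compile_group(self, keyword):
        self.exact[keyword] = [
            (cid, re.compile(p)) for cid, p in self.groups[keyword].items()
        ]
        self.recompiled.append(f"exact:{keyword}")

    def _compile_keywords(self):
        self.keywords = frozenset(self.groups)
        self.recompiled.append("group")

    def add(self, cond_id, pattern):
        keyword = self.extract_keyword(pattern)       # step 210
        is_new_group = keyword not in self.groups
        self.groups.setdefault(keyword, {})[cond_id] = pattern
        if is_new_group:
            self._compile_keywords()                  # step 220, new keyword only
        self._compile_group(keyword)                  # step 230

    def delete(self, cond_id, pattern):
        keyword = self.extract_keyword(pattern)       # step 310
        del self.groups[keyword][cond_id]
        if self.groups[keyword]:
            self._compile_group(keyword)              # step 330
        else:
            del self.groups[keyword], self.exact[keyword]
            self._compile_keywords()                  # step 320, group became empty
```

  • Adding a second rule condition to an existing group logs only one `exact:` entry, which reflects the point made above that the other pre-compiled data sets stay untouched.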
  • FIG. 4 is a flowchart of a content filtering method according to Embodiment 5 of the present invention.
  • In the foregoing embodiments, keyword extraction is performed, and the quality of the keyword extraction directly affects the performance and accuracy of the subsequent group matching.
  • The operation of extracting keywords from one or more input rule conditions may be implemented in various ways, for example, including the following steps:
  • Step 410: Perform field division on the input rule condition according to a preset division policy.
  • Step 420: Screen the divided fields based on a preset screening policy to obtain the keyword of the rule condition.
  • The operation of screening the divided fields based on the preset screening policy to obtain the keyword of the rule condition preferably performs the following process:
  • fields that match a field in the blacklist are deleted; according to the recorded number of false hits of each field, fields whose number of false hits is higher than the hit threshold are deleted.
  • Among the candidate fields of a rule condition, the field whose keyword group contains the fewest rule conditions is selected as the keyword of the rule condition.
  • Multiple screening policies can be set according to requirements, and their execution order is not limited.
  • The divided fields can be screened in multiple rounds to obtain the fields that carry the core content of the rule conditions.
  • The screening policy for keywords is not limited to the above items.
  • The basis for determining the preferred screening policy is as follows: the higher the number of false hits or the false hit rate of a keyword, the lower the actual matching performance; the more rule conditions a group contains, the more memory is occupied. Therefore, the keyword extraction policy should try to balance matching performance and memory usage.
  • The blacklist, the whitelist, and the number of false hits can be updated through dynamic statistics. For example, after exact matching of rule conditions is performed on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword, the method further includes: when content to be filtered that matches a keyword does not match any corresponding rule condition in the exact matching data set, updating the number of false hits of the keyword;
  • Updating the blacklist, the whitelist, and the number of false hits improves the accuracy of the keyword extraction policy, thereby optimizing the matching performance of content filtering.
  • The keyword extraction, grouping, and pre-compilation operations may be re-executed on the existing rule conditions according to a set period, the number of false hits, the blacklist, and the like, to optimize the pre-compiled data sets and obtain better matching performance.
  • FIG. 5 is a flowchart of an applicable example of Embodiment 5 of the present invention.
  • A keyword dynamic statistical table is maintained in the system, as shown in Table 1. The number of false hits can be refreshed while the content filtering method is running, for example according to a set period, or in real time according to a set trigger condition.
  • Table 1
        Keyword | Number of false hits | Number of rule conditions grouped by this keyword | Blacklisted
        huaw    | 1                    | 2                                                 | No
        goog    | 5                    | 1                                                 | No
  • Blacklists and whitelists can be statically configured. Alternatively, a keyword whose number of false hits is above a set threshold is added to the blacklist, or a keyword whose number of false hits is below a set threshold is added to the whitelist. In practical applications, either the number of false hits or the false hit rate can be used as the factor.
  • The keyword dynamic maintenance table is updated in real time as keywords are extracted or deleted and as content filtering is performed.
  • Step 501: Obtain rule conditions entered online in string form by a user such as the device administrator. For example, the following rule conditions are input, where a rule condition may include the wildcard *, a character value range [x-y], and the like:
  • Step 502: Perform field division on the input rule conditions according to a preset division policy, for the purpose of grouping the rule conditions by keyword;
  • The fields are divided at preset separators such as ".", "[", "]", or spaces, and the number of characters in a field can be limited, for example by truncating each string to a set threshold such as 4 characters. The above rule conditions are thereby divided into the fields ⁇, huaw, com, goog, sina, yaho, micr, msdn, and news.
  • Step 503: Delete the fields that are in the blacklist, based on the keyword dynamic maintenance table shown in Table 1;
  • Fields in the blacklist are usually too common to be useful for filtering.
  • Step 504: Among the fields remaining after the blacklisted fields are deleted, delete the fields whose number of false hits is higher than the hit threshold, according to the recorded number of false hits of each field;
  • If the hit threshold is set to 4, then huaw, sina, yaho, msdn, and news are the filtered fields;
  • Step 505: From the filtered fields, identify the number of rule conditions corresponding to each field, and select, for each rule condition, the field whose keyword group contains the fewest rule conditions as the keyword of that rule condition;
  • The keywords corresponding to each rule condition after the screening in step 505 are:
  • For rule condition 5, of the keyword groups yaho and news, the yaho group contains 1 rule condition, fewer than the news group, so rule condition 5 selects yaho as its keyword. Similarly, rule condition 7 selects msdn as its keyword.
  • The number of rule conditions per keyword group in Table 1 is updated in real time as the keyword of each rule condition is determined.
  • Rule conditions from which no keyword can be extracted are bad rule conditions and need to be prompted to the user.
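  • Steps 502-505 can be sketched as follows. The rule conditions, blacklist, and false-hit counts below are hypothetical stand-ins, because the original example list is not reproduced here; the function names are also assumptions, and "*" is treated as a separator as an extra assumption. As a simplification, the group size is counted over candidate fields rather than over final keyword assignments.

```python
import re


def split_fields(rule_condition, max_len=4):
    """Step 502: split at '.', '[', ']', whitespace and '*', then cut each field to max_len."""
    fields = re.split(r"[.\[\]\s*]+", rule_condition)
    return [f[:max_len] for f in fields if f]


def choose_keywords(rule_conditions, blacklist, false_hits, hit_threshold):
    """Steps 503-505: drop blacklisted and high-false-hit fields, then pick the smallest group."""
    candidates = {}
    for cond_id, condition in rule_conditions.items():
        candidates[cond_id] = [
            f for f in split_fields(condition)
            if f not in blacklist and false_hits.get(f, 0) <= hit_threshold
        ]
    # Count how many rule conditions could fall into each field's group.
    group_size = {}
    for fields in candidates.values():
        for f in set(fields):
            group_size[f] = group_size.get(f, 0) + 1
    return {
        cond_id: (min(fields, key=lambda f: group_size[f]) if fields else None)
        for cond_id, fields in candidates.items()
    }


# Hypothetical inputs for illustration only.
RULES = {
    1: "www.google.com",
    2: "news.sina.com",
    3: "news.yahoo.com",
    4: "www.huawei.com",
}
KEYWORDS = choose_keywords(
    RULES, blacklist={"www", "com"}, false_hits={"goog": 5, "huaw": 1}, hit_threshold=4
)
```

  • A result of `None` marks a rule condition from which no keyword could be extracted, which is the case that should be prompted to the user.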
  • after the rule conditions are grouped by keyword, the exact matching data set of each group can be pre-compiled with a different compiling algorithm.
  • pre-compiling the exact matching data set for the rule conditions of the group corresponding to each extracted keyword may specifically include:
  • for a group whose number of rule conditions is less than a pre-configured threshold, an NFA, DFA, or compressed DFA regular expression matching algorithm is used to pre-compile the exact matching data set for the group's rule conditions (one implementation of the NFA regular expression algorithm is, for example, PCRE (Perl Compatible Regular Expressions)), or a single-pattern string matching algorithm, such as the BM (Boyer-Moore) matching algorithm, is used to pre-compile the exact matching data set.
  • after identifying that the number of rule conditions is less than the pre-configured threshold, it may further be determined whether any regular-expression elements, such as wildcards or character ranges, occur in the rule condition; if so, NFA, DFA, or compressed DFA is used, and otherwise the BM matching algorithm is used;
  • for a group whose number of rule conditions is equal to or greater than the pre-configured threshold, the DFA or compressed DFA regular expression matching algorithm is used to pre-compile all rule conditions of the group into one exact matching data set, such as a DFA or D2FA state machine.
  • the pre-configured threshold can be set to 8, for example, so that D2FA multi-pattern matching is used where it outperforms matching the rule conditions one by one with a single-pattern algorithm. Alternatively, if memory footprint is preferred regardless of the number of rules, the NFA regular expression matching algorithm can always be used to pre-compile the rule conditions one by one into exact matching structures;
  • for a group that contains rule conditions with set complex definition parameters, the NFA or compressed DFA regular expression matching algorithm is used to pre-compile the exact matching data set for the group's rule conditions.
  • a rule condition with set complex definition parameters may be a rule condition that, as determined empirically, reaches a certain degree of complexity. If such a rule condition is compiled into a DFA state machine, the number of states increases sharply and occupies a large amount of memory; an example is a floating rule condition containing "*" or "?".
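  • The per-group choice of compiling algorithm described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function names, the set of characters treated as regular-expression elements, the test for a "complex" rule condition, and the order in which the three cases are checked are not specified by the text and are chosen here for the example.

```python
import re

# Assumption: these characters mark regular-expression elements; '.' is treated
# as a literal because it is ubiquitous in URL rule conditions.
REGEX_META = re.compile(r"[*?\[\]()|+{}\\^$]")


def has_regex_elements(rule_condition):
    return bool(REGEX_META.search(rule_condition))


def is_complex(rule_condition):
    # Assumption: a rule condition containing '*' or '?' counts as "complex".
    return "*" in rule_condition or "?" in rule_condition


def choose_algorithm(group, threshold=8):
    """Return the family of matching algorithm for one keyword group."""
    if any(is_complex(rc) for rc in group):
        return "NFA or compressed DFA"          # avoid DFA state explosion
    if len(group) >= threshold:
        return "DFA or compressed DFA"          # multi-pattern matching pays off
    if any(has_regex_elements(rc) for rc in group):
        return "NFA, DFA or compressed DFA"     # small group with regex elements
    return "BM"                                 # small group of plain strings
```

  • For example, a group holding only the plain string "www.huawei.com" would fall through to the single-pattern BM case.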
  • the rule conditions are grouped according to the selected keywords; with the group's pre-configured threshold set to 2, the grouping and the exact matching data set used by each group can be as shown in Table 2 below:
  • FIG. 6 is a schematic structural diagram of a content filtering apparatus according to Embodiment 6 of the present invention.
  • the content filtering apparatus may be integrated into an apparatus for performing content filtering, such as an enterprise gateway, for performing the content filtering method provided by the present invention.
  • the content filtering device specifically includes a content obtaining module 610, a content filtering module 620, and a policy implementation module 630.
  • the content obtaining module 610 is configured to obtain the content to be filtered.
  • the content filtering module 620 specifically includes: a keyword extracting unit 621, a group compiling unit 622, a rule condition compiling unit 623, a group matching unit 624, and a rule condition matching unit 625.
  • the keyword extracting unit 621 is configured to extract a keyword from each of the one or more input rule conditions; the group compiling unit 622 is configured to divide the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keyword, and to pre-compile the group matching data set for the extracted keywords; the rule condition compiling unit 623 is configured to pre-compile an exact matching data set for the rule conditions of the group corresponding to each extracted keyword; the group matching unit 624 is configured to perform keyword matching on the content to be filtered by using the group matching data set to obtain the matched keyword; the rule condition matching unit 625 is configured to perform exact matching of the rule conditions on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword.
  • the policy implementation module 630 is configured to perform the filtering policy corresponding to the matching result of the exact matching.
  • the above technical solution pre-filters the content to be filtered by keyword grouping and then performs exact matching, which effectively balances memory usage against matching performance and accuracy and provides an optimized content filtering scheme.
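  • The two-stage flow (keyword pre-filtering, then exact matching only within the matched group) might look like the sketch below. It is a simplified illustration: the class name and sample rules are hypothetical, a plain substring test stands in for the compiled group matching data set, and Python's `re` module stands in for the per-group exact matching data sets.

```python
import re


class ContentFilter:
    """Toy two-stage matcher: keyword pre-filter, then per-group exact matching."""

    def __init__(self, keyword_by_rule):
        # keyword_by_rule maps a rule condition (written as a regex) to its keyword.
        self.groups = {}
        for pattern, keyword in keyword_by_rule.items():
            self.groups.setdefault(keyword, []).append(re.compile(pattern))

    def match(self, content):
        matched = []
        for keyword, rules in self.groups.items():
            if keyword not in content:       # stage 1: group (keyword) matching
                continue
            for rule in rules:               # stage 2: exact matching, this group only
                if rule.search(content):
                    matched.append(rule.pattern)
        return matched


# Hypothetical rule conditions and their extracted keywords.
content_filter = ContentFilter({
    r"www\.huawei\.com": "huaw",
    r"^news\.huawei\.com$": "huaw",
    r"www\.yahoo\.com/news": "yaho",
})
result = content_filter.match("www.huawei.com/news")
```

  • Only the rule conditions in the "huaw" group are evaluated for this input; the "yaho" group is skipped by the pre-filter.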
  • the content filtering module 620 may further include a filtering rule compiling unit 626.
  • the policy implementation module 630 includes a filtering rule matching unit 631 and a policy implementation unit 632.
  • the filtering rule compiling unit 626 is configured to allocate a unique condition identifier to each of the one or more rule conditions and to pre-compile the filter matching data set for the filtering rules, where a filtering rule is a combination of one or more rule conditions and the condition identifiers of those rule conditions are used as characters to express the filtering rule;
  • the filtering rule matching unit 631 is configured to use the filter matching data set to match the filtering rules against the condition identifiers, taken as characters, of the rule conditions that the content to be filtered exactly matched, where those rule conditions are obtained by performing exact matching of the rule conditions on the content to be filtered;
  • the policy implementation unit 632 is configured to perform the filtering policy corresponding to the matching result of the filtering rule.
  • by using condition identifiers to represent the rule conditions and further compiling the filtering rules in the form of regular expressions, filter matching can be performed to obtain a matching result.
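  • The idea of expressing filtering rules over condition identifiers can be sketched as follows. Everything concrete here is an assumption for illustration: the condition names, the one-character identifiers, the two filtering rules, the policies, and the choice to sort the identifiers into a canonical string before matching.

```python
import re

# Hypothetical condition identifiers, one character per rule condition.
CONDITION_IDS = {
    "url_contains_huawei": "a",
    "method_is_post": "b",
    "host_is_sina": "c",
}

# Hypothetical filtering rules: a regular expression over identifier characters
# paired with the policy to execute when it matches.
FILTER_RULES = [
    (re.compile(r"ab"), "block"),    # conditions a and b both matched
    (re.compile(r"a|c"), "allow"),   # exactly one of a or c matched
]


def apply_filter_rules(matched_conditions, default="default-policy"):
    """Match the filtering rules against the identifiers of the matched conditions."""
    # Assumption: identifiers are sorted so the same set always yields the same string.
    ids = "".join(sorted(CONDITION_IDS[c] for c in matched_conditions))
    for pattern, policy in FILTER_RULES:
        if pattern.fullmatch(ids):
            return policy
    return default
```

  • With these sample rules, content that matched both "url_contains_huawei" and "method_is_post" yields the identifier string "ab" and the "block" policy.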
  • the rule condition compiling unit 623 is further configured to: when it is recognized that no keyword can be extracted from an input rule condition, put the rule condition into a to-be-prompted group, pre-compile the exact matching data set for the rule conditions of the to-be-prompted group, and issue a bad-rule-condition prompt to the user.
  • the rule condition matching unit is further configured to: when the content to be filtered does not match any keyword, use the exact matching data set corresponding to the rule conditions of the to-be-prompted group to perform exact matching of the rule conditions on that content.
  • the above technical solution can ensure an exact match for all the content to be filtered, and can prompt the user to optimize the rule conditions to meet the pre-filtered grouping requirements.
  • FIG. 7 is a schematic structural diagram of a content filtering apparatus according to Embodiment 7 of the present invention.
  • the keyword extracting unit 621 preferably includes: a field dividing subunit 621a and a field filtering subunit 621b.
  • the field dividing subunit 621a is configured to perform field division on the input rule conditions according to a preset division policy; the field filtering subunit 621b is configured to filter the divided fields based on a preset screening policy to obtain the keywords of the rule conditions.
  • the field filtering subunit is specifically configured to: delete, from the divided fields, any field that matches a field in the blacklist; delete, according to the recorded number of false hits of each field, any field whose number of false hits is higher than the false-hit threshold; and, for each rule condition, select from its remaining fields the one whose keyword group contains the fewest rule conditions as the keyword of that rule condition.
  • Other screening policies can be added, such as filtering fields that match the fields in the whitelist as keywords.
  • the content filtering module may further include a statistical update unit, which specifically includes a false-hit counting subunit and a blacklist update subunit.
  • the false-hit counting subunit is configured to update the number of false hits of a keyword when content to be filtered that matched the keyword does not match any corresponding rule condition; the blacklist update subunit is configured to add keywords whose number of false hits is above the set threshold to the blacklist.
  • the keyword extraction policy determines the quality of the extracted keywords, which directly affects the pre-filtering efficiency.
  • the technical solution in this embodiment can dynamically update the data used by the keyword screening policy according to the actual content filtering results, so that the extracted keywords better reflect the needs of content filtering.
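  • A minimal sketch of the statistical update unit is given below, assuming hypothetical class and method names: a false hit is recorded whenever a keyword matched but no rule condition in its group did, and a keyword is blacklisted once its count exceeds the threshold.

```python
from collections import Counter


class KeywordStats:
    """Toy statistical update unit: false-hit counter plus blacklist update."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.false_hits = Counter()
        self.blacklist = set()

    def record(self, keyword, exact_match_found):
        """Call after exact matching for content that matched `keyword`."""
        if exact_match_found:
            return                               # a true hit, nothing to update
        self.false_hits[keyword] += 1            # keyword hit, rule conditions missed
        if self.false_hits[keyword] > self.threshold:
            self.blacklist.add(keyword)          # too unselective to pre-filter with


stats = KeywordStats(threshold=2)
for _ in range(3):
    stats.record("news", exact_match_found=False)
stats.record("huaw", exact_match_found=True)
```

  • The blacklist and counters collected this way would feed the next periodic run of keyword extraction.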
  • the rule condition compiling unit specifically includes:
  • a first compiling subunit, configured to, for a group whose number of rule conditions is less than a pre-configured threshold, pre-compile the exact matching data set for the group's rule conditions using an NFA, DFA, or compressed DFA regular expression matching algorithm, or using a single-pattern string matching algorithm;
  • a second compiling subunit, configured to, for a group whose number of rule conditions is equal to or greater than the pre-configured threshold, pre-compile the exact matching data set for the group's rule conditions using a DFA or compressed DFA regular expression matching algorithm;
  • a third compiling subunit, configured to, for a group that contains rule conditions with set complex definition parameters, pre-compile the exact matching data set for the group's rule conditions using an NFA or compressed DFA regular expression matching algorithm.
  • FIG. 8 is a schematic structural diagram of a content filtering apparatus according to Embodiment 8 of the present invention.
  • the present embodiment is based on the foregoing embodiment.
  • the improvement is that the content obtaining module 610 may specifically include a protocol identifying unit 611 and a protocol parsing unit 612.
  • the protocol identification unit 611 is configured to perform protocol identification on the received data packet by using a deep packet identification technology.
  • the protocol parsing unit 612 is configured to perform field parsing on the data packet to obtain at least one preset field, and to use each preset field separately as content to be filtered, so that the subsequent group matching, exact matching, and filter matching operations are performed on each field, wherein the filtering rule is a combination of one or more rule conditions that correspond to one or more preset fields.
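  • As an illustration of field parsing only (protocol identification itself is not shown), the sketch below splits an HTTP request that has already been identified into a few preset fields. The function name and the particular choice of fields are assumptions for the example.

```python
def parse_http_request(raw):
    """Parse a raw HTTP request into preset fields, each filtered separately later."""
    head = raw.split("\r\n\r\n", 1)[0]           # drop any message body
    lines = head.split("\r\n")
    method, target, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "method": method,
        "url": headers.get("host", "") + target,  # host plus request target
        "user-agent": headers.get("user-agent", ""),
    }


fields = parse_http_request(
    "GET /news HTTP/1.1\r\nHost: www.huawei.com\r\nUser-Agent: demo\r\n\r\n"
)
```

  • Each value in the returned dictionary would then go through group matching and exact matching against the rule conditions defined for that field.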
  • the content filtering apparatus provided by the embodiment of the present invention may perform the content filtering method provided by any embodiment of the present invention, and has a corresponding functional module structure.
  • the ninth embodiment of the present invention will describe the details of the content filtering method in detail by way of a preferred example.
  • the content filtering method provided by the embodiment of the present invention is performed based on a text application layer protocol, and the rule condition may be any field in the protocol, such as: a URL address, a request method, a certain header field, and the like.
  • This embodiment uses the URL address field as an example for description.
  • those skilled in the art can understand that the pre-compiled data set and the matching filtering method of other fields can be completed by the same scheme.
  • FIG. 9 is a schematic diagram of a network architecture applicable to Embodiment 9 of the present invention, where the network includes local area network (LAN) network elements, wide area network (WAN) network elements, routers, switches, and so on.
  • the user terminal is connected to the WAN via a LAN via a switch and a router.
  • An application control node is deployed between the LAN and the WAN to implement content filtering. It should be understood that the application control node has the function of the content filtering device in the embodiment of the present invention.
  • the application control node herein may be a network element such as an enterprise router, a Gateway GPRS Support Node (GGSN) network element device, an Internet gateway device, or a wireless controller device.
  • for the structure of the content filtering device, refer to Embodiment 7 or 8; the device specifically performs the content filtering method provided by the embodiments of the present invention.
  • the method mainly includes a pre-compilation process and a filtering process.
  • FIG. 10 is a schematic diagram of a process for extracting keywords in a content filtering method according to Embodiment 9 of the present invention.
  • the first step divides (parses) the rule condition into fields;
  • the second step filters the divided fields against the blacklist;
  • the third step filters the remaining fields according to the number of false hits;
  • the fourth step selects the keyword according to the selection strategy of the fewest rule conditions.
  • msdn is finally selected as the keyword of the rule condition.
  • FIG. 11 is a schematic diagram of a filtering process performed in a content filtering method according to Embodiment 9 of the present invention, and FIG. 11 illustrates a rule condition pre-compilation phase and a rule condition matching filtering phase.
  • rule conditions entered are as follows:
  • a keyword is selected for each rule condition and, as shown in FIG. 11, the group matching data set is compiled as an AC state machine.
  • in the keyword grouping shown in FIG. 11, the first and second rule conditions form one group, the other rule conditions are grouped by their keywords, and the 6th and 8th rule conditions, from which no keyword can be extracted, are placed in the bad-rule-condition group.
  • a suitable algorithm is then used to pre-compile the exact matching data set for each group.
  • the content to be filtered is obtained and sent to the content filtering module, in which the matching data sets have been pre-configured; these data sets are produced by the compiling process and retained in memory.
  • the content to be filtered is the website address www.huawei.com/news
  • the content filtering module first uses the group matching data set to perform keyword matching; for example, multi-pattern matching is performed on the content to be filtered in the AC state machine, so that the group matching data set serves as a pre-filter, and the matched keyword is huaw.
  • the exact matching data set of the group corresponding to that keyword is then used to check whether a rule condition matches, and in this example the match succeeds.
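  • The multi-pattern keyword matching step can be illustrated with a compact Aho-Corasick (AC) automaton. This is a textbook-style sketch rather than the device's actual data structure; the function names and the keyword list are assumptions for the example.

```python
from collections import deque


def build_ac(keywords):
    """Build the goto, failure, and output tables of an Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [set()]
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(kw)
    queue = deque(goto[0].values())              # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:       # follow failure links upward
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]           # inherit keywords ending at the suffix
    return goto, fail, out


def find_keywords(automaton, text):
    """Scan text once and return every keyword that occurs in it."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


# Hypothetical keyword set; the content is the example address from the text.
automaton = build_ac(["huaw", "goog", "sina", "yaho", "msdn"])
matched = find_keywords(automaton, "www.huawei.com/news")
```

  • The text is scanned once regardless of how many keywords there are, which is why an AC state machine suits the pre-filtering stage.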
  • the condition identifier of the matched rule condition can then be used as a character and matched against the filter matching data set.
  • the matching results include matching success and matching failure, and the data packet is processed according to the default release policy configured for the entire device. For example, there can be two types of rule, whitelist (release on a successful match) and blacklist (filter on a successful match), which determine whether the packet is sent to the policy implementation module for further processing.
  • the content filtering solution provided by the embodiments of the present invention has many advantages, and can balance the problems of memory usage and matching performance.
  • the solution supports complex rule conditions, such as regular expressions, and supports multi-dimensional content filtering matching, not just URL addresses, but also any configurable header field content filtering.
  • Matching performance is improved by pre-filtering and by dynamically collecting falsely hit keywords. Keywords that hurt performance are collected dynamically and added to the blacklist, and the content filtering rule base is adjusted periodically, that is, the keyword extraction, grouping, and pre-compilation process is repeated periodically, so that performance adapts to the target operating environment and reaches the best balance.
  • the embodiment of the present invention further provides a computer system. As shown in FIG. 13, the computer system includes at least one processor 131 and a memory 132; the memory 132 is used to store instructions; the processor 131 is coupled to the memory 132, and the processor 131 is configured to execute the instructions stored in the memory 132 to perform the content filtering method provided by any of the embodiments of the present invention.
  • the processor 131 can be configured to execute the instructions stored in the memory 132 to perform the following process:
  • exact matching of the rule conditions is performed on the content to be filtered
  • the processor 131 can be configured to execute the instructions stored in the memory 132 and further perform the following process:
  • Performing a filtering policy corresponding to the matching result according to the matching result of the exact matching includes:
  • the condition identifier of the rule condition that the content to be filtered exactly matched is used as a character, and the filtering rule is matched against that character, where the rule condition that the content exactly matched is obtained by performing exact matching of the rule conditions on the content to be filtered;
  • and a filtering policy corresponding to the matching result is performed according to the matching result of the filtering rule.
  • the processor 131 can be configured to execute the instructions stored in the memory 132 and further perform the following process:
  • a keyword is extracted from the newly added rule condition; the corresponding group is searched for or created according to that keyword, and the group matching data set is recompiled;
  • the processor 131 can be configured to execute the instructions stored in the memory 132 and further perform the following process:
  • according to an input rule condition deletion instruction, the rule condition to be deleted, or the condition identifier corresponding to it, is determined, and the keyword is extracted from the rule condition to be deleted;
  • if the condition identifier corresponding to the rule condition to be deleted is to be deleted, the filter matching data set is recompiled to delete that condition identifier.
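  • Adding and deleting rule conditions at run time could be organized roughly as below. This is a sketch under several assumptions: the class and method names are hypothetical, a sorted tuple of keywords stands in for the recompiled group matching data set, and removing an emptied group on deletion is an inferred detail that the text does not spell out.

```python
class RuleBase:
    """Toy rule base that regroups and recompiles when rule conditions change."""

    def __init__(self, extract_keyword):
        self.extract_keyword = extract_keyword   # callable: rule condition -> keyword
        self.groups = {}                         # keyword -> set of rule conditions
        self.group_matcher = ()                  # stand-in for the compiled data set

    def _recompile(self):
        # Stand-in for recompiling, e.g. an AC state machine over all keywords.
        self.group_matcher = tuple(sorted(self.groups))

    def add(self, rule_condition):
        keyword = self.extract_keyword(rule_condition)
        self.groups.setdefault(keyword, set()).add(rule_condition)  # find or create group
        self._recompile()

    def delete(self, rule_condition):
        keyword = self.extract_keyword(rule_condition)
        group = self.groups.get(keyword, set())
        group.discard(rule_condition)
        if not group:                            # assumption: drop groups left empty
            self.groups.pop(keyword, None)
        self._recompile()


# Hypothetical keyword extractor: first 4 characters of the second dotted field.
rule_base = RuleBase(lambda rc: rc.split(".")[1][:4])
rule_base.add("www.huawei.com")
rule_base.add("www.sina.com")
after_add = rule_base.group_matcher
rule_base.delete("www.sina.com")
after_delete = rule_base.group_matcher
```

  • Recompiling only the affected group's exact matching data set, rather than everything, would be a natural refinement of this sketch.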
  • the processor 131 can be configured to execute the instructions stored in the memory 132, and extracting the keywords from the input one or more rule conditions specifically includes the following process:
  • field division is performed on the input rule conditions according to the preset division policy
  • the divided fields are filtered based on a preset screening policy to obtain keywords of the rule conditions.
  • the processor 131 can be configured to execute the instructions stored in the memory 132, and filtering the divided fields based on the preset screening policy to obtain the keyword of the rule condition specifically includes the following process:
  • for each rule condition, the field whose keyword group contains the fewest rule conditions is selected, from the keywords of the rule condition, as the keyword of that rule condition.
  • the processor 131 can be configured to execute the instructions stored in the memory 132 and, after performing exact matching of the rule conditions on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword, further perform the following process:
  • the processor 131 may be configured to execute the instructions stored in the memory 132, and pre-compiling the exact matching data set for the rule conditions of the group corresponding to each extracted keyword specifically includes the following process:
  • the exact matching data set is pre-compiled for the group's rule conditions using a deterministic finite automaton (DFA) or compressed DFA regular expression matching algorithm;
  • the exact matching data set is pre-compiled using a non-deterministic finite automaton (NFA) or compressed DFA regular expression matching algorithm.
  • the processor 131 is configured to execute the instructions stored in the memory 132, and obtaining the content to be filtered specifically includes the following process: performing protocol identification on the received data packet by using a deep packet identification technology;
  • the filtering rule is a combination of one or more rule conditions, and those rule conditions correspond to one or more preset fields.
  • the processor 131 can be configured to execute the instructions stored in the memory 132 and further perform the following process:
  • when it is recognized that no keyword can be extracted from an input rule condition, the rule condition is put into the to-be-prompted group, the exact matching data set is pre-compiled for the rule conditions of the to-be-prompted group, and a bad-rule-condition prompt is issued to the user.
  • the processor 131 can be configured to execute the instructions stored in the memory 132 and, after performing keyword matching on the content to be filtered by using the group matching data set, further perform the following process:
  • when the content to be filtered does not match any keyword, the exact matching data set corresponding to the rule conditions of the to-be-prompted group is used to perform exact matching of the rule conditions on that content.
  • the processor 131 can be configured to execute the instructions stored in the memory 132, and extracting the keywords from the input one or more rule conditions specifically includes the following process:
  • the keywords are extracted, at a set period, from the one or more rule conditions that have been input.
  • the embodiment of the present invention further provides a computer system.
  • the computer system includes: a processor 141, a memory 142, and a matching filter 143.
  • the memory 142 is used to store instructions; the matching filter 143 is configured to hold each data set, such as the group matching data set, the exact matching data sets, and the filter matching data set; the processor 141 is coupled to the memory 142 and the matching filter 143.
  • the processor 141 is configured to execute the instructions stored in the memory 142 to perform the pre-compilation process in the content filtering method provided by the embodiment of the present invention, and the processor 141 is further configured to invoke the matching filter 143 to perform the content filtering process in the content filtering method provided by the embodiment of the present invention.
  • the matching filter can be implemented by hardware, or a combination of hardware and software.
  • it can be a Field Programmable Gate Array (FPGA).
  • the memory of the FPGA chip or an external memory stores the various data sets, such as the group matching data set, the exact matching data set of each group, and the filter matching data set, and the matching logic of each matching unit is also implemented by the FPGA chip.
  • the FPGA chip uses the various data sets to perform content matching on the application protocol data and outputs the keyword matching result to the exact matching data set, or outputs the exact matching result to the corresponding filtering policy.
  • the protocol identification and field parsing operations before content filtering can be implemented by the FPGA.
  • the computer system provided by the foregoing embodiment of the present invention can be configured as various network elements that apply content filtering technologies, such as an enterprise router, a Gateway GPRS Support Node (GGSN) network element device, an Internet gateway device, or a wireless controller device.
  • the processor can be configured to execute the instructions in the memory to:
  • the processor may be further configured to invoke the matching filter to perform the following operations: acquiring the content to be filtered;
  • performing exact matching of the rule conditions on the content to be filtered
  • the processor is further configured to execute instructions in the memory to implement the following operation:
  • the processor may be further configured to invoke the matching filter so that performing the filtering policy corresponding to the matching result of the exact matching includes: using the filter matching data set, taking as characters the condition identifiers of the rule conditions that the content to be filtered exactly matched, and matching the filtering rules against those characters, where the rule conditions that the content exactly matched are obtained by performing exact matching of the rule conditions on the content to be filtered;
  • the processor can be further configured to execute instructions in the memory and also perform the following operations:
  • a keyword is extracted from the newly added rule condition; the corresponding group is searched for or created according to that keyword, and the group matching data set is recompiled;
  • the processor can be further configured to execute instructions in the memory and also perform the following operations:
  • according to an input rule condition deletion instruction, the rule condition to be deleted, or the condition identifier corresponding to it, is determined, and the keyword is extracted from the rule condition to be deleted;
  • if the condition identifier corresponding to the rule condition to be deleted is to be deleted, the filter matching data set is recompiled to delete that condition identifier.
  • the processor can be further configured to execute instructions in the memory and also perform the following operations:
  • the filter matching data set is recompiled according to the newly added filtering rule or filtering rule deletion instruction to add or delete a filtering rule.
  • the processor can be configured to execute instructions in the memory so that extracting the keywords from the input one or more rule conditions includes:
  • field division is performed on the input rule conditions according to the preset division policy
  • the divided fields are filtered based on a preset screening policy to obtain keywords of the rule conditions.
  • filtering the divided fields based on the preset screening policy to obtain the keywords of the rule conditions includes:
  • for each rule condition, the field whose keyword group contains the fewest rule conditions is selected, from the keywords of the rule condition, as the keyword of that rule condition.
  • the processor is configured to execute the instructions in the memory so that, after exact matching of the rule conditions is performed on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword, the method further includes:
  • the processor is configured to execute the instructions in the memory so that pre-compiling the exact matching data set for the rule conditions of the group corresponding to each extracted keyword includes:
  • the exact matching data set is pre-compiled for the group's rule conditions using a DFA or compressed DFA regular expression matching algorithm
  • the exact matching data set is pre-compiled for the group's rule conditions using an NFA or compressed DFA regular expression matching algorithm.
  • the processor can be further configured to execute an instruction in the memory or to call a matching filter to:
  • obtaining the content to be filtered includes:
  • the filtering rule is a combination of one or more rule conditions, and those rule conditions correspond to one or more preset fields.
  • the processor is further configurable to execute instructions in the memory to:
  • when it is recognized that no keyword can be extracted from an input rule condition, the rule condition is put into the to-be-prompted group, the exact matching data set is pre-compiled for the rule conditions of the to-be-prompted group, and a bad-rule-condition prompt is issued to the user.
  • the processor is further configured to invoke the matching filter to perform the following operation after keyword matching is performed on the content to be filtered by using the group matching data set: when the content to be filtered does not match any keyword, the exact matching data set corresponding to the rule conditions of the to-be-prompted group is used to perform exact matching of the rule conditions on that content.
  • the processor is configured to execute instructions in the memory so that extracting the keywords from the input one or more rule conditions includes: extracting the keywords, at a set period, from the one or more rule conditions that have been input.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the present invention provide a content filtration method and device. The method comprises: respectively extracting a keyword from entered rule conditions; dividing the rule conditions into one or more groups according to the extracted keyword, and pre-compiling a group matching dataset for the extracted keyword; respectively pre-compiling a precise matching dataset for the rule conditions of the groups corresponding to the extracted keyword; obtaining to-be-filtered content; using the group matching dataset to perform keyword matching on the to-be-filtered content; using the precise matching dataset of the rule conditions of the groups corresponding to the matched keyword to perform precise matching of the rule conditions on the to-be-filtered content; and executing a corresponding filtration policy according to a matching result of the precise matching. The present invention performs group pre-filtration on the rule conditions, so the number of rule conditions in each group is small and the occupied memory is reduced; in addition, the precise matching performed on the rule conditions after the group pre-filtration has higher matching accuracy.

Description

内容过滤方法和装置  Content filtering method and device
技术领域 Technical field
本发明实施例涉及数据处理技术, 尤其涉及一种内容过滤方法和装 置。 背景技术  Embodiments of the present invention relate to data processing technologies, and in particular, to a content filtering method and apparatus. Background technique
As the world's largest information hub, the Internet is growing at an astonishing rate, but the quality of its information is uneven, and there are many undesirable websites and resources. There are also suspicious websites that contain malware, which can threaten users' privacy or even damage their computers.
To guard against harmful information, the prior art filters web pages with content filtering technology based on application-layer protocols. For example, an enterprise network gateway can be configured with filtering policies that filter web pages of certain content types, thereby restricting prohibited behavior by users inside the enterprise network, such as visiting undesirable websites or watching online movies.
The prior art typically classifies a target website by the target Uniform Resource Locator (URL) in a Hypertext Transfer Protocol (HTTP) request message. If the web page belongs to a category that should be filtered, such as pornography or violence, the HTTP request is redirected to a notice page, or the network connection is dropped directly.
In existing content filtering technology, the user generally presets rule conditions and filter conditions. A precompiled filter matches the URL of the requested web page against the rule conditions, and a URL that matches a rule condition is then blocked, passed, or otherwise handled according to the filter conditions. A rule condition can be a single string-matching condition such as "if URL contains sina" or "if URL equals www.abc.com". The rule conditions can be compiled into a DFA graph using a Deterministic Finite-State Automaton (DFA) algorithm, and each web page address is matched exactly against the DFA graph to determine whether it satisfies a rule condition. A filter condition can be, for example, "when 'if URL contains sina' is satisfied, apply the policy of passing the web page", or "when 'if URL equals www.abc.com' is satisfied, apply the policy of blocking or redirecting the web page". A web page address that matches a rule condition therefore has to be matched further against the filter conditions to determine which handling policy to apply.
However, this prior-art content filtering technique has significant drawbacks. Rule conditions are matched against URLs using a DFA graph, so when there are too many rule conditions, or when complex rule conditions must be supported, for example regular expressions containing wildcards such as ".*/abc.*/news" or ".*\.www\.domain.*\.com", a large amount of memory is consumed. This is the main drawback of the DFA algorithm. The prior art can replace the standard DFA with a compressed DFA such as the D2FA (Delayed DFA) algorithm, but matching performance then suffers, because D2FA is several times slower than a standard DFA.
How to balance memory usage against matching performance in content filtering is therefore a technical problem that the prior art needs to solve.
Summary of the invention
Embodiments of the present invention provide a content filtering method and apparatus that reduce the memory used by content filtering while achieving good matching results.
An embodiment of the present invention provides a content filtering method, including:
extracting a keyword from each of one or more input rule conditions;
dividing the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keyword, and precompiling a group matching data set for the extracted keywords;
precompiling, for each of the extracted keywords, an exact matching data set for the rule conditions of the corresponding group;
obtaining content to be filtered;
performing keyword matching on the content to be filtered by using the group matching data set, to obtain a matched keyword;
performing exact matching of rule conditions on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword; and
executing, according to the result of the exact matching, the filtering policy corresponding to that result.
An embodiment of the present invention further provides a content filtering apparatus, including a content obtaining module, a content filtering module, and a policy enforcement module, where:
the content obtaining module is configured to obtain content to be filtered;
the content filtering module includes:
a keyword extraction unit, configured to extract a keyword from each of one or more input rule conditions;
a group compilation unit, configured to divide the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keyword, and to precompile a group matching data set for the extracted keywords;
a rule condition compilation unit, configured to precompile, for each of the extracted keywords, an exact matching data set for the rule conditions of the corresponding group;
a group matching unit, configured to perform keyword matching on the content to be filtered by using the group matching data set, to obtain a matched keyword; and
a rule condition matching unit, configured to perform exact matching of rule conditions on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword; and
the policy enforcement module is configured to execute, according to the result of the exact matching, the filtering policy corresponding to that result.
In the content filtering method and apparatus provided by the embodiments of the present invention, the rule conditions are pre-filtered by keyword-based grouping, so each group contains few rule conditions, and the total memory occupied by the exact matching data sets built for the groups is less than the memory occupied by a single data set precompiled from all rule conditions. The exact matching against rule conditions that follows the group pre-filtering ensures an exact comparison between the content to be filtered and the rule conditions, giving high matching accuracy. The technical solution of the embodiments therefore optimizes matching performance while using less memory, and yields accurate matching results.
Brief description of the drawings
FIG. 1 is a flowchart of a content filtering method according to Embodiment 1 of the present invention;
FIG. 2 is a flowchart of a content filtering method according to Embodiment 3 of the present invention;
FIG. 3 is a flowchart of a content filtering method according to Embodiment 4 of the present invention;
FIG. 4 is a flowchart of a content filtering method according to Embodiment 5 of the present invention;
FIG. 5 is a flowchart of an example to which Embodiment 5 of the present invention applies;
FIG. 6 is a schematic structural diagram of a content filtering apparatus according to Embodiment 6 of the present invention;
FIG. 7 is a schematic structural diagram of a content filtering apparatus according to Embodiment 7 of the present invention;
FIG. 8 is a schematic structural diagram of a content filtering apparatus according to Embodiment 8 of the present invention;
FIG. 9 is a schematic diagram of a network architecture to which Embodiment 9 of the present invention applies;
FIG. 10 is a schematic diagram of the keyword extraction process in a content filtering method according to Embodiment 9 of the present invention;
FIG. 11 is a schematic diagram of the filtering flow in a content filtering method according to Embodiment 9 of the present invention;
FIG. 12 is a schematic diagram of the correspondence between groups and algorithms in a content filtering method according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a computer system according to an embodiment of the present invention;
FIG. 14 is a schematic structural diagram of a computer system according to another embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
FIG. 1 is a flowchart of a content filtering method according to Embodiment 1 of the present invention. The content filtering method of this embodiment is applicable to any scenario in which text content needs to be filtered, and can be implemented in software and/or hardware. For web content filtering based on a text application-layer protocol, for example, it can be implemented by software integrated into a gateway.
The content filtering method mainly comprises a precompilation flow for the rule conditions and a filtering flow for the content to be filtered, and includes the following steps:
Step 110: Extract a keyword from each of one or more input rule conditions.
Step 120: Divide the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keyword, and precompile a group matching data set for the extracted keywords.
Step 130: For each of the extracted keywords, precompile an exact matching data set for the rule conditions of the corresponding group.
Steps 110 to 130 form the precompilation flow, which compiles the rule conditions input by the user so that content to be filtered can be matched quickly when the filtering flow runs.
Step 140: Obtain content to be filtered.
Step 150: Perform keyword matching on the content to be filtered by using the group matching data set, to obtain a matched keyword.
Step 160: Perform exact matching of rule conditions on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword.
Step 170: Execute, according to the result of the exact matching, the filtering policy corresponding to that result.
Steps 140 to 170 form the content filtering flow, which matches the content to be filtered against the matching data sets built by the precompilation flow.
In content filtering technology, the matching data sets built for the rule conditions and filter rules can be called the content filtering rule base. Rule conditions and filter rules are generally configured dynamically by users such as administrators, rather than being updated manually or remotely at regular intervals by the equipment vendor. How to automatically build an efficient content filtering rule base from the rule conditions and filter rules entered by the user is therefore a key issue in implementing a content filtering method.
When content filtering is deployed, the user usually enters multiple rule conditions, which can be expressed as regular expressions. A rule condition generally describes the content to be matched in some field of a text application protocol. If several fields need to be matched in the filtering flow, for example the URL, the Content-Type header field, and the User-Agent header field, the precompilation flow can be run separately for the rule conditions of each field. This embodiment describes the precompilation flow for a single field; for rule conditions covering several fields, the technical solution of this embodiment is simply repeated for each field.
In the precompilation flow of this embodiment, keywords are extracted from the rule conditions according to a preset policy. A keyword is a string that represents the core content of a rule condition in as few characters as possible. The preset policy for extracting such keywords can be implemented in several ways, which are described in later embodiments. Because the extracted keyword reflects the core content of a rule condition, grouping rule conditions by keyword, that is, placing rule conditions with the same keyword in one group, puts rule conditions with similar content in the same group. "The same keyword" is not strictly limited to identical text; related keywords can also be treated as the same keyword under a preset policy. Then, on the one hand, one group matching data set is precompiled for all keywords; on the other hand, one exact matching data set is precompiled for each group of rule conditions. A data set here means data precompiled according to some content matching algorithm so that string comparison can be completed quickly during matching. For example, a plain string matching algorithm, a Nondeterministic Finite-State Automaton (NFA) matching algorithm, or a DFA matching algorithm can each serve as the basis of a matching data set.
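The precompilation flow described above can be sketched roughly as follows. The extraction policy used here (take the longest run of literal letters or digits in the regular expression) is purely an assumption for illustration, since the text leaves the actual preset policies to later embodiments. The names `extract_keyword` and `precompile`, the sample rules, and the use of Python's `re` module as a stand-in for a DFA or NFA compiler are likewise hypothetical.

```python
import re
from collections import defaultdict

def extract_keyword(rule):
    """Hypothetical policy: longest run of literal letters/digits in the rule condition.

    Simplified: character classes and quantifiers applied to a letter are not handled.
    """
    unescaped = re.sub(r"\\.", " ", rule)            # drop escapes such as "\."
    literals = re.findall(r"[A-Za-z0-9]+", unescaped)
    return max(literals, key=len) if literals else None

def precompile(rules):
    """Group rule conditions by keyword and build both kinds of matching data set."""
    groups = defaultdict(list)
    for rule in rules:
        keyword = extract_keyword(rule)
        if keyword is None:
            # Rules without a keyword are out of scope for this sketch.
            raise ValueError(f"no keyword in rule condition {rule!r}")
        groups[keyword].append(rule)
    # Group matching data set: one pattern over all keywords.
    group_matcher = re.compile("|".join(re.escape(k) for k in groups))
    # Exact matching data sets: one list of compiled rule conditions per group.
    exact = {k: [re.compile(r) for r in rs] for k, rs in groups.items()}
    return group_matcher, exact

rules = [r"www\.porn.*\.com", r".*/porn/.*", r".*/abc.*/news"]
group_matcher, exact = precompile(rules)
```

In this sketch the first two sample rules share the keyword "porn" and land in one group, while the third forms its own group under "news".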
Both the group matching data set and the exact matching data sets preferably use matching algorithms that can match strings exactly. The choice can balance performance against memory usage according to the memory budget: in general, a faster algorithm consumes more memory, and vice versa. Most network data only passes through the group matching algorithm, and only the small portion that matches a group goes on to exact matching. The group matching algorithm for keywords can therefore lean toward performance, so that keywords are matched quickly, while the exact matching algorithm for rule conditions can lean toward low memory usage, so that a large increase in the number of rule conditions does not consume too much memory.
With the group matching data set and exact matching data sets built by the precompilation flow, the filtering flow first matches the content to be filtered against the group matching data set to identify whether the content contains a keyword and, if so, which one. When a keyword is matched, the content is matched exactly against the rule conditions using the exact matching data set of the group corresponding to that keyword. The result is either a match or no match with a rule condition, and it serves as the basis for subsequently identifying the filter rule or executing the corresponding handling policy. When the content contains no keyword, it clearly cannot match any rule condition, so exact matching can be skipped; this result can also serve as the basis for executing the subsequent filtering policy.
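A minimal sketch of this two-stage filtering flow, assuming the data sets have already been precompiled, might look like the following. The keywords, rule IDs (`r1` to `r3`), and patterns are illustrative, and Python regular expressions stand in for whatever matching algorithms an implementation would actually choose.

```python
import re

# Hypothetical precompiled data; keywords, rule IDs and patterns are illustrative.
GROUP_MATCHER = re.compile("sina|abc")             # group matching data set
EXACT = {                                          # one exact matching data set per group
    "sina": [("r1", re.compile(r".*\.sina\.com.*"))],
    "abc": [("r2", re.compile(r"www\.abc\.com")),
            ("r3", re.compile(r".*/abc.*/news"))],
}

def match_rule_conditions(content):
    """Return the IDs of the rule conditions that the content matches exactly."""
    keywords = set(GROUP_MATCHER.findall(content))  # stage 1: keyword pre-filter
    if not keywords:
        return []                                   # no keyword: skip exact matching
    hits = []
    for kw in sorted(keywords):                     # stage 2: only the matched groups
        hits += [rid for rid, pat in EXACT[kw] if pat.fullmatch(content)]
    return hits
```

Content that contains no keyword never reaches the per-group patterns, which is where the scheme saves work on the bulk of the traffic.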
In the technical solution of this embodiment, the rule conditions are pre-filtered by keyword-based grouping, so each group contains few rule conditions, and the total memory occupied by the exact matching data sets is less than the memory occupied by a single data set compiled from all rule conditions. The exact matching against rule conditions that follows the group pre-filtering ensures an exact comparison between the content to be filtered and the rule conditions, giving high matching accuracy. The technical solution of this embodiment therefore optimizes matching performance while using less memory, and yields accurate matching results.
On the basis of the foregoing embodiment, the keyword extraction in step 110 may fail to extract a keyword under the preset policy. In that case the rule condition from which no keyword can be extracted can be discarded, but preferably the following is done:
When it is recognized that no keyword can be extracted from an input rule condition, the rule condition is placed in a to-be-prompted group, an exact matching data set is precompiled for the rule conditions of the to-be-prompted group, and a poor-rule-condition prompt is issued to the user.
Correspondingly, in the filtering flow, after keyword matching is performed on the content to be filtered using the group matching data set, the method further includes: when the content to be filtered matches no keyword, performing exact matching of rule conditions on that content using the exact matching data set corresponding to the rule conditions of the to-be-prompted group.
When no keyword can be extracted, content to be filtered that is subject to such rule conditions cannot first be grouped by keyword and then matched exactly; it can only be matched exactly in full. Exact matching of all content that contains no keyword further ensures the accuracy of filtering, but it works against reducing memory usage, and exact matching of such rule conditions is usually slower than group matching, so it costs considerable time. In this situation a poor-rule-condition prompt can therefore be issued to the user, explaining that such rule conditions add to the system's time and space burden and should be avoided where possible.
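The handling of rule conditions without a keyword can be sketched as below. This is only an illustration under stated assumptions: the keyword for each rule is supplied by the caller (`None` when extraction failed), the warning text is invented, and `precompile` and `matches` are hypothetical names. Following the text, the to-be-prompted group is consulted only when the content matched no keyword.

```python
import re

def precompile(rules_with_keywords):
    """rules_with_keywords: list of (rule, keyword or None) pairs."""
    exact, fallback, warnings = {}, [], []
    for rule, kw in rules_with_keywords:
        if kw is None:
            # To-be-prompted group: compiled for exact matching, and the user is warned.
            fallback.append(re.compile(rule))
            warnings.append(f"no keyword in rule condition {rule!r}; matching will be slower")
        else:
            exact.setdefault(kw, []).append(re.compile(rule))
    group_matcher = re.compile("|".join(map(re.escape, exact))) if exact else None
    return group_matcher, exact, fallback, warnings

def matches(content, group_matcher, exact, fallback):
    """True if the content matches a rule condition in the selected group(s)."""
    keywords = set(group_matcher.findall(content)) if group_matcher else set()
    # Content without any keyword is checked against the to-be-prompted group only.
    patterns = [p for kw in keywords for p in exact[kw]] if keywords else fallback
    return any(p.fullmatch(content) for p in patterns)
```

Every piece of content that contains no keyword now pays for a pass over the fallback patterns, which is the time cost the prompt to the user is meant to flag.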
In this embodiment, the content to be filtered can be obtained by performing protocol identification on received data packets using Deep Packet Inspection (DPI). In general, the text protocols subject to content filtering include HTTP, the Session Initiation Protocol (SIP), the Real Time Streaming Protocol (RTSP), and similar protocols. Based on the identified protocol, the data packet is parsed into fields to obtain at least one preset field, and each preset field is treated as content to be filtered, so that the subsequent group matching, exact matching, and filter matching operations are performed on each field separately. A filter rule is a combination of one or more rule conditions, and these rule conditions correspond to one or more preset fields. For example, the preset fields can include the request method, request URL, Content-Type header field, and User-Agent header field of an HTTP message in an HTTP packet.
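As a rough illustration of the field parsing step, the sketch below splits an already reassembled HTTP/1.x request into the preset fields named above. It assumes that protocol identification has already classified the payload as HTTP; the function name `parse_http_fields` and the dictionary keys are hypothetical, and a production parser would need to handle malformed input.

```python
def parse_http_fields(raw):
    """Split a raw HTTP/1.x request into the preset fields used as filter inputs."""
    head = raw.split("\r\n\r\n", 1)[0]               # request line and headers only
    lines = head.split("\r\n")
    method, url, _version = lines[0].split(" ", 2)   # e.g. "GET /path HTTP/1.1"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # Each returned value is treated as a separate piece of content to be filtered.
    return {
        "method": method,
        "url": url,
        "content-type": headers.get("content-type", ""),
        "user-agent": headers.get("user-agent", ""),
    }
```

Each value in the returned dictionary would then go through group matching and exact matching against the rule conditions compiled for that field.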
Embodiment 2
The content filtering method provided by Embodiment 2 of the present invention builds on the foregoing embodiment and further improves the precompilation and filtering of filter rules. In the foregoing embodiment, precompilation and filtering of filter rules can be implemented with various techniques. For example, after a rule condition is matched, its identifier is recorded, the identifiers are then checked against each filter rule in turn to determine which filter rule applies, and the corresponding filtering policy is executed. Alternatively, the filter rules can be organized in a tree structure, and the matched rule conditions are matched within the tree.
This embodiment provides another preferred scheme for matching filter rules. At any point in the precompilation flow, the following step is performed:
A unique condition identifier is assigned to each of the one or more rule conditions, and a filter matching data set is precompiled for the filter rules. A filter rule is a combination of one or more rule conditions and is expressed using the condition identifiers of those rule conditions as characters; that is, the filter rules expressed in character form are precompiled into a filter matching data set, such as a DFA or D2FA state machine.
Then, in the filtering flow, executing the filtering policy corresponding to the result of the exact matching includes:
using the filter matching data set, treating the condition identifiers of the rule conditions that the content to be filtered matched exactly as characters, and matching those characters against the filter rules, where the matched rule conditions are obtained from the exact matching of rule conditions performed on the content to be filtered.
A filter rule usually consists of one or more rule conditions. Only when the content to be filtered satisfies all of these rule conditions is the filter rule considered matched, and the corresponding filtering policy is then executed, for example: redirecting the web page to a notice page informing the user that the request has been blocked; discarding the web page and resetting the client connection; or passing the web page.
In this embodiment the condition identifiers of the rule conditions are used as characters, so a filter rule takes the form of a string of condition identifiers; that is, the filter rule is converted into a regular expression over condition identifiers. Multiple filter rules can thus be precompiled together for multi-pattern matching, and a single match then determines which filter rule the content satisfies, with no need for repeated lookups, which improves filtering performance.
An example follows. Suppose the filter rule is "If domain = "www\.porn.*\.com" and (User-Agent = ".*Chrome" or User-Agent = ".*Firefox") and Content-Type = Any then Redirect." It means that if the adult website "www\.porn.*\.com" is accessed with the "Chrome" or "Firefox" browser, the message is redirected to a page indicating that the request has been filtered. "Content-Type" can be anything and could be omitted here; it is kept only to illustrate the idea.
Suppose the condition identifiers of the rule conditions are as follows:
"www\.porn.*\.com" = \x87
".*Chrome" = \x91
".*Firefox" = \x13
The filter rule can then be converted directly into a regular expression:
[Figure imgf000011_0001: regular expression for the filter rule, not reproduced in this text]
If there are multiple filter rules, they are likewise all compiled together into one filter matching data set, for example a DFA or D2FA state machine. Matching proceeds in the order predefined by the filter rules:
The first content to be filtered is the "Domain" field; the condition identifier of the rule condition it matches is recorded.
The second content to be filtered is the "User-Agent" field; the condition identifier of the rule condition it matches is recorded.
The third content to be filtered is the "Content-Type" field; the condition identifier of the rule condition it matches is recorded. Note that the last character of the regular expression is ".", meaning any character.
The recorded condition identifiers are then matched against the filter rules using the filter matching data set, which determines the filtering policy to execute.
In this way, when multiple filter rules need to be matched, the condition identifiers only have to be matched once, in order, rather than rule by rule, which improves performance significantly. D2FA can also be used instead of DFA to save memory.
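The idea of matching filter rules over condition identifiers can be sketched as follows. Several details are assumptions rather than anything the text confirms: the exact regular expression is not legible in this text, so the form `\x87[\x91\x13].` is a guess consistent with the note that its last character is "."; the placeholder ID `\x00` for a field that matches no rule condition, the second `block` rule, and the default `pass` policy are all hypothetical. Python's `re` module again stands in for a DFA or D2FA state machine.

```python
import re

# (field, rule condition, condition ID); the IDs are those of the example above.
COND_IDS = [
    ("domain", re.compile(r"www\.porn.*\.com"), b"\x87"),
    ("user-agent", re.compile(r".*Chrome"), b"\x91"),
    ("user-agent", re.compile(r".*Firefox"), b"\x13"),
]
NO_MATCH = b"\x00"   # assumption: ID recorded when a field matches no rule condition
FIELD_ORDER = ["domain", "user-agent", "content-type"]

# All filter rules compiled together; each named group is one rule's policy.
# "block" is a hypothetical second rule, added only to show multi-rule compilation.
FILTER_RULES = re.compile(
    rb"(?P<redirect>\x87[\x91\x13].)|(?P<block>\x87\x00.)", re.DOTALL)

def condition_id(field, value):
    """Return the ID of the first rule condition of this field that the value matches."""
    for f, pat, cid in COND_IDS:
        if f == field and pat.match(value):
            return cid
    return NO_MATCH

def decide(fields):
    """Build the ID string in field order and match it once against all filter rules."""
    ids = b"".join(condition_id(f, fields.get(f, "")) for f in FIELD_ORDER)
    m = FILTER_RULES.fullmatch(ids)
    return m.lastgroup if m else "pass"   # assumption: unmatched traffic is passed
```

A single `fullmatch` over the three recorded identifiers selects the policy, so adding more filter rules extends the compiled pattern rather than adding another lookup per rule.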
When there are more than 255 condition identifiers, so that a single character can no longer serve as a condition identifier, all rule conditions can use two-byte condition identifiers. In the example below, the third condition identifier is 525, that is, hexadecimal 0x020d.
"www\.porn.*\.com" = \x87
".*Chrome" = \x91
".*Firefox" = \x02\x0d
The expression for the filter rule then becomes:
"^\x00\x87\x00\x91\x02\x0d.."
实施例三 图 2为本发明实施例三提供的内容过滤方法的流程图。 在上述实施例 中介绍了在初始阶段对用户输入的规则条件和过滤规则进行的预编译处 理, 实际应用中, 用户可以随时新增、 删除和更改规则条件和过滤规则, 更改操作相当于先删除再新增的操作。 本实施例主要优化新增规则奈件的 操作, 则上述内容过滤方法进一步可执行如下操作: Embodiment 3 FIG. 2 is a flowchart of a content filtering method according to Embodiment 3 of the present invention. In the above embodiment, the pre-compilation processing of the rule condition and the filtering rule input by the user is introduced in the initial stage. In the actual application, the user can add, delete, and change the rule condition and the filtering rule at any time, and the change operation is equivalent to deleting first. Additional actions. In this embodiment, the operation of the newly added rule component is optimized, and the content filtering method further performs the following operations:
Step 210. When a newly added rule condition is obtained, extract a keyword from the newly added rule condition;
Step 220. According to the keyword extracted from the newly added rule condition, find or create the corresponding group for the newly added rule condition, and recompile the group matching data set;
Specifically, this step may first check whether the keyword already exists among the existing groups. If it does not, a new group is created for the keyword and the group matching data set is recompiled; if it does, the group matching data set does not need to be recompiled.
Step 230. According to the newly added rule condition, precompile the exact matching data set for the rule conditions of the corresponding group;
This step recompiles differently depending on whether the group already existed or was newly created. Data sets implemented with different algorithms may be compiled in different ways: if the group uses DFA to compile all of its rule conditions into one state machine, the entire DFA state machine must be recompiled; if the group uses one-by-one single-pattern matching, only the newly added rule condition needs to be compiled and appended to the matching chain.
Step 240. Assign a condition identifier to the newly added rule condition, and recompile the filter matching data set.
The technical solution of this embodiment lets a user add new rule conditions flexibly. Adding a rule condition only requires updating the group matching data set, the filter matching data set, and one exact matching data set; if the new rule condition does not produce a new keyword, the group matching data set does not need to be updated. Unlike the prior art, not all precompiled data sets have to be adjusted.
Embodiment 4
FIG. 3 is a flowchart of a content filtering method according to Embodiment 4 of the present invention. Building on the foregoing embodiments, this embodiment further optimizes the operation of deleting a rule condition. The content filtering method further includes the following steps:
Step 310. According to an input rule condition deletion instruction, determine the rule condition to be deleted or the condition identifier corresponding to the rule condition to be deleted, and extract a keyword from the rule condition to be deleted;
Step 320. Update the group matching data set according to the keyword extracted from the rule condition to be deleted;
Step 330. If the rule condition to be deleted needs to be deleted, recompile the exact matching data set for the rule conditions of the group corresponding to the keyword extracted from it, so as to remove the rule condition to be deleted;
Of course, if no rule condition remains in the group corresponding to the keyword, the exact matching data set of that group is deleted, the keyword is deleted as well, and the group matching data set is recompiled;
Step 340. If the condition identifier corresponding to the rule condition to be deleted needs to be deleted, recompile the filter matching data set so as to remove that condition identifier.
Similar to Embodiment 3, this embodiment can delete rule conditions flexibly without adjusting all precompiled data sets.
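The incremental add and delete operations of Embodiments 3 and 4 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the described apparatus: the class name `RuleBase` and its methods are hypothetical, a set of keywords stands in for the group matching data set, a list of compiled Python regular expressions stands in for each exact matching data set, and the filter matching data set (condition identifiers) is omitted. The `recompiled` log exists only to show which data sets each operation touches.

```python
import re


class RuleBase:
    """Hypothetical rule base that recompiles only the affected data sets."""

    def __init__(self):
        self.groups = {}              # keyword -> list of rule condition regex strings
        self.exact = {}               # keyword -> compiled regexes (exact matching data set)
        self.keywords = frozenset()   # stand-in for the group matching data set
        self.recompiled = []          # log of recompilations, for illustration only

    def _compile_group_matcher(self):
        self.keywords = frozenset(self.groups)
        self.recompiled.append("group")

    def _compile_exact(self, keyword):
        self.exact[keyword] = [re.compile(r) for r in self.groups[keyword]]
        self.recompiled.append("exact:" + keyword)

    def add(self, keyword, rule):
        """Steps 220-230: create the group only if the keyword is new."""
        is_new = keyword not in self.groups
        self.groups.setdefault(keyword, []).append(rule)
        if is_new:
            self._compile_group_matcher()
        self._compile_exact(keyword)

    def delete(self, keyword, rule):
        """Steps 320-330: drop the whole group when its last rule condition goes."""
        self.groups[keyword].remove(rule)
        if self.groups[keyword]:
            self._compile_exact(keyword)
        else:
            del self.groups[keyword]
            del self.exact[keyword]
            self._compile_group_matcher()
```

In this sketch, adding a second rule condition under an existing keyword recompiles only that group's exact matching data set, which mirrors the saving the two embodiments describe.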
Adding, deleting, and changing filtering rules is similar to doing so for rule conditions: according to a newly added filtering rule or a filtering rule deletion instruction, the filter matching data set is recompiled to add or delete the filtering rule.
Embodiment 5
FIG. 4 is a flowchart of a content filtering method according to Embodiment 5 of the present invention. The content filtering methods provided by the foregoing embodiments all involve keyword extraction. The quality of keyword extraction directly affects the performance of the subsequent group matching and exact matching, as well as the amount of memory occupied by the content filtering rule base. Extracting keywords from each of the one or more input rule conditions can be implemented in various ways, for example with the following steps:
Step 410. Divide the input rule condition into fields according to a preset division policy;
Step 420. Screen the divided fields based on a preset screening policy to obtain the keyword of the rule condition.
Screening the divided fields based on the preset screening policy to obtain the keyword of the rule condition is preferably performed by the following procedure:
From the divided fields, delete the fields that match fields in the blacklist;
According to the recorded number of false hits per field, delete the fields whose number of false hits exceeds the hit threshold;
For each rule condition, among the candidate keywords of that rule condition, select as its keyword the field whose keyword group contains the fewest rule conditions.
However, a person skilled in the art will understand that the above items may also be performed independently or in another order, and that other screening policies may be added, for example selecting fields that match fields in a whitelist as keywords.
In practice, multiple screening policies can be configured as needed and executed in any order, and the divided fields can be screened in multiple rounds to obtain the fields that express the core content of a rule condition. A person skilled in the art will understand that keyword screening policies are not limited to the items above. The basis for choosing a preferred screening policy is as follows: the more false hits a keyword has, or the higher its false hit rate, the lower the actual matching performance; the more rule conditions a group contains, the more memory is occupied. A keyword extraction policy should therefore balance matching performance against memory usage as far as possible.
Besides static configuration, the blacklist, the whitelist, and the false hit counts can all be updated through dynamic statistics. For example, after the exact matching of rule conditions is performed on the content to be filtered using the exact matching data set of the rule conditions in the group corresponding to the matched keyword, the method further includes:
When content to be filtered that matched a keyword fails to match any corresponding rule condition using the exact matching data set, updating the false hit count record of that keyword;
Adding keywords whose false hit count exceeds a set threshold to the blacklist.
Dynamic statistics gathered from the matching results keep the blacklist, the whitelist, and the false hit counts accurate, which improves the accuracy of the keyword extraction policy and thereby the matching performance of content filtering. Preferably, at a set interval, keyword extraction, grouping, and precompilation are re-executed over the existing rule conditions using the updated false hit counts, blacklist, and so on, so as to optimize the precompiled data sets and obtain better matching performance.
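The dynamic statistics described above can be illustrated with a small sketch. The class name `FalseHitTracker` and its interface are hypothetical, and the threshold value is an arbitrary choice; the sketch only shows counting a false hit when a keyword matched but no rule condition in its group did, and blacklisting the keyword once the count exceeds the threshold.

```python
from collections import Counter


class FalseHitTracker:
    """Hypothetical bookkeeping for keyword false hits and the blacklist."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.false_hits = Counter()   # keyword -> number of false hits
        self.blacklist = set()

    def record(self, keyword, exact_matched):
        """Call after exact matching ran for content that matched `keyword`."""
        if exact_matched:
            return                    # a true hit leaves the statistics unchanged
        self.false_hits[keyword] += 1
        if self.false_hits[keyword] > self.threshold:
            self.blacklist.add(keyword)
```

A periodic job could then re-run keyword extraction over the existing rule conditions with the updated blacklist, as the preceding paragraph suggests.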
The keyword extraction operation is described in detail below by way of example. FIG. 5 is a flowchart of an example to which Embodiment 5 of the present invention applies.
First, a keyword dynamic statistics table is maintained in the system, as shown in Table 1. The false hit counts in the table can be refreshed while the content filtering method is running, for example at a set interval or when a set trigger condition is met.
Table 1
Keyword    False hits    Rule conditions in this keyword's group    Blacklisted
huaw       1             2                                          No
goog       5             1                                          No
sina       2             1                                          No
yaho       1             1                                          No
micr       9             2                                          No
news       0             3                                          No
msdn       1             1                                          No
www                                                                 Yes
com                                                                 Yes
As described above, in the content filtering procedure, when content to be filtered matches a keyword but fails to match any corresponding rule condition using the exact matching data set, the keyword has produced a false hit, and the false hit counter of that keyword is incremented by 1.
The blacklist and the whitelist can be configured statically. Alternatively, keywords whose false hit count exceeds a set threshold are added to the blacklist, or keywords whose false hit count is below a set threshold are added to the whitelist. In practice, either the false hit count or the false hit rate can be used as the criterion. The keyword dynamic maintenance table must be kept current: it is updated in real time as keywords are extracted or deleted and as content filtering is executed.
Step 501. Obtain rule conditions in string form entered online by a device administrator acting as the user;
For example, the following rule conditions are entered. A rule condition may include the wildcard *, character value ranges [x-y], and so on:
1. www.huawei*.com
2. www[0-3].huawei.com
3. *google.com/news
4. www.sina[0-9].com
5. www.yahoo*.com/news
6. *.microsoft.*
7. www.msdn.microsoft*/news
8. www.[a-z][a-z][a-z].com.cn (a bad rule condition)
First, the rule conditions are converted into regular expressions, for example by converting "." into "\." and "*" into ".*".
Step 502. Divide the input rule conditions into fields according to the preset division policy, the purpose being to group the rules by keyword;
For example, fields are divided at preset separators such as ".", "[", "]", or a space, and the number of characters per field can be limited, for example by keeping only strings no longer than a set threshold. If fields are truncated to at most 4 characters, the rule conditions above are divided into the fields www, huaw, com, goog, sina, yaho, micr, msdn, and news.
Step 503. Based on the keyword dynamic maintenance table shown in Table 1, delete the fields that are in the blacklist;
That is, the www and com fields are deleted. Blacklisted fields are usually fields that are too common to serve any filtering purpose;
Step 504. Among the fields remaining after the blacklisted fields are deleted, delete the fields whose recorded number of false hits exceeds the hit threshold;
For example, if the hit threshold is set to 4, the fields remaining after screening are huaw, sina, yaho, msdn, and news;
Step 505. From the screened fields, determine the number of rule conditions corresponding to each field and, for each rule condition, select as its keyword the candidate field whose keyword group contains the fewest rule conditions;
After the screening of step 505, the keywords corresponding to the rule conditions are as follows:
1. huaw
2. huaw
3. news
4. sina
5. yaho, news
6. No keyword
7. msdn, news
8. No keyword
In the screening of step 505, for rule condition 5, of the keyword groups of yaho and news, the yaho group contains 1 rule condition, fewer than the news group, so rule condition 5 selects yaho as its keyword. Similarly, rule condition 7 selects msdn as its keyword. The number of rule conditions per keyword group in Table 1 is updated in real time as the keyword of each rule condition is determined.
If, at the end of any step before the screening of step 505, a rule condition has only one field left, that field can be selected directly as the keyword. A rule condition from which no keyword can be extracted is a bad rule condition, and the user needs to be prompted.
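Steps 502 to 505 of the example can be traced with the following sketch. It is a minimal illustration under several assumptions that the text does not state: fields are split at every non-alphanumeric character, character ranges such as [0-9] are not treated as fields, fields shorter than 3 characters (such as "cn") are ignored, and the false hit counts are taken as fixed values from Table 1. All function and constant names are hypothetical.

```python
import re
from collections import Counter

BLACKLIST = {"www", "com"}                       # Table 1, blacklisted fields
FALSE_HITS = {"huaw": 1, "goog": 5, "sina": 2, "yaho": 1,
              "micr": 9, "news": 0, "msdn": 1}   # Table 1, false hit counts
HIT_THRESHOLD = 4   # step 504
MAX_LEN = 4         # fields are truncated to 4 characters (step 502)
MIN_LEN = 3         # assumption: shorter fields such as "cn" are ignored

RULES = [
    "www.huawei*.com",
    "www[0-3].huawei.com",
    "*google.com/news",
    "www.sina[0-9].com",
    "www.yahoo*.com/news",
    "*.microsoft.*",
    "www.msdn.microsoft*/news",
    "www.[a-z][a-z][a-z].com.cn",
]


def candidate_fields(rule):
    """Steps 502-503: split into fields, truncate, drop blacklisted fields."""
    rule = re.sub(r"\[[^\]]*\]", ".", rule)       # assumption: ranges are not fields
    fields = [f[:MAX_LEN] for f in re.split(r"[^a-z0-9]+", rule.lower())
              if len(f) >= MIN_LEN]
    return [f for f in fields if f not in BLACKLIST]


# Number of rule conditions in which each field is a candidate (third column of Table 1).
GROUP_SIZE = Counter(f for rule in RULES for f in set(candidate_fields(rule)))


def pick_keyword(rule):
    """Steps 504-505: drop fields with too many false hits, then take the smallest group."""
    survivors = [f for f in candidate_fields(rule)
                 if FALSE_HITS.get(f, 0) <= HIT_THRESHOLD]
    return min(survivors, key=lambda f: GROUP_SIZE[f]) if survivors else None


keywords = [pick_keyword(rule) for rule in RULES]
```

Under these assumptions `keywords` holds huaw, huaw, news, sina, yaho, no keyword, msdn, and no keyword for rule conditions 1 to 8, which agrees with the selections described in the example.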
In the technical solutions of the foregoing embodiments, rule conditions are grouped by keyword, and the exact matching data sets precompiled for the groups can use different compilation algorithms. Precompiling the exact matching data set for the rule conditions of the group corresponding to each of the extracted keywords may specifically include:
For a group whose number of rule conditions is less than a preconfigured threshold, precompile the exact matching data set for the group's rule conditions using an NFA, DFA, or compressed DFA regular expression matching algorithm (an NFA regular expression implementation being, for example, PCRE (Perl Compatible Regular Expressions)), or using a single-pattern string matching algorithm such as the BM (Boyer-Moore) matching algorithm. In this step, after a group whose number of rule conditions is less than the preconfigured threshold is identified, it can further be determined whether any regular-expression-related element, such as a wildcard or a character range, appears in the rule condition; if so, NFA, DFA, or compressed DFA is used, otherwise the BM matching algorithm is used;
For a group whose number of rule conditions is equal to or greater than the preconfigured threshold, precompile all of the group's rule conditions into one exact matching data set, such as a DFA or D2FA state machine, using a DFA or compressed DFA regular expression matching algorithm. The preconfigured threshold can be set to 8, for example, so that the performance advantage of D2FA multi-pattern matching over one-by-one matching with a single-pattern algorithm is realized. Alternatively, if space efficiency is favored and the number of rules is disregarded, the NFA regular expression matching algorithm can be used throughout to precompile the rule conditions one by one into exact matching structures;
For a group that includes a rule condition with set complexity-defining parameters, precompile the exact matching data set for the group's rule conditions using an NFA or compressed DFA regular expression matching algorithm. A rule condition with set complexity-defining parameters can be one that meets an empirically preset level of complexity. If compiled into a DFA state machine, such rule conditions cause the number of states to grow sharply and occupy a large amount of memory. Examples are rule conditions that are floating and contain repetition wildcards such as "*", "?", and "+". "Floating" means that the position at which the expected pattern string appears is not fixed.
For example, in the example above, the rule conditions are grouped by the selected keywords. With the preconfigured threshold for groups set to 2, the grouping and the exact matching data set used by each group can be as shown in Table 2 below:
Table 2
Figure imgf000018_0001
Of course, in practice the algorithms used by the groups are not limited to those shown in Table 2. As shown in FIG. 12, other precompilation algorithms can also be selected for different groups.
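The per-group choice of compilation algorithm described above can be summarized as a small decision function. This is a hedged sketch only: the function name `choose_algorithm`, the returned labels, and the order in which the three cases are checked are assumptions, and the test for a "complex" rule condition (the presence of a repetition wildcard) is a simple stand-in for the empirically preset complexity parameters. Rule conditions are taken in their user-entered form, where "." is a literal character.

```python
def choose_algorithm(rules, threshold=8):
    """Suggest a matcher family for one group of rule conditions (illustrative)."""

    def is_complex(rule):
        # Assumed heuristic: repetition wildcards make the pattern "floating".
        return any(ch in rule for ch in "*?+")

    def has_regex_elements(rule):
        # Wildcards or character ranges such as [0-9].
        return any(ch in rule for ch in "*?+[")

    if any(is_complex(r) for r in rules):
        return "NFA or compressed DFA"
    if len(rules) >= threshold:
        return "DFA or compressed DFA (one state machine)"
    if any(has_regex_elements(r) for r in rules):
        return "NFA, DFA or compressed DFA"
    return "BM single-pattern"
```

For instance, a group holding only `www.sina[0-9].com` falls below the threshold and contains a character range, so the sketch suggests a regular expression matcher, while a group of plain strings below the threshold falls back to BM.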
Embodiment 6
FIG. 6 is a schematic structural diagram of a content filtering apparatus according to Embodiment 6 of the present invention. The content filtering apparatus can be integrated into a device that performs content filtering, such as an enterprise gateway, and is configured to perform the content filtering method provided by the present invention. The content filtering apparatus specifically includes a content obtaining module 610, a content filtering module 620, and a policy implementation module 630. The content obtaining module 610 is configured to obtain the content to be filtered. The content filtering module 620 specifically includes a keyword extracting unit 621, a group compiling unit 622, a rule condition compiling unit 623, a group matching unit 624, and a rule condition matching unit 625. The keyword extracting unit 621 is configured to extract keywords from each of the one or more input rule conditions. The group compiling unit 622 is configured to divide the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group share the same keyword, and to precompile a group matching data set for the extracted keywords. The rule condition compiling unit 623 is configured to precompile, for each of the extracted keywords, an exact matching data set for the rule conditions of the corresponding group. The group matching unit 624 is configured to match keywords against the content to be filtered using the group matching data set, to obtain the matched keywords. The rule condition matching unit 625 is configured to perform exact matching of rule conditions on the content to be filtered using the exact matching data set of the rule conditions in the group corresponding to each matched keyword. The policy implementation module 630 is configured to execute, according to the result of the exact matching, the filtering policy corresponding to that result.
In the above technical solution, keyword grouping is used to prefilter the content to be filtered before exact matching is performed. This balances memory usage against matching performance and accuracy, and provides an optimized content filtering solution.
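The two-stage flow performed by the group matching unit and the rule condition matching unit can be sketched as follows. This is an illustration under assumptions rather than the apparatus itself: a plain substring test stands in for the precompiled group matching data set, Python regular expressions stand in for the exact matching data sets, whole-string matching is assumed, and the names `GROUPS`, `EXACT`, and `filter_content` are hypothetical.

```python
import re

# keyword -> rule conditions of that group, already converted to regular expressions
GROUPS = {
    "huaw": [r"www\.huawei.*\.com", r"www[0-3]\.huawei\.com"],
    "sina": [r"www\.sina[0-9]\.com"],
}

# One exact matching data set per group (here simply a list of compiled regexes).
EXACT = {kw: [re.compile(p) for p in pats] for kw, pats in GROUPS.items()}


def filter_content(content):
    """Return the rule conditions that the content matches exactly."""
    hits = []
    for keyword, patterns in EXACT.items():
        if keyword in content:                      # stage 1: keyword prefilter
            hits += [p.pattern for p in patterns    # stage 2: exact matching
                     if p.fullmatch(content)]
    return hits
```

Content such as `www.sina.com` passes the prefilter for the sina group but matches no rule condition in it, which is the false hit case that the keyword statistics track.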
On the basis of the above technical solution, the content filtering module 620 may further include a filtering rule compiling unit 626. The policy implementation module 630 includes a filtering rule matching unit 631 and a policy implementation unit 632. The filtering rule compiling unit 626 is configured to assign a unique condition identifier to each of the one or more rule conditions and to precompile a filter matching data set for the filtering rules, where a filtering rule is a combination of one or more rule conditions and is expressed using the condition identifiers of those rule conditions as characters. The filtering rule matching unit 631 is configured to take, as characters, the condition identifiers of the rule conditions that the content to be filtered matched exactly, and to match those characters against the filtering rules using the filter matching data set; the rule conditions that the content matched exactly are obtained by the exact matching of rule conditions performed on the content to be filtered. The policy implementation unit 632 is configured to execute, according to the result of the filtering rule matching, the filtering policy corresponding to that result.
By representing rule conditions with condition identifiers and further compiling the filtering rules in the form of regular expressions, the matching result can be obtained with a single filter matching pass.
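How condition identifiers can serve as the characters of a filtering rule expression is sketched below, using the double-byte identifiers 0x87, 0x91, and 525 (0x020d) from the earlier example. The sketch assumes that a filtering rule is the plain concatenation of its condition identifiers and that matching covers the whole identifier string; the exact expression syntax of the filtering rules is not modeled, and the function names are hypothetical.

```python
import re
import struct


def encode_ids(condition_ids):
    """Encode every condition identifier as two big-endian bytes."""
    return b"".join(struct.pack(">H", cid) for cid in condition_ids)


def compile_filter_rule(condition_ids):
    """Compile a filtering rule into a regular expression over identifier bytes."""
    # re.escape protects identifier bytes that happen to be regex metacharacters.
    return re.compile(re.escape(encode_ids(condition_ids)), re.DOTALL)


rule = compile_filter_rule([0x87, 0x91, 525])

# Identifiers of the rule conditions that the content matched exactly, in order.
matched = encode_ids([0x87, 0x91, 525])
rule_applies = rule.fullmatch(matched) is not None
```

Matching the whole identifier string, rather than searching inside it, avoids accidental matches that start in the middle of a double-byte identifier.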
Preferably, the rule condition compiling unit 623 is further configured to: when it identifies that no keyword can be extracted from an input rule condition, place the rule condition in a to-be-prompted group, precompile an exact matching data set for the rule conditions of the to-be-prompted group, and prompt the user that the rule condition is bad.
Correspondingly, the rule condition matching unit is further configured to: when the content to be filtered does not match any keyword, perform exact matching of rule conditions on that content using the exact matching data set corresponding to the rule conditions of the to-be-prompted group.
The above technical solution ensures exact matching for all content to be filtered, and can prompt the user to refine rule conditions so that they meet the grouping requirements of prefiltering.
Embodiment 7
FIG. 7 is a schematic structural diagram of a content filtering apparatus according to Embodiment 7 of the present invention. This embodiment builds on the foregoing embodiments. The keyword extracting unit 621 preferably includes a field division subunit 621a and a field screening subunit 621b. The field division subunit 621a is configured to divide the input rule conditions into fields according to a preset division policy. The field screening subunit 621b is configured to screen the divided fields based on a preset screening policy to obtain the keywords of the rule conditions. The field screening subunit is specifically configured to: delete, from the divided fields, the fields that match fields in the blacklist; delete, according to the recorded number of false hits per field, the fields whose number of false hits exceeds the hit threshold; and, for each rule condition, select as its keyword the candidate field whose keyword group contains the fewest rule conditions. However, a person skilled in the art will understand that the above items may also be performed independently or in another order, and that other screening policies may be added, for example selecting fields that match fields in a whitelist as keywords.
To keep the screening policy accurate, the content filtering module may further include a statistics update unit, which specifically includes a false hit count recording subunit and a blacklist update subunit. The false hit count recording subunit is configured to update the false hit count record of a keyword when content to be filtered that matched the keyword fails to match any corresponding rule condition using the exact matching data set. The blacklist update subunit is configured to add keywords whose false hit count exceeds a set threshold to the blacklist.
The keyword extraction policy determines the quality of the extracted keywords and directly affects prefiltering efficiency. The technical solution of this embodiment can dynamically update the data used by the keyword screening policy according to actual content filtering results, so that the extracted keywords better reflect the needs of content filtering.
On the basis of the above technical solution, different matching algorithms can be used for different groups according to the actual situation. That is, the rule condition compiling unit specifically includes:
A first compiling subunit, configured to, for a group whose number of rule conditions is less than a preconfigured threshold, precompile the exact matching data set for the group's rule conditions using an NFA, DFA, or compressed DFA regular expression matching algorithm, or using a single-pattern string matching algorithm;
A second compiling subunit, configured to, for a group whose number of rule conditions is equal to or greater than the preconfigured threshold, precompile the exact matching data set for the group's rule conditions using a DFA or compressed DFA regular expression matching algorithm;
A third compiling subunit, configured to, for a group that includes a rule condition with set complexity-defining parameters, precompile the exact matching data set for the group's rule conditions using an NFA or compressed DFA regular expression matching algorithm.
Embodiment 8
FIG. 8 is a schematic structural diagram of a content filtering apparatus according to Embodiment 8 of the present invention. This embodiment builds on the foregoing embodiments; the improvement is that the content obtaining module 610 may specifically include a protocol identification unit 611 and a protocol parsing unit 612. The protocol identification unit 611 is configured to identify the protocol of a received data packet using a deep packet identification technology. The protocol parsing unit 612 is configured to parse the fields of the data packet based on the identified protocol to obtain at least one preset field, and to take each preset field as content to be filtered, so that the subsequent group matching, exact matching, and filter matching operations are performed on each of them. The filtering rule is a combination of one or more rule conditions, and those rule conditions correspond to one or more preset fields.
The content filtering apparatus provided by the embodiments of the present invention can perform the content filtering method provided by any embodiment of the present invention and has the corresponding functional module structure.
Embodiment 9
Embodiment 9 of the present invention describes the content filtering method in detail by way of a preferred example. The content filtering method provided by the embodiments of the present invention operates on text-based application-layer protocols, and a rule condition may apply to any field in the protocol, for example a URL address, a request method, or a header field. This embodiment uses the URL address field as an example, but a person skilled in the art will understand that the precompiled data sets and the matching and filtering method for other fields can be implemented in the same way.
FIG. 9 is a schematic diagram of a network architecture to which Embodiment 9 of the present invention is applicable. The network includes local area network (LAN) network elements, wide area network (WAN) network elements, routers, switches, and the like. A user terminal connects from the LAN to the WAN through a switch and a router. An application control node (Application Control Point) is deployed between the LAN and the WAN to implement content filtering. It should be understood that the application control node has the functions of the content filtering apparatus of the embodiments of the present invention. In different implementations, the application control node may be any network element that performs content filtering, such as an enterprise router, a gateway GPRS support node (GGSN) network element device, an Internet gateway device, or a wireless controller device.
For the structure of the content filtering apparatus, refer to Embodiment 7 or 8. The apparatus performs the content filtering method provided by the embodiments of the present invention, which mainly comprises a precompilation process and a filtering process.
FIG. 10 is a schematic diagram of the keyword extraction process in the content filtering method according to Embodiment 9 of the present invention. Based on the screening policies, step 1 divides (parses) the rule condition into fields; step 2 removes from the divided fields those that appear in the blacklist; step 3 screens the remaining fields by their false-hit counts; and step 4 selects the keyword according to the policy of choosing the field whose group contains the fewest rule conditions. In the example shown, msdn is finally selected from the rule condition as the keyword.
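The four screening steps can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the division policy (cutting at non-alphanumeric characters and discarding bracket expressions), the function names, the default threshold, and the sample blacklist and group sizes are all hypothetical.

```python
import re

def split_fields(condition: str) -> list[str]:
    """Step 1 (assumed division policy): drop bracket expressions such as
    [0-9], then cut the rule condition at every non-alphanumeric character."""
    without_classes = re.sub(r"\[[^\]]*\]", " ", condition)
    return [f for f in re.split(r"[^A-Za-z0-9]+", without_classes) if f]

def pick_keyword(condition, blacklist, false_hits, group_sizes, hit_threshold=100):
    """Return the keyword of one rule condition, or None if no field survives."""
    fields = split_fields(condition)
    # Step 2: remove fields that appear in the blacklist.
    fields = [f for f in fields if f not in blacklist]
    # Step 3: remove fields whose recorded false-hit count exceeds the threshold.
    fields = [f for f in fields if false_hits.get(f, 0) <= hit_threshold]
    if not fields:
        return None  # the rule condition has no usable keyword
    # Step 4: prefer the field whose group currently holds the fewest rule conditions.
    return min(fields, key=lambda f: group_sizes.get(f, 0))

# Hypothetical inputs chosen so that the result agrees with the msdn example.
keyword = pick_keyword(
    "www.msdn.microsoft*/news",
    blacklist={"www", "com", "news"},
    false_hits={},
    group_sizes={"microsoft": 2, "msdn": 1},
)
```

With these assumed inputs, the blacklist removes www and news, and step 4 prefers msdn because its group is smaller than that of microsoft.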
FIG. 11 is a schematic diagram of the filtering process in the content filtering method according to Embodiment 9 of the present invention. FIG. 11 shows the rule condition precompilation phase and the rule condition matching and filtering phase.
In the rule condition precompilation phase, the input rule conditions are as follows:
1: www.huawei*.com
2: www[0-3].huawei.com
3: *google.com/news
4: www.sina[0-9].com
5: www.yahoo*.com/news
6: *.microsoft.*
7: www.msdn.microsoft*/news
8: www.[a-z][a-z][a-z].com.cn
According to the foregoing screening policies, a keyword is selected for each rule condition and, as shown in FIG. 11, the group matching data set is compiled as an AC (Aho-Corasick) state machine. The rule conditions are then grouped by keyword: as shown in FIG. 11, rule conditions 1 and 2 fall into one group, the other rule conditions are each grouped under their own keyword, and rule conditions 6 and 8, which have no keyword, are placed in the bad rule condition group. An exact matching data set is then precompiled for each group with a suitable algorithm.
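The grouping step can be illustrated with a short Python sketch. The keywords huaw and msdn and the placement of rule conditions 6 and 8 in the bad rule condition group follow the text; the keywords shown for rule conditions 3, 4, and 5 are assumptions made for illustration, as are all names in the code.

```python
from collections import defaultdict

BAD_GROUP = None  # hypothetical marker for rule conditions without a keyword

def group_by_keyword(keywords_by_rule):
    """Map each keyword to the list of rule condition numbers that share it."""
    groups = defaultdict(list)
    for rule_id, keyword in keywords_by_rule.items():
        groups[keyword].append(rule_id)
    return dict(groups)

# Keywords for rules 3, 4 and 5 are assumed; the figure itself is not reproduced here.
keywords_by_rule = {
    1: "huaw", 2: "huaw", 3: "google", 4: "sina",
    5: "yahoo", 6: BAD_GROUP, 7: "msdn", 8: BAD_GROUP,
}
groups = group_by_keyword(keywords_by_rule)

# The group matching data set is built from the real keywords only;
# a multi-pattern matcher such as an AC state machine would be compiled from this list.
group_matching_keywords = sorted(k for k in groups if k is not BAD_GROUP)
```

Only the keyword list feeds the multi-pattern matcher; each group's rule conditions are compiled separately into that group's exact matching data set.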
In the rule condition matching phase, the content to be filtered is obtained and passed to the content filtering module, whose matching data sets have been configured in advance, compiled, and kept in memory. As shown in FIG. 11, the content to be filtered is the website address www.huawei.com/news. The content filtering module first performs keyword matching on the content by using the group matching data set, for example by running the content through the AC state machine for multi-pattern matching. This pre-filtering step yields the matched keyword huaw.
The exact matching data set of the group corresponding to this keyword is then used to check whether the content matches a rule condition; in this example, the match succeeds.
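The two-stage lookup described above can be sketched as follows. This is a simplified illustration, not the embodiment's implementation: a plain substring test stands in for the AC state machine, Python's `re` module stands in for the precompiled exact matching data sets, and the rule syntax (`*` as a wildcard, `[...]` as a character class, a condition matching anywhere in the field) is an assumption.

```python
import re

def condition_to_regex(condition: str) -> re.Pattern:
    """Translate a rule condition into a regex (assumed syntax: '*' is a
    wildcard, '[...]' is a character class, everything else is literal)."""
    parts = []
    for token in re.findall(r"\[[^\]]*\]|\*|.", condition):
        if token == "*":
            parts.append(".*")
        elif token.startswith("[") and len(token) > 1:
            parts.append(token)
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts))

def build_groups(rules_by_keyword):
    """Precompile one exact matching set (a list of regexes) per keyword group."""
    return {
        keyword: [(rule_id, condition_to_regex(cond)) for rule_id, cond in rules]
        for keyword, rules in rules_by_keyword.items()
    }

def match_content(content, groups):
    """Stage 1: keyword pre-filter. Stage 2: exact match within hit groups only."""
    matched = []
    for keyword, compiled in groups.items():
        if keyword in content:  # stand-in for multi-pattern AC matching
            for rule_id, pattern in compiled:
                if pattern.search(content):
                    matched.append(rule_id)
    return matched

groups = build_groups({
    "huaw": [(1, "www.huawei*.com"), (2, "www[0-3].huawei.com")],
    "msdn": [(7, "www.msdn.microsoft*/news")],
})
result = match_content("www.huawei.com/news", groups)
```

Because only the huaw group is examined for this input, the regular expressions of every other group are never evaluated, which is the purpose of the pre-filtering stage.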
Then the condition identifier of the matched rule condition can be used as a character and matched against the filter matching data set. The result is either a successful match or a failed match, and the data packet is handled according to the default pass policy configured for the device as a whole. For example, the policy may be a whitelist (pass on a successful match) or a blacklist (filter on a successful match), which determines whether the packet is sent to the policy enforcement module for further processing.
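A minimal sketch of this last stage is given below. It assumes that each condition identifier is a single character, that a filtering rule is a string of identifiers that must all be matched (a logical AND), and that a failed match under a whitelist policy means the packet is blocked; none of these details are fixed by the text, and the function names are hypothetical.

```python
def rule_matches(rule: str, matched_ids: set) -> bool:
    """A filtering rule such as "AC" is treated as: conditions A and C both matched."""
    return set(rule) <= matched_ids

def decide(matched_ids: set, filter_rules: list, mode: str) -> str:
    """Apply the device-wide default policy to the filtering rule result."""
    hit = any(rule_matches(rule, matched_ids) for rule in filter_rules)
    if mode == "whitelist":   # a successful match is passed
        return "pass" if hit else "block"
    if mode == "blacklist":   # a successful match is filtered
        return "block" if hit else "pass"
    raise ValueError(f"unknown policy mode: {mode}")

# Hypothetical rule base: rule "AC" needs conditions A and C, rule "B" needs only B.
filter_rules = ["AC", "B"]
verdict = decide({"A", "C"}, filter_rules, "blacklist")
```

Expressing rules over identifiers rather than over the original patterns means this stage never has to look at the packet content again.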
The content filtering solution provided by the embodiments of the present invention has several advantages and balances memory usage against matching performance. It supports complex rule conditions such as regular expressions, and it supports multi-dimensional content filtering: not only URL addresses but also any configurable header field. Pre-filtering and the dynamic collection of false-hit keywords improve matching performance. Keywords that degrade performance can be collected dynamically and added to the blacklist, and the content filtering rule base can be adjusted periodically, that is, the keyword extraction, grouping, and precompilation process is repeated periodically, so that the system adapts to the target operating environment and reaches the best performance balance.
An embodiment of the present invention further provides a computer system. As shown in FIG. 13, the computer system includes at least one processor 131 and a memory 132. The memory 132 is configured to store instructions. The processor 131 is coupled to the memory 132 and is configured to execute the instructions stored in the memory 132 so as to perform the content filtering method provided by any embodiment of the present invention.
Specifically, the processor 131 may be configured to execute the instructions stored in the memory 132 to perform the following process:
extracting a keyword from each of one or more input rule conditions;
dividing the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keyword, and precompiling a group matching data set for the extracted keywords;
precompiling, for each of the extracted keywords, an exact matching data set for the rule conditions of the group corresponding to that keyword;
obtaining content to be filtered;
performing keyword matching on the content to be filtered by using the group matching data set, to obtain a matched keyword;
performing exact matching of rule conditions on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword; and
executing, according to the result of the exact matching, a filtering policy corresponding to that result.
In the foregoing content filtering method, preferably, the processor 131 may be configured to execute the instructions stored in the memory 132 to further perform the following process:
assigning a unique condition identifier to each of the one or more rule conditions, and precompiling a filter matching data set for filtering rules, where each filtering rule is a combination of the one or more rule conditions and is expressed by using the condition identifiers of those rule conditions as characters;
in which case executing, according to the result of the exact matching, the filtering policy corresponding to that result includes:
using the condition identifier of each rule condition that the content to be filtered exactly matched as a character, and matching those characters against the filtering rules by using the filter matching data set, where the exactly matched rule conditions are those obtained by the exact matching of rule conditions on the content to be filtered; and
executing, according to the result of the filtering rule matching, a filtering policy corresponding to that result.
In the foregoing content filtering method, preferably, the processor 131 may be configured to execute the instructions stored in the memory 132 to further perform the following process:
when a new rule condition is obtained, extracting a keyword from the new rule condition;
finding or creating the corresponding group for the new rule condition according to the keyword extracted from it, and recompiling the group matching data set;
precompiling, according to the new rule condition, the exact matching data set of the rule conditions of the corresponding group; and
assigning a condition identifier to the new rule condition, and recompiling the filter matching data set.
In the foregoing content filtering method, preferably, the processor 131 may be configured to execute the instructions stored in the memory 132 to further perform the following process:
determining, according to an input rule condition deletion instruction, the rule condition to be deleted or the condition identifier corresponding to the rule condition to be deleted, and extracting a keyword from the rule condition to be deleted;
updating the group matching data set according to the keyword extracted from the rule condition to be deleted;
if the rule condition is to be deleted, recompiling the exact matching data set for the rule conditions of the group corresponding to the extracted keyword, so as to delete the rule condition; and
if the condition identifier corresponding to the rule condition is to be deleted, recompiling the filter matching data set so as to delete that condition identifier.
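The incremental add and delete flows can be sketched with a small in-memory store. This is an assumption-laden illustration: the class `RuleStore` and its methods are hypothetical, the keyword is supplied by the caller instead of being extracted, rule conditions are given directly as regular expressions, and "recompiling" is reduced to rebuilding a sorted keyword list and a per-group dictionary.

```python
import re

class RuleStore:
    """Hypothetical store that keeps the data sets consistent across updates."""

    def __init__(self):
        self.groups = {}    # keyword -> {condition_id: compiled regex}
        self.keywords = []  # stand-in for the compiled group matching data set
        self.next_id = 1

    def _recompile_keywords(self):
        # A real system would rebuild its multi-pattern matcher here.
        self.keywords = sorted(self.groups)

    def add(self, keyword: str, regex: str) -> int:
        """Find or create the group, compile the new condition, assign its identifier."""
        condition_id = self.next_id
        self.next_id += 1
        self.groups.setdefault(keyword, {})[condition_id] = re.compile(regex)
        self._recompile_keywords()
        return condition_id

    def delete(self, keyword: str, condition_id: int) -> None:
        """Remove one condition; drop the keyword when its group becomes empty."""
        group = self.groups[keyword]
        del group[condition_id]
        if not group:
            del self.groups[keyword]
        self._recompile_keywords()

store = RuleStore()
first = store.add("huaw", r"www\.huawei.*\.com")
second = store.add("huaw", r"www[0-3]\.huawei\.com")
third = store.add("msdn", r"www\.msdn\.microsoft.*/news")
store.delete("msdn", third)  # the msdn group is now empty and its keyword is removed
```

Only the group touched by an update is rebuilt, so adding or deleting one rule condition does not require recompiling the exact matching data sets of unrelated groups.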
In the foregoing content filtering method, preferably, the processor 131 may be configured to execute the instructions stored in the memory 132, in which case extracting a keyword from each of the one or more input rule conditions specifically includes the following process:
dividing each input rule condition into fields according to a preset division policy; and
screening the divided fields based on preset screening policies to obtain the keyword of the rule condition.
In the foregoing content filtering method, preferably, the processor 131 may be configured to execute the instructions stored in the memory 132, in which case screening the divided fields based on the preset screening policies to obtain the keyword of the rule condition specifically includes the following process:
deleting, from the divided fields, any field identical to a field in the blacklist;
deleting, according to the recorded false-hit count of each field, any field whose false-hit count exceeds the hit threshold; and
for each rule condition, selecting, from the candidate keywords of that rule condition, the field whose keyword group contains the fewest rule conditions as the keyword of the rule condition.
In the foregoing content filtering method, preferably, the processor 131 may be configured to execute the instructions stored in the memory 132, in which case, after the exact matching of rule conditions is performed on the content to be filtered that matched a keyword, by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword, the following process is further performed:
when content to be filtered that matched a keyword does not match any corresponding rule condition in the exact matching data set, updating the false-hit count recorded for that keyword; and
adding any keyword whose false-hit count exceeds a set threshold to the blacklist.
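The false-hit bookkeeping can be sketched as a small counter class. The class name, its interface, and the threshold value used in the example are hypothetical; the text only specifies that the count is updated on a false hit and that keywords above a set threshold are blacklisted.

```python
from collections import Counter

class FalseHitTracker:
    """Counts keyword hits that were not confirmed by exact matching."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.false_hits = Counter()
        self.blacklist = set()

    def record(self, keyword: str, exact_matched: bool) -> None:
        if exact_matched:
            return  # the keyword hit led to a real rule condition match
        self.false_hits[keyword] += 1
        if self.false_hits[keyword] > self.threshold:
            # The keyword is excluded the next time keywords are extracted.
            self.blacklist.add(keyword)

tracker = FalseHitTracker(threshold=2)
for _ in range(3):
    tracker.record("com", exact_matched=False)
tracker.record("huaw", exact_matched=True)
```

A blacklisted keyword takes effect at the next keyword extraction pass, when step 2 of the screening removes it from the candidate fields.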
In the foregoing content filtering method, preferably, the processor 131 may be configured to execute the instructions stored in the memory 132, in which case precompiling, for each of the extracted keywords, the exact matching data set for the rule conditions of the corresponding group specifically includes the following process:
for a group in which the number of rule conditions is less than a preconfigured threshold, precompiling the exact matching data set for the rule conditions of the group by using a regular expression matching algorithm based on a nondeterministic finite automaton (NFA), a deterministic finite automaton (DFA), or a compressed DFA, or by using a single-pattern string matching algorithm;
for a group in which the number of rule conditions is equal to or greater than the preconfigured threshold, precompiling the exact matching data set for the rule conditions of the group by using a regular expression matching algorithm based on a DFA or a compressed DFA; and
for a group that includes a rule condition with set complex definition parameters, precompiling the exact matching data set for the rule conditions of the group by using a regular expression matching algorithm based on an NFA or a compressed DFA.
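The selection among matching algorithms can be written as a simple decision function. The function name and the labels are hypothetical, and giving the complex-condition case priority over the size-based cases is an assumption, since the text does not state which case wins when both apply.

```python
def candidate_engines(rule_count: int, threshold: int, has_complex_condition: bool) -> list:
    """Return the matching algorithms the text allows for one group."""
    if has_complex_condition:
        # Assumed to take priority over the size-based cases.
        return ["NFA", "compressed DFA"]
    if rule_count < threshold:
        return ["NFA", "DFA", "compressed DFA", "single-pattern string matching"]
    return ["DFA", "compressed DFA"]

small_group = candidate_engines(rule_count=3, threshold=10, has_complex_condition=False)
large_group = candidate_engines(rule_count=50, threshold=10, has_complex_condition=False)
```

The choice reflects the usual trade-off: an NFA is compact but slower to run, an uncompressed DFA is fast but can grow very large for complex expressions, and a compressed DFA sits between the two.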
In the foregoing content filtering method, preferably, the processor 131 may be configured to execute the instructions stored in the memory 132, in which case obtaining the content to be filtered specifically includes the following process:
performing protocol identification on a received data packet by using deep packet inspection (DPI) technology; and
performing field parsing on the data packet based on the identified protocol to obtain at least one preset field, and using each preset field as content to be filtered, so that the subsequent group matching, exact matching, and filter matching operations are performed on each field separately, where the filtering rule is a combination of one or more rule conditions, and these rule conditions correspond to one or more preset fields.
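For an HTTP request, the identification and field parsing steps might look like the sketch below. It is deliberately simplified: checking the first token against a list of methods is only a stand-in for real deep packet inspection, the set of preset header fields is an assumption, and the function names are hypothetical.

```python
HTTP_METHODS = {"GET", "POST", "HEAD", "PUT", "DELETE", "OPTIONS"}

def looks_like_http(payload: bytes) -> bool:
    """Crude protocol identification: does the payload start with an HTTP method?"""
    first_token = payload.split(b" ", 1)[0].decode("latin-1")
    return first_token in HTTP_METHODS

def parse_http_fields(payload: bytes, preset_headers=("host", "user-agent")) -> dict:
    """Extract the preset fields; each one becomes separate content to be filtered."""
    head = payload.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    method, target, _version = lines[0].split(" ", 2)
    fields = {"method": method, "url": target}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() in preset_headers:
            fields[name.strip().lower()] = value.strip()
    return fields

packet = b"GET /news HTTP/1.1\r\nHost: www.huawei.com\r\nAccept: */*\r\n\r\n"
fields = parse_http_fields(packet) if looks_like_http(packet) else {}
```

Each entry of the resulting dictionary would then be run through group matching and exact matching against the rule conditions configured for that field.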
In the foregoing content filtering method, preferably, the processor 131 may be configured to execute the instructions stored in the memory 132 to further perform the following process:
when it is identified that no keyword can be extracted from an input rule condition, placing the rule condition in a to-be-prompted group, precompiling an exact matching data set for the rule conditions of the to-be-prompted group, and sending the user a prompt that the rule condition is a bad rule condition.
In the foregoing content filtering method, preferably, the processor 131 may be configured to execute the instructions stored in the memory 132, in which case, after keyword matching is performed on the content to be filtered by using the group matching data set, the following process is further performed:
when the content to be filtered does not match any keyword, performing exact matching of rule conditions on that content by using the exact matching data set corresponding to the rule conditions of the to-be-prompted group.
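The fallback to the to-be-prompted group can be sketched as follows. As before, a substring test stands in for the keyword matcher and Python regular expressions stand in for the exact matching data sets; the function name and the sample rule conditions written as regular expressions are assumptions.

```python
import re

def match_with_fallback(content: str, groups: dict, prompt_group: list) -> list:
    """Use keyword groups when a keyword hits, else the to-be-prompted group."""
    hit_keywords = [kw for kw in groups if kw in content]
    if hit_keywords:
        candidates = [rule for kw in hit_keywords for rule in groups[kw]]
    else:
        # No keyword matched: only the keyword-less rule conditions can still apply.
        candidates = prompt_group
    return [rule_id for rule_id, pattern in candidates if pattern.search(content)]

groups = {"huaw": [(1, re.compile(r"www\.huawei.*\.com"))]}
prompt_group = [(8, re.compile(r"www\.[a-z]{3}\.com\.cn"))]

no_keyword = match_with_fallback("www.abc.com.cn", groups, prompt_group)
with_keyword = match_with_fallback("www.huawei.com", groups, prompt_group)
```

Without this fallback, a rule condition that has no keyword could never be matched, because the pre-filter would discard every piece of content before exact matching.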
In the foregoing content filtering method, preferably, the processor 131 may be configured to execute the instructions stored in the memory 132, in which case extracting a keyword from each of the one or more input rule conditions specifically includes the following process:
extracting keywords from the one or more rule conditions that have already been input, according to a set period.
An embodiment of the present invention further provides another computer system. As shown in FIG. 14, the computer system includes a processor 141, a memory 142, and a matching filter 143. The memory 142 is configured to store instructions. The matching filter 143 is configured to hold the data sets, for example the group matching data set, the exact matching data sets, and the filter matching data set. The processor 141 is coupled to the memory 142 and the matching filter 143. The processor 141 is configured to execute the instructions stored in the memory 142 so as to perform the precompilation process of the content filtering method provided by the embodiments of the present invention, and is further configured to invoke the matching filter 143 so as to perform the content filtering process of that method.
Preferably, the matching filter may be implemented in hardware or in a combination of hardware and software, for example as a field-programmable gate array (FPGA). Specifically, the memory of the FPGA chip or an external memory stores the data sets, for example the group matching data set, the exact matching data set of each group, and the filter matching data set, and the FPGA chip also implements the matching logic of each matching unit: it uses the data sets to perform content matching on the application protocol data, passes the keyword matching result on to the exact matching data set, or passes the exact matching result on to the corresponding filtering policy. Alternatively, the protocol identification and field parsing operations that precede content filtering may also be implemented by the FPGA.
The computer system provided by the foregoing embodiment of the present invention may be configured as any network element that applies content filtering technology, for example an enterprise router, a gateway GPRS support node (GGSN) network element device, an Internet gateway device, or a wireless controller device.
When the processor executes the instructions in the memory and invokes the matching filter, the processor may specifically be configured to execute the instructions in the memory to perform the following operations:
extracting a keyword from each of one or more input rule conditions;
dividing the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keyword, and precompiling a group matching data set for the extracted keywords; and
precompiling, for each of the extracted keywords, an exact matching data set for the rule conditions of the group corresponding to that keyword;
and the processor may further be configured to invoke the matching filter to perform the following operations:
obtaining content to be filtered;
performing keyword matching on the content to be filtered by using the group matching data set, to obtain a matched keyword;
performing exact matching of rule conditions on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword; and
executing, according to the result of the exact matching, a filtering policy corresponding to that result.
Optionally, the processor may further be configured to execute the instructions in the memory to perform the following operation:
assigning a unique condition identifier to each of the one or more rule conditions, and precompiling a filter matching data set for filtering rules, where each filtering rule is a combination of one or more rule conditions and is expressed by using the condition identifiers of those rule conditions as characters;
in which case the processor may further be configured to invoke the matching filter to perform the following operations:
executing, according to the result of the exact matching, the filtering policy corresponding to that result includes: using the condition identifier of each rule condition that the content to be filtered exactly matched as a character, and matching those characters against the filtering rules by using the filter matching data set, where the exactly matched rule conditions are those obtained by the exact matching of rule conditions on the content to be filtered; and
executing, according to the result of the filtering rule matching, a filtering policy corresponding to that result.
Optionally, the processor may further be configured to execute the instructions in the memory to perform the following operations:
when a new rule condition is obtained, extracting a keyword from the new rule condition;
finding or creating the corresponding group for the new rule condition according to the keyword extracted from it, and recompiling the group matching data set;
precompiling, according to the new rule condition, the exact matching data set of the rule conditions of the corresponding group; and
assigning a condition identifier to the new rule condition, and recompiling the filter matching data set.
Optionally, the processor may further be configured to execute the instructions in the memory to perform the following operations:
determining, according to an input rule condition deletion instruction, the rule condition to be deleted or the condition identifier corresponding to the rule condition to be deleted, and extracting a keyword from the rule condition to be deleted;
updating the group matching data set according to the keyword extracted from the rule condition to be deleted;
if the rule condition is to be deleted, recompiling the exact matching data set for the rule conditions of the group corresponding to the extracted keyword, so as to delete the rule condition; and
if the condition identifier corresponding to the rule condition is to be deleted, recompiling the filter matching data set so as to delete that condition identifier.
Optionally, the processor may further be configured to execute the instructions in the memory to perform the following operation:
recompiling the filter matching data set according to a newly added filtering rule or a filtering rule deletion instruction, so as to add or delete the filtering rule.
Optionally, the processor may be configured to execute the instructions in the memory to perform the following operations, where extracting a keyword from each of the one or more input rule conditions includes:
dividing each input rule condition into fields according to a preset division policy; and
screening the divided fields based on preset screening policies to obtain the keyword of the rule condition.
Screening the divided fields based on the preset screening policies to obtain the keyword of the rule condition includes:
deleting, from the divided fields, any field identical to a field in the blacklist;
deleting, according to the recorded false-hit count of each field, any field whose false-hit count exceeds the hit threshold; and
for each rule condition, selecting, from the candidate keywords of that rule condition, the field whose keyword group contains the fewest rule conditions as the keyword of the rule condition.
Optionally, the processor is configured to execute the instructions in the memory to perform the following operations: after the exact matching of rule conditions is performed on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword, the operations further include:
when content to be filtered that matched a keyword does not match any corresponding rule condition in the exact matching data set, updating the false-hit count recorded for that keyword; and
adding any keyword whose false-hit count exceeds a set threshold to the blacklist.
Optionally, the processor is configured to execute the instructions in the memory to perform the following operations:
precompiling, for each of the extracted keywords, the exact matching data set for the rule conditions of the corresponding group includes:
for a group in which the number of rule conditions is less than a preconfigured threshold, precompiling the exact matching data set for the rule conditions of the group by using an NFA, DFA, or compressed DFA regular expression matching algorithm, or by using a single-pattern string matching algorithm;
for a group in which the number of rule conditions is equal to or greater than the preconfigured threshold, precompiling the exact matching data set for the rule conditions of the group by using a DFA or compressed DFA regular expression matching algorithm; and
for a group that includes a rule condition with set complex definition parameters, precompiling the exact matching data set for the rule conditions of the group by using an NFA or compressed DFA regular expression matching algorithm.
Optionally, the processor may further be configured to execute the instructions in the memory or to invoke the matching filter to perform the following operations:
obtaining the content to be filtered includes:
performing protocol identification on a received data packet by using deep packet inspection (DPI) technology; and
performing field parsing on the data packet based on the identified protocol to obtain at least one preset field, and using each preset field as content to be filtered, so that the subsequent group matching, exact matching, and filter matching operations are performed on each field separately, where the filtering rule is a combination of one or more rule conditions, and these rule conditions correspond to one or more preset fields.
Optionally, the processor may be further configured to execute instructions in the memory to implement the following operations:
When it is recognized that no keyword can be extracted from an input rule condition, the rule condition is placed in a to-be-prompted group, an exact matching data set is precompiled for the rule conditions of the to-be-prompted group, and a poor-rule-condition prompt is issued to the user.
Optionally, the processor may also be configured to invoke the matching filter to implement the following operations: after keyword matching is performed on the content to be filtered using the group matching data set, when the content to be filtered matches no keyword, exact matching of rule conditions is performed on that content using the exact matching data set corresponding to the rule conditions of the to-be-prompted group.
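The fallback behaviour described above can be sketched as follows. The class name `Prefilter`, its methods, and the warning text are hypothetical; Python's `re` module stands in for the precompiled exact matching data sets, and a plain substring test stands in for the group matching step.

```python
import re
from typing import Dict, List, Optional, Pattern


class Prefilter:
    def __init__(self) -> None:
        self.groups: Dict[str, List[Pattern[str]]] = {}  # keyword -> exact matching set
        self.pending: List[Pattern[str]] = []            # the "to-be-prompted" group
        self.warnings: List[str] = []                    # prompts issued to the user

    def add(self, pattern: str, keyword: Optional[str] = None) -> None:
        compiled = re.compile(pattern)
        if keyword is None:
            # No keyword could be extracted: keep the condition usable, but warn.
            self.pending.append(compiled)
            self.warnings.append(f"poor rule condition (no keyword): {pattern}")
        else:
            self.groups.setdefault(keyword, []).append(compiled)

    def match(self, content: str) -> List[str]:
        hit_keywords = [k for k in self.groups if k in content]
        if hit_keywords:
            candidates = [p for k in hit_keywords for p in self.groups[k]]
        else:
            # No keyword matched: fall back to the to-be-prompted group.
            candidates = self.pending
        return [p.pattern for p in candidates if p.search(content)]
```

Following the text, the to-be-prompted group is consulted only when no keyword matches, so conditions in it cost a full exact match on every such piece of content, which is why the user is prompted.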
Optionally, the processor is configured to execute instructions in the memory to implement the following operation: extracting keywords from the one or more input rule conditions includes extracting keywords, at a set period, from the one or more rule conditions that have already been input.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A content filtering method, characterized by comprising:
extracting keywords from one or more input rule conditions respectively;
dividing the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keyword, and precompiling a group matching data set for the extracted keywords;
precompiling, for each of the extracted keywords, an exact matching data set for the rule conditions of the group corresponding to that keyword;
obtaining content to be filtered;
performing keyword matching on the content to be filtered using the group matching data set, to obtain a matched keyword;
performing exact matching of rule conditions on the content to be filtered using the exact matching data set of the rule conditions of the group corresponding to the matched keyword;
executing, according to the matching result of the exact matching, a filtering policy corresponding to the matching result.
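The two-stage flow of claim 1 can be sketched in a few lines. Everything here is an assumption made for illustration: `extract_keyword` is a naive heuristic (the longest literal run of at least three letters or digits, which is only safe when that run is mandatory in the pattern), a set of keywords with substring tests stands in for a compiled multi-pattern matcher, Python's `re` stands in for the NFA/DFA exact matching data sets, and the policy is simply "block" or "allow".

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Pattern, Set, Tuple


def extract_keyword(condition: str) -> Optional[str]:
    """Naive keyword: longest literal alphanumeric run (escapes like \\d removed)."""
    literal = re.sub(r"\\.", " ", condition)
    runs = re.findall(r"[A-Za-z0-9]{3,}", literal)
    return max(runs, key=len) if runs else None


def build_data_sets(conditions: List[str]) -> Tuple[Set[str], Dict[str, List[Pattern[str]]]]:
    """Group conditions by keyword and precompile both data sets."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for cond in conditions:
        keyword = extract_keyword(cond)
        if keyword is not None:  # keyword-less conditions are out of scope here
            groups[keyword].append(cond)
    group_set = set(groups)  # stand-in for the group matching data set
    exact_sets = {k: [re.compile(c) for c in conds] for k, conds in groups.items()}
    return group_set, exact_sets


def filter_content(content: str, group_set: Set[str],
                   exact_sets: Dict[str, List[Pattern[str]]]) -> Tuple[str, List[str]]:
    # Stage 1: cheap keyword matching selects the candidate groups.
    matched_keywords = [k for k in group_set if k in content]
    # Stage 2: exact matching only within the groups whose keyword was found.
    hits = sorted(p.pattern for k in matched_keywords
                  for p in exact_sets[k] if p.search(content))
    # Stage 3: apply a (hypothetical) policy to the exact-match result.
    return ("block" if hits else "allow"), hits
```

The point of the design is that content containing none of the keywords never reaches the more expensive exact matching stage.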
2. The content filtering method according to claim 1, characterized by further comprising:
assigning a unique condition identifier to each of the one or more rule conditions, and precompiling a filter matching data set for a filtering rule, wherein the filtering rule is a combination of the one or more rule conditions, and the filtering rule is expressed using the condition identifiers of the one or more rule conditions as characters;
wherein executing, according to the matching result of the exact matching, the filtering policy corresponding to the matching result comprises:
using the filter matching data set, taking as characters the condition identifiers of the rule conditions that the content to be filtered exactly matched, and performing filtering rule matching on those characters, wherein the rule conditions that the content to be filtered exactly matched are obtained by the exact matching of rule conditions performed on the content to be filtered;
executing, according to the matching result of the filtering rule, a filtering policy corresponding to that matching result.
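One possible reading of "condition identifiers as characters" is sketched below. It is an assumption, not the claimed encoding: each condition gets one Unicode private-use character as its identifier, each filtering rule is treated as an AND over its conditions and compiled into a regular expression over those characters, and the identifiers of the exactly matched conditions are sorted and joined into the string that the rules are matched against.

```python
import re
from typing import Dict, Iterable, List, Pattern


def assign_ids(conditions: List[str]) -> Dict[str, str]:
    """Give every rule condition a unique one-character identifier."""
    return {cond: chr(0xE000 + i) for i, cond in enumerate(conditions)}


def compile_filter_rules(rules: Dict[str, List[str]],
                         ids: Dict[str, str]) -> Dict[str, Pattern[str]]:
    """Compile each rule (assumed: all listed conditions must match) over ID characters."""
    compiled = {}
    for name, conds in rules.items():
        chars = sorted(ids[c] for c in conds)
        # Sorted IDs joined by ".*" match any sorted ID string containing all of them.
        compiled[name] = re.compile(".*".join(re.escape(ch) for ch in chars))
    return compiled


def match_rules(matched_conditions: Iterable[str], ids: Dict[str, str],
                compiled: Dict[str, Pattern[str]]) -> List[str]:
    """Turn matched conditions into an ID string and run the rule patterns on it."""
    text = "".join(sorted({ids[c] for c in matched_conditions}))
    return [name for name, rx in compiled.items() if rx.search(text)]
```

With this encoding, rule evaluation reuses the same kind of pattern matching engine as content matching, only over a much shorter input string.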
3. The content filtering method according to claim 2, characterized by further comprising:
when a newly added rule condition is obtained, extracting a keyword from the newly added rule condition;
finding or creating a corresponding group for the newly added rule condition according to the keyword extracted from it, and recompiling the group matching data set;
precompiling, according to the newly added rule condition, the exact matching data set of the rule conditions of the corresponding group;
assigning a condition identifier to the newly added rule condition, and recompiling the filter matching data set.
4. The content filtering method according to claim 2, characterized by further comprising:
determining, according to an input rule condition deletion instruction, a rule condition to be deleted or the condition identifier corresponding to the rule condition to be deleted, and extracting a keyword from the rule condition to be deleted;
updating the group matching data set according to the keyword extracted from the rule condition to be deleted;
if the rule condition to be deleted needs to be deleted, recompiling the exact matching data set for the rule conditions of the group corresponding to the keyword extracted from the rule condition to be deleted, so as to delete the rule condition to be deleted;
if the condition identifier corresponding to the rule condition to be deleted needs to be deleted, recompiling the filter matching data set so as to delete the condition identifier corresponding to the rule condition to be deleted.
5. The content filtering method according to any one of claims 1 to 4, characterized in that extracting keywords from the one or more input rule conditions respectively comprises:
dividing each input rule condition into fields according to a preset division strategy;
screening the divided fields based on a preset screening strategy to obtain the keyword of the rule condition.
6. The content filtering method according to claim 5, characterized in that screening the divided fields based on the preset screening strategy to obtain the keyword of the rule condition comprises:
deleting, from the divided fields, any field that is identical to a field in a blacklist;
deleting, according to the recorded number of false hits of each field, any field whose number of false hits is higher than a hit threshold;
for each rule condition, selecting, from among the candidate keywords of that rule condition, the field whose keyword group contains the fewest rule conditions as the keyword of that rule condition.
7. The content filtering method according to claim 6, characterized in that, after performing exact matching of rule conditions on content to be filtered that matched a keyword, using the exact matching data set of the rule conditions of the group corresponding to the matched keyword, the method further comprises:
when the content to be filtered that matched a keyword matches no corresponding rule condition using the exact matching data set, updating the false-hit count record of that keyword;
adding any keyword whose number of false hits is higher than a set threshold to the blacklist.
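The screening steps of claim 6 and the feedback loop of claim 7 can be sketched together. The class and method names are hypothetical, the fields are assumed to have been produced already by the division strategy of claim 5, and a single threshold is used for both the screening step and the blacklist update, which is a simplification.

```python
from collections import Counter
from typing import List, Optional, Set


class KeywordSelector:
    def __init__(self, hit_threshold: int = 3) -> None:
        self.blacklist: Set[str] = set()
        self.false_hits: Counter = Counter()   # keyword -> recorded false hits
        self.group_sizes: Counter = Counter()  # keyword -> rule conditions in its group
        self.hit_threshold = hit_threshold

    def select(self, fields: List[str]) -> Optional[str]:
        """Pick the keyword for one rule condition from its divided fields."""
        candidates = [f for f in fields
                      if f not in self.blacklist
                      and self.false_hits[f] <= self.hit_threshold]
        if not candidates:
            return None
        # Prefer the field whose group currently holds the fewest rule conditions.
        keyword = min(candidates, key=lambda f: self.group_sizes[f])
        self.group_sizes[keyword] += 1
        return keyword

    def record_false_hit(self, keyword: str) -> None:
        """Called when a keyword matched but no rule condition in its group did."""
        self.false_hits[keyword] += 1
        if self.false_hits[keyword] > self.hit_threshold:
            self.blacklist.add(keyword)
```

Choosing the least populated group keeps groups small, and blacklisting frequently mis-hitting keywords steers later selections away from overly common fields.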
8. The content filtering method according to any one of claims 1 to 4, characterized in that precompiling, for each of the extracted keywords, the exact matching data set for the rule conditions of the group corresponding to that keyword comprises:
for a group whose number of rule conditions is less than a preconfigured threshold, precompiling the exact matching data set for the rule conditions of that group using a nondeterministic finite automaton, deterministic finite automaton, or compressed deterministic finite automaton regular expression matching algorithm, or using a single-pattern string matching algorithm;
for a group whose number of rule conditions is equal to or greater than the preconfigured threshold, precompiling the exact matching data set for the rule conditions of that group using a deterministic finite automaton or compressed deterministic finite automaton regular expression matching algorithm;
for a group that includes a rule condition having a set complex definition parameter, precompiling the exact matching data set for the rule conditions of that group using a nondeterministic finite automaton or compressed deterministic finite automaton regular expression matching algorithm.
9. The content filtering method according to any one of claims 2 to 4, characterized in that obtaining the content to be filtered comprises:
performing protocol identification on a received data packet using deep packet inspection;
based on the identified protocol, parsing the fields of the data packet to obtain at least one preset field, and using each preset field as content to be filtered, so that the subsequent group matching, exact matching, and filter matching operations are performed on each field separately, wherein the filtering rule is a combination of one or more rule conditions, and the filtering rule is a combination of one or more rule conditions corresponding to one or more preset fields.
10. The content filtering method according to any one of claims 1 to 4, characterized by further comprising:
when it is recognized that no keyword can be extracted from an input rule condition, placing the rule condition in a to-be-prompted group, precompiling an exact matching data set for the rule conditions of the to-be-prompted group, and issuing a poor-rule-condition prompt to the user.
11. The content filtering method according to claim 10, characterized in that, after performing keyword matching on the content to be filtered using the group matching data set, the method further comprises:
when the content to be filtered matches no keyword, performing exact matching of rule conditions on that content using the exact matching data set corresponding to the rule conditions of the to-be-prompted group.
12. The content filtering method according to any one of claims 1 to 4, characterized in that extracting keywords from the one or more input rule conditions respectively comprises:
extracting keywords, at a set period, from the one or more rule conditions that have already been input.
13. A content filtering device, characterized by comprising a content acquisition module, a content filtering module, and a policy implementation module, wherein
the content acquisition module is configured to obtain content to be filtered;
the content filtering module comprises:
a keyword extraction unit, configured to extract keywords from one or more input rule conditions respectively;
a group compilation unit, configured to divide the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keyword, and to precompile a group matching data set for the extracted keywords;
a rule condition compilation unit, configured to precompile, for each of the extracted keywords, an exact matching data set for the rule conditions of the group corresponding to that keyword;
a group matching unit, configured to perform keyword matching on the content to be filtered using the group matching data set, to obtain a matched keyword;
a rule condition matching unit, configured to perform exact matching of rule conditions on the content to be filtered using the exact matching data set of the rule conditions of the group corresponding to the matched keyword;
the policy implementation module is configured to execute, according to the matching result of the exact matching, a filtering policy corresponding to the matching result.
14. The content filtering device according to claim 13, characterized in that:
the content filtering module further comprises a filtering rule compilation unit, configured to assign a unique condition identifier to each of the one or more rule conditions and to precompile a filter matching data set for a filtering rule, wherein the filtering rule is a combination of one or more rule conditions, and the filtering rule is expressed using the condition identifiers of the one or more rule conditions as characters;
the policy implementation module comprises:
a filtering rule matching unit, configured to use the filter matching data set, take as characters the condition identifiers of the rule conditions that the content to be filtered exactly matched, and perform filtering rule matching on those characters, wherein the rule conditions that the content to be filtered exactly matched are obtained by the exact matching of rule conditions performed on the content to be filtered;
a policy implementation unit, configured to execute, according to the matching result of the filtering rule, a filtering policy corresponding to that matching result.
15. The content filtering device according to claim 13 or 14, characterized in that the rule condition compilation unit is further configured to, when it is recognized that no keyword can be extracted from an input rule condition, place the rule condition in a to-be-prompted group, precompile an exact matching data set for the rule conditions of the to-be-prompted group, and issue a poor-rule-condition prompt to the user.
16. The content filtering device according to claim 15, characterized in that the rule condition matching unit is further configured to, when the content to be filtered matches no keyword, perform exact matching of rule conditions on that content using the exact matching data set corresponding to the rule conditions of the to-be-prompted group.
17. The content filtering device according to claim 13 or 14, characterized in that the keyword extraction unit comprises:
a field division subunit, configured to divide each input rule condition into fields according to a preset division strategy;
a field screening subunit, configured to screen the divided fields based on a preset screening strategy to obtain the keyword of the rule condition.
18. The content filtering device according to claim 17, characterized in that the field screening subunit is specifically configured to:
delete, from the divided fields, any field that is identical to a field in a blacklist;
delete, according to the recorded number of false hits of each field, any field whose number of false hits is higher than a hit threshold;
for each rule condition, select, from among the candidate keywords of that rule condition, the field whose keyword group contains the fewest rule conditions as the keyword of that rule condition.
19. The content filtering device according to claim 18, characterized in that the content filtering module further comprises a statistics update unit, and the statistics update unit comprises:
a false-hit count recording subunit, configured to update the false-hit count record of a keyword when content to be filtered that matched the keyword matches no corresponding rule condition using the exact matching data set;
a blacklist update subunit, configured to add any keyword whose number of false hits is higher than a set threshold to the blacklist.
20. The content filtering device according to claim 13 or 14, characterized in that the rule condition compilation unit comprises:
a first compilation subunit, configured to, for a group whose number of rule conditions is less than a preconfigured threshold, precompile the exact matching data set for the rule conditions of that group using a nondeterministic finite automaton, deterministic finite automaton, or compressed deterministic finite automaton regular expression matching algorithm, or using a single-pattern string matching algorithm;
a second compilation subunit, configured to, for a group whose number of rule conditions is equal to or greater than the preconfigured threshold, precompile the exact matching data set for the rule conditions of that group using a deterministic finite automaton or compressed deterministic finite automaton regular expression matching algorithm;
a third compilation subunit, configured to, for a group that includes a rule condition having a set complex definition parameter, precompile the exact matching data set for the rule conditions of that group using a nondeterministic finite automaton or compressed deterministic finite automaton regular expression matching algorithm.
21. The content filtering device according to claim 13 or 14, characterized in that the content acquisition module comprises:
a protocol identification unit, configured to perform protocol identification on a received data packet using deep packet inspection;
a protocol parsing unit, configured to, based on the identified protocol, parse the fields of the data packet to obtain at least one preset field, and use each preset field as content to be filtered, so that the subsequent group matching, exact matching, and filter matching operations are performed on each field separately, wherein the filtering rule is a combination of one or more rule conditions, and the filtering rule is a combination of one or more rule conditions corresponding to one or more preset fields.
PCT/CN2013/073462 2012-06-30 2013-03-29 Content filtration method and device WO2014000485A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210223008.5 2012-06-30
CN201210223008.5A CN102857493B (en) 2012-06-30 2012-06-30 Content filtering method and device

Publications (1)

Publication Number Publication Date
WO2014000485A1 true WO2014000485A1 (en) 2014-01-03

Family

ID=47403688

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/073462 WO2014000485A1 (en) 2012-06-30 2013-03-29 Content filtration method and device

Country Status (2)

Country Link
CN (1) CN102857493B (en)
WO (1) WO2014000485A1 (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102857493B (en) * 2012-06-30 2015-07-08 华为技术有限公司 Content filtering method and device
CN103188267B (en) * 2013-03-27 2015-12-09 中国科学院声学研究所 A kind of protocol analysis method based on DFA
WO2015165245A1 (en) * 2014-04-30 2015-11-05 广州市动景计算机科技有限公司 Webpage data processing method and device
CN105095236A (en) * 2014-04-30 2015-11-25 优视科技有限公司 Advertisement filtering method and device
CN104462583A (en) * 2014-12-30 2015-03-25 北京奇虎科技有限公司 Browser device for advertisement blocking processing and mobile terminal
CN104778197B (en) * 2014-12-30 2019-02-01 北京锐安科技有限公司 A kind of data search method and device
CN105335486A (en) * 2015-10-15 2016-02-17 桂林电子科技大学 Data filter method and device
CN106713254B (en) * 2015-11-18 2019-08-06 中国科学院声学研究所 It is a kind of match canonic(al) ensemble generation and deep packet inspection method
CN105938475A (en) * 2015-12-28 2016-09-14 杭州迪普科技有限公司 Keyword filtering method and device
CN105681907A (en) * 2015-12-30 2016-06-15 中电长城网际系统应用有限公司 Information verification system and method thereof
CN106997363A (en) * 2016-01-26 2017-08-01 华为技术有限公司 A kind of data processing method and equipment
CN105635170B (en) * 2016-01-26 2018-12-18 宝利九章(北京)数据技术有限公司 The rule-based method and apparatus that network packet is identified
CN107153942B (en) * 2016-03-02 2021-02-26 北京京东尚科信息技术有限公司 Method for dynamically configuring and checking blacklist
CN106302436B (en) * 2016-08-11 2019-11-19 广州华多网络科技有限公司 A kind of autonomous discovery method, apparatus and equipment of attack message characteristics
CN106385345A (en) * 2016-09-23 2017-02-08 北京锐安科技有限公司 Method and apparatus for acquiring network data
CN106547878A (en) * 2016-10-26 2017-03-29 北京微网通联股份有限公司 Fast filtering method based on multi-key word
CN106657055B (en) * 2016-12-19 2019-11-15 北京网御星云信息技术有限公司 A kind of message filtering method and system
CN108460038A (en) * 2017-02-20 2018-08-28 阿里巴巴集团控股有限公司 Rule matching method and its equipment
CN106843996A (en) * 2017-03-08 2017-06-13 百富计算机技术(深圳)有限公司 Conditional compilation preprocess method and device
CN107645502B (en) * 2017-09-20 2021-01-22 新华三信息安全技术有限公司 Message detection method and device
CN108595566A (en) * 2018-04-13 2018-09-28 中国民航信息网络股份有限公司 Information cluster method and device
CN108833511A (en) * 2018-05-21 2018-11-16 聊城大学东昌学院 A kind of Artificial Intelligent Information Filtering system
CN110909149B (en) * 2018-09-17 2022-06-03 北京国双科技有限公司 Data filtering method and device
CN109204193B (en) * 2018-10-12 2021-05-14 杭州小驹物联科技有限公司 Method and system for quickly identifying automobile signals and parameters
CN109688205B (en) * 2018-12-07 2021-06-22 麒麟合盛网络技术股份有限公司 Webpage resource interception method and device
CN109905293B (en) * 2019-03-12 2021-06-08 北京奇虎科技有限公司 Terminal equipment identification method, system and storage medium
US11012414B2 (en) 2019-04-30 2021-05-18 Centripetal Networks, Inc. Methods and systems for prevention of attacks associated with the domain name system
US11012417B2 (en) 2019-04-30 2021-05-18 Centripetal Networks, Inc. Methods and systems for efficient packet filtering
CN111125693A (en) * 2019-12-18 2020-05-08 杭州安恒信息技术股份有限公司 Equipment safety protection method, device and equipment
CN111181980B (en) * 2019-12-31 2022-05-10 奇安信科技集团股份有限公司 Network security-oriented regular expression matching method and device
CN112364059B (en) * 2020-11-10 2023-12-22 国网甘肃省电力公司白银供电公司 Correlation matching method, device, equipment and storage medium under multi-rule scene
CN112615874B (en) * 2020-12-23 2022-11-15 北京天融信网络安全技术有限公司 Network protection method and device
CN113505585B (en) * 2021-07-15 2023-03-21 中南大学湘雅医院 High-speed character string feature matching method, device and equipment based on primitive state machine
CN114584632B (en) * 2022-02-24 2023-05-16 成都北中网芯科技有限公司 Deep packet inspection method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6182228B1 (en) * 1998-08-17 2001-01-30 International Business Machines Corporation System and method for very fast IP packet filtering
CN101257461A (en) * 2007-03-02 2008-09-03 华为技术有限公司 Method and apparatus for filtering content based on classification
CN101399749A (en) * 2007-09-27 2009-04-01 华为技术有限公司 Method, system and device for packet filtering
CN102857493A (en) * 2012-06-30 2013-01-02 华为技术有限公司 Content filtering method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101360088B (en) * 2007-07-30 2011-09-14 华为技术有限公司 Regular expression compiling, matching system and compiling, matching method
CN101841546B (en) * 2010-05-17 2013-01-16 华为技术有限公司 Rule matching method, device and system
CN102497319B (en) * 2011-12-13 2014-10-08 曙光信息产业(北京)有限公司 System and method for realizing single packet matching by utilizing automaton

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899264A (en) * 2015-05-21 2015-09-09 东软集团股份有限公司 Multi-mode regular expression matching method and apparatus
CN107784478A (en) * 2016-08-31 2018-03-09 北京国双科技有限公司 The treating method and apparatus of administrative organization's information
CN107784478B (en) * 2016-08-31 2020-09-15 北京国双科技有限公司 Method and device for processing administrative institution information
CN115047835A (en) * 2022-06-27 2022-09-13 中国核动力研究设计院 Method, device, equipment and medium for acquiring periodic test data based on DCS (distributed control System)
CN115047835B (en) * 2022-06-27 2024-06-04 中国核动力研究设计院 DCS-based periodic test data acquisition method, device, equipment and medium

Also Published As

Publication number Publication date
CN102857493B (en) 2015-07-08
CN102857493A (en) 2013-01-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13809101

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13809101

Country of ref document: EP

Kind code of ref document: A1