WO2014000485A1

WO2014000485A1 - Content filtration method and device

Info

Publication number: WO2014000485A1
Application number: PCT/CN2013/073462
Authority: WO
Inventors: 尤里•哈桑; 艾维•菲尔; 莫默
Original assignee: 华为技术有限公司
Priority date: 2012-06-30
Filing date: 2013-03-29
Publication date: 2014-01-03
Also published as: CN102857493A; CN102857493B

Abstract

Embodiments of the present invention provide a content filtration method and device. The method comprises: extracting a keyword from each entered rule condition; dividing the rule conditions into one or more groups according to the extracted keywords, and pre-compiling a group matching dataset for the extracted keywords; pre-compiling a precise matching dataset for the rule conditions of the group corresponding to each extracted keyword; obtaining to-be-filtered content; using the group matching dataset to perform keyword matching on the to-be-filtered content; using the precise matching dataset of the rule conditions of the group corresponding to the matched keyword to perform precise matching of the rule conditions on the to-be-filtered content; and executing a corresponding filtration policy according to a matching result of the precise matching. Because the rule conditions are pre-filtered by group, the number of rule conditions in each group is small and the occupied memory is reduced. In addition, the precise matching performed on the rule conditions after the group pre-filtration provides high matching accuracy.

Description

Content filtering method and device

Technical field

Embodiments of the present invention relate to data processing technologies, and in particular, to a content filtering method and apparatus.

Background

As the world's largest information center, the Internet is growing at an alarming rate, but the information is mixed, and there are many bad websites and bad resources. There are also suspicious websites that contain malware that can threaten the user's personal privacy or even damage the user's computer.

In order to avoid the harm of bad information, the prior art uses a content filtering technology based on an application layer protocol to filter web pages. For example, for an enterprise network gateway, a filtering policy can be configured to filter webpages of certain types of content, thereby restricting behaviors prohibited by internal users of the enterprise network, such as prohibiting access to bad websites or watching online movies.

The prior art typically classifies a target website by using the Uniform Resource Locator (URL) address in a Hypertext Transfer Protocol (HTTP) request message. If the web page is found to be of a type that should be filtered, such as pornography or violence, the HTTP request is redirected to a prompt page, or the network connection is disconnected directly.

In the existing content filtering technology, the user generally presets rule conditions and filter conditions. A pre-compiled filter matches the URL address of the requested webpage against the rule conditions, and a URL address that matches a rule condition is then blocked or released according to the filter conditions. A rule condition may be, for example, a single string matching condition such as "if URL contains sina" or "if URL equals www.abc.com". Each rule condition may be compiled into a DFA graph based on the Deterministic Finite-State Automaton (DFA) algorithm, and each webpage address is exactly matched against the DFA graph to determine whether it satisfies the rule condition. A filter condition may be, for example, the policy "release the webpage when 'if URL contains sina' is satisfied", or the policy "block or redirect the webpage when 'if URL equals www.abc.com' is satisfied". Therefore, a webpage address that matches a rule condition must be further matched against the filter conditions to determine which processing policy to execute.

However, such prior-art content filtering techniques have major drawbacks. Rule condition matching for content filtering of URL addresses is performed by using DFA graphs. When the number of rule conditions is too large, or when complex rule conditions must be supported, for example regular expressions that include wildcards such as ".*/Abc.*/news" or ".*\.www\.domain.*\.com", the DFA consumes a large amount of memory. This is the main drawback of the DFA algorithm. The prior art can use a compressed DFA, such as the D2FA (Delayed DFA) algorithm, instead of the standard DFA for matching, but the matching performance is low because the D2FA algorithm is several times slower than the standard DFA.

Therefore, how to balance memory footprint and matching performance in content filtering has become a technical problem to be solved in the prior art.

Summary of the invention

Embodiments of the present invention provide a content filtering method and apparatus to reduce memory usage of content filtering and obtain a good matching effect.

An embodiment of the present invention provides a content filtering method, including:

Extracting keywords from one or more input rule conditions respectively;

Dividing the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keyword, and pre-compiling a group matching data set for the extracted keywords;

Pre-compiling an exact matching data set for the rule conditions of the group corresponding to each extracted keyword;

Obtaining the content to be filtered;

Performing keyword matching on the to-be-filtered content by using the group matching data set, to obtain a matched keyword;

Performing exact matching of the rule conditions on the to-be-filtered content by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword;

Executing, according to the matching result of the exact matching, a filtering policy corresponding to the matching result.

An embodiment of the present invention further provides a content filtering apparatus, including a content obtaining module, a content filtering module, and a policy implementation module, where

The content obtaining module is configured to acquire content to be filtered;

The content filtering module includes:

a keyword extracting unit, configured to extract keywords from one or more input rule conditions respectively;

a group compiling unit, configured to divide the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keyword, and to pre-compile a group matching data set for the extracted keywords;

a rule condition compiling unit, configured to pre-compile an exact matching data set for the rule conditions of the group corresponding to each extracted keyword;

a group matching unit, configured to perform keyword matching on the to-be-filtered content by using the group matching data set, to obtain a matched keyword;

a rule condition matching unit, configured to perform exact matching of the rule conditions on the to-be-filtered content by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword;

The policy implementation module is configured to execute, according to the matching result of the exact matching, a filtering policy corresponding to the matching result.

With the content filtering method and apparatus provided by the embodiments of the present invention, because the rule conditions are pre-filtered by group based on the keywords, the number of rule conditions in each group is small, and the total memory occupied by the exact matching data sets constructed for the individual groups of rule conditions is less than that occupied by a data set formed by pre-compiling all the rule conditions together. Because exact matching of the rule conditions is performed after the group pre-filtering, the content to be filtered is accurately compared with the rule conditions, and the matching accuracy is high. Therefore, the technical solution of the embodiments of the present invention optimizes the matching performance while occupying less memory, and obtains a more accurate matching result.

Drawings

FIG. 1 is a flowchart of a content filtering method according to Embodiment 1 of the present invention;

FIG. 2 is a flowchart of a content filtering method according to Embodiment 3 of the present invention;

FIG. 3 is a flowchart of a content filtering method according to Embodiment 4 of the present invention;

FIG. 4 is a flowchart of a content filtering method according to Embodiment 5 of the present invention;

FIG. 5 is a flowchart of an applicable example of Embodiment 5 of the present invention;

FIG. 6 is a schematic structural diagram of a content filtering apparatus according to Embodiment 6 of the present invention;

FIG. 7 is a schematic structural diagram of a content filtering apparatus according to Embodiment 7 of the present invention;

FIG. 8 is a schematic structural diagram of a content filtering apparatus according to Embodiment 8 of the present invention;

FIG. 9 is a schematic diagram of a network architecture applicable to Embodiment 9 of the present invention;

FIG. 10 is a schematic diagram of a process for extracting a keyword in a content filtering method according to Embodiment 9 of the present invention;

FIG. 11 is a schematic diagram of a filtering process performed in a content filtering method according to Embodiment 9 of the present invention;

FIG. 12 is a schematic diagram showing a correspondence between a group and an algorithm in a content filtering method according to an embodiment of the present invention;

FIG. 13 is a schematic structural diagram of a computer system according to an embodiment of the present invention;

FIG. 14 is a schematic structural diagram of a computer system according to another embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

Embodiment 1

FIG. 1 is a flowchart of a content filtering method according to Embodiment 1 of the present invention. The content filtering method in this embodiment is applicable to various scenarios in which text content needs to be filtered, and may be implemented by software and/or hardware. A typical case is web content filtering based on a text application layer protocol, which can be implemented by software integrated in a gateway.

The content filtering method mainly includes a pre-compilation process for the rule condition and a filtering process for the content to be filtered, and specifically includes the following steps:

Step 110: Extract keywords from one or more input rule conditions respectively;

Step 120: Divide the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keyword, and pre-compile the group matching data set for the extracted keywords;

Step 130: Pre-compile an exact matching data set for the rule conditions of the group corresponding to each extracted keyword;

The above steps 110-130 are the pre-compilation process, which compiles the rule conditions input by the user so that the content to be filtered can be matched quickly when the filtering process is executed.

Step 140: Obtain content to be filtered.

Step 150: Perform keyword matching on the to-be-filtered content by using the group matching data set, to obtain a matched keyword;

Step 160: Perform exact matching of the rule conditions on the to-be-filtered content by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword;

Step 170: Perform a filtering policy corresponding to the matching result according to the matching result of the exact matching.

The above steps 140-170 are the content filtering process, which matches the content to be filtered against the matching data sets constructed by the pre-compilation process.
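The two phases described above can be sketched as follows. This is a minimal illustration, not the patented implementation: Python's `re` module stands in for both the group matching data set and the exact matching data sets, and the names `extract_keyword`, `precompile`, `filter_content`, `RULES`, and `POLICIES`, as well as the "longest literal run" extraction policy, are assumptions made for the example. A production group matcher would more likely use a multi-pattern algorithm that also reports overlapping keywords.

```python
import re
from collections import defaultdict
from typing import Optional


def extract_keyword(rule: str) -> Optional[str]:
    """Hypothetical extraction policy: longest literal run of 3+ alphanumerics."""
    literals = re.findall(r"[A-Za-z0-9]{3,}", rule)
    return max(literals, key=len) if literals else None


def precompile(rules):
    """Steps 110-130: group rule conditions by keyword and compile both data sets."""
    groups = defaultdict(list)
    for rule in rules:
        keyword = extract_keyword(rule)
        if keyword is not None:
            groups[keyword].append(re.compile(rule))
    # Group matching data set: a single alternation over all keywords.
    group_matcher = re.compile("|".join(re.escape(k) for k in groups))
    return group_matcher, dict(groups)


def filter_content(content, group_matcher, groups, policies, default="allow"):
    """Steps 140-170: keyword pre-filter, then exact match inside the hit groups."""
    hit_keywords = sorted({m.group(0) for m in group_matcher.finditer(content)})
    for keyword in hit_keywords:
        for condition in groups.get(keyword, ()):
            if condition.fullmatch(content):
                return policies.get(condition.pattern, default)
    return default


# Assumed rule conditions and policies, for illustration only.
RULES = [r".*/Abc.*/news", r".*\.www\.domain.*\.com"]
POLICIES = {RULES[0]: "block", RULES[1]: "redirect"}
GROUP_MATCHER, GROUPS = precompile(RULES)
```

Under these assumptions, a URL that contains the keyword "news" but does not satisfy the full rule condition passes the group stage, fails the exact stage, and falls through to the default policy.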

The matching data sets applicable to the rule conditions and filtering rules in the content filtering technology may be referred to as a content filtering rule base. The rule conditions and filtering rules are generally configured dynamically by a user such as an administrator, instead of being periodically updated manually or remotely by the device provider. Therefore, how to automatically construct an efficient content filtering rule base based on the rule conditions and filtering rules entered by the user is a key issue in implementing the content filtering method.

When the content filtering technology is implemented, the user usually enters multiple rule conditions, which can be represented by regular expressions. A rule condition generally specifies the content to be matched by one field of the text application protocol. If multiple fields need to be matched in the filtering process, for example a URL address, a content type (Content-Type) header field, and a user agent (User-Agent) header field, the pre-compilation process is performed separately for the rule conditions corresponding to each field. The pre-compilation process of this embodiment is described by taking one field as an example. If rule conditions exist for multiple fields, the technical solution of this embodiment may be executed repeatedly, once per field.

In the pre-compilation process of this embodiment, keywords are extracted from the rule conditions based on a preset policy. A keyword is a field that represents the core content of a rule condition with as few characters as possible. The preset policy for extracting keywords that meet this requirement can be implemented in various ways, which are introduced in subsequent embodiments. Since the extracted keywords reflect the core content of the rule conditions, the rule conditions are grouped based on the keywords: rule conditions that have the same keyword are placed in one group, so that rule conditions with similar content fall into the same group. Here, "the same keyword" is not strictly limited to identical text; associated keywords can also be regarded as the same keyword based on the preset policy. Subsequently, on the one hand, a group matching data set is pre-compiled for all keywords, and on the other hand, an exact matching data set is pre-compiled for each group of rule conditions. A data set here refers to data pre-compiled according to a content matching algorithm, which allows string comparison to be completed quickly during matching. For example, a pure string matching algorithm, the nondeterministic finite-state automaton (NFA) algorithm, or the DFA algorithm can be used to build a matching data set.

Both the group matching data set and the exact matching data set preferably employ a matching algorithm capable of exactly matching character strings. The choice should balance performance against memory footprint: an algorithm with higher performance generally consumes more memory, and vice versa. Most network data must be processed by the group matching algorithm, while only the small amount of data that matches a group proceeds to exact matching. Therefore, the group matching algorithm for keywords can be biased toward performance, to ensure that keywords are matched quickly, and the exact matching algorithm for rule conditions can be biased toward low memory usage, to avoid occupying too much memory as the number of rule conditions grows.

Based on the group matching data set and the exact matching data sets constructed by the pre-compilation process, when the filtering process is executed, the content to be filtered is first matched against the group matching data set to identify whether the content contains a keyword, and which keyword it contains. When the content is found to contain a certain keyword, the content is exactly matched against the rule conditions by using the exact matching data set of the group corresponding to that keyword. The result is that the content either matches a rule condition or does not. This matching result can be used as the basis for subsequent filtering rule identification or for executing the corresponding processing policy. When the content to be filtered contains no keyword, it evidently matches no rule condition, and exact matching need not be performed. This matching result may also be used as a basis for executing the subsequent filtering policy.

In the technical solution of this embodiment, because the rule conditions are pre-filtered by group based on the keywords, the number of rule conditions in each group is small, and the total memory occupied by the exact matching data sets constructed for the individual groups is less than the memory occupied by a single data set compiled from all the rule conditions. Because exact matching of the rule conditions is performed after the group pre-filtering, the content to be filtered is accurately compared with the rule conditions, and the matching accuracy is high. Therefore, the technical solution of this embodiment optimizes the matching performance while occupying less memory, and obtains a more accurate matching result.

On the basis of the above embodiment, when keywords are extracted in step 110, it is still possible that no keyword can be extracted from a rule condition according to the preset policy. In such a case, the rule condition from which no keyword can be extracted may be discarded, but it is preferable to perform the following operations:

When it is recognized that no keyword can be extracted from an input rule condition, the rule condition is put into a to-be-prompted group, an exact matching data set is pre-compiled for the rule conditions of the to-be-prompted group, and a prompt indicating a poor rule condition is issued to the user.

Correspondingly, after keyword matching is performed on the to-be-filtered content by using the group matching data set, the method further includes: when the to-be-filtered content matches no keyword, performing exact matching of the rule conditions on that content by using the exact matching data set corresponding to the rule conditions of the to-be-prompted group.

In the above case, the fact that no keyword can be extracted means that such rule conditions cannot be grouped by keyword and then exactly matched; only a complete exact match can be performed. Exactly matching all content that contains no keyword further ensures the accuracy of the filtering, but it does not help reduce memory usage. In addition, exact matching of such rule conditions is usually slower than group matching and costs considerable time. Therefore, in this situation a prompt indicating a poor rule condition can be sent to the user, stating that such rule conditions increase the time and space burden on the system and should be avoided.
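The fallback handling described above might look like the following sketch. It assumes the same hypothetical "longest literal run" extraction policy as a stand-in for the preset policy; the names `precompile_with_fallback` and `match_conditions` are illustrative, and the prompt to the user is modeled simply as a list of message strings.

```python
import re


def extract_keyword(rule):
    """Hypothetical policy: longest literal run of 3+ alphanumerics, or None."""
    literals = re.findall(r"[A-Za-z0-9]{3,}", rule)
    return max(literals, key=len) if literals else None


def precompile_with_fallback(rules):
    """Group rules by keyword; keyword-less rules go to the to-be-prompted group."""
    groups, to_be_prompted, prompts = {}, [], []
    for rule in rules:
        keyword = extract_keyword(rule)
        if keyword is None:
            to_be_prompted.append(re.compile(rule))
            prompts.append(f"poor rule condition (no keyword): {rule}")
        else:
            groups.setdefault(keyword, []).append(re.compile(rule))
    return groups, to_be_prompted, prompts


def match_conditions(content, groups, to_be_prompted):
    """Exact-match inside hit groups; use the fallback group when no keyword hits."""
    hit_keywords = [k for k in groups if k in content]
    if hit_keywords:
        candidates = [c for k in hit_keywords for c in groups[k]]
    else:
        candidates = to_be_prompted
    return [c.pattern for c in candidates if c.fullmatch(content)]


# Assumed rule conditions: the second one contains no usable literal keyword.
GROUPS, TO_BE_PROMPTED, PROMPTS = precompile_with_fallback(
    [r".*sina.*", r"\d+\.\d+\.\d+\.\d+"]
)
```

Every piece of content that hits no keyword is checked against the whole to-be-prompted group, which is why the text recommends warning the user about such rule conditions.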

In this embodiment, the content to be filtered may be obtained as follows: deep packet inspection (DPI) technology is used to identify the protocol of a received data packet, where the text protocol types for content filtering generally include HTTP, the Session Initiation Protocol (SIP), and the Real Time Streaming Protocol (RTSP); based on the identified protocol, field parsing is performed on the data packet to obtain at least one preset field; and each preset field is used as content to be filtered, so that the subsequent group matching, exact matching, and filter matching operations are performed on each field separately. A filtering rule is formed by combining one or more rule conditions corresponding to one or more preset fields. For example, the preset fields may include the request method of an HTTP message in an HTTP packet, the request URL, the content type (Content-Type) header field, and the User-Agent header field.
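As a rough illustration of the field parsing step for HTTP, the sketch below splits a raw request into preset fields. It is not a DPI engine: protocol identification is skipped, the request is assumed to be well formed and already reassembled, and the function name `parse_preset_fields` and the choice of the Host header as the "domain" field are assumptions for the example.

```python
def parse_preset_fields(raw: bytes) -> dict:
    """Split a well-formed HTTP request head into the preset fields to be filtered."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    request_line, *header_lines = head.split("\r\n")
    method, url, _version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "method": method,
        "url": url,
        "domain": headers.get("host", ""),
        "content-type": headers.get("content-type", ""),
        "user-agent": headers.get("user-agent", ""),
    }


# Illustrative request; each resulting field is matched independently.
REQUEST = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: www.abc.com\r\n"
    b"User-Agent: Mozilla/5.0 Chrome\r\n"
    b"\r\n"
)
FIELDS = parse_preset_fields(REQUEST)
```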

Embodiment 2

The content filtering method provided by Embodiment 2 of the present invention further improves the pre-compilation and filtering process of the filtering rules on the basis of the foregoing embodiment. In the above embodiment, pre-compilation and filtering of the filtering rules can be performed based on various technologies. For example, after the rule conditions are matched, the corresponding identifiers are recorded, each filtering rule is then checked against the recorded identifiers, and the corresponding filtering policy is executed. Alternatively, a tree structure can be used to construct the filtering rules, and the matched rule conditions are looked up in the tree structure.

This embodiment provides another preferred filtering rule matching scheme. At any time of the pre-compilation process, the following steps are performed:

Assigning a unique condition identifier to each of the one or more rule conditions, and pre-compiling a filter matching data set for the filtering rules, where each filtering rule is formed by combining one or more rule conditions and is expressed by using the condition identifiers of those rule conditions as characters. That is, the filtering rules expressed in character form are pre-compiled into a filter matching data set, such as a DFA or D2FA state machine;

Then, in the filtering process, performing a filtering policy corresponding to the matching result according to the matching result of the exact matching includes:

Using the filter matching data set, the condition identifiers of the rule conditions that the to-be-filtered content exactly matched are taken as characters and matched against the filtering rules, and the to-be-filtered content is filtered according to the filtering policy of the filtering rule that is matched.

A filtering rule usually consists of one or more rule conditions. When the rule conditions of a filtering rule are satisfied by the content to be filtered, the filtering rule is successfully matched and the corresponding filtering policy is executed. Examples of filtering policies include redirecting the webpage to a prompt page that informs the user that the request has been blocked; discarding the webpage directly and resetting the client connection; and releasing the webpage.

In this embodiment, the condition identifier of each rule condition is used as a character, and a filtering rule takes the form of a character string composed of condition identifiers. That is, the filtering rule is converted into a regular expression over condition identifiers. Multiple filtering rules can thus be pre-compiled together to implement multi-pattern matching, so that a single matching pass determines which filtering rule the content to be filtered matches, without multiple queries, which optimizes filtering performance.

An example is provided below. Suppose the filtering rule is "If domain = "www\.porn.*\.com" and (User-Agent = ".*Chrome" or User-Agent = ".*Firefox") and Content-Type = Any then Redirect." This means that if the "Chrome" or "Firefox" browser is used to access the "www\.porn.*\.com" adult website, the message is redirected to a prompt page stating that it has been filtered. "Content-Type" can be arbitrary content; it could be omitted here and is retained only to explain the idea of the solution. Assume that the condition identifiers of the rule conditions are as follows:

"www\.porn.*\.com" = \x87

".*Chrome" = \x91

".*Firefox" = \x13

You can then convert the filter rules directly into regular expressions:

If there are multiple filtering rules, they are likewise compiled together to form a filter matching data set, such as a DFA or D2FA state machine. During matching, the fields are processed in the order predefined by the filtering rules:

The first content to be filtered is a "Domain" field, which records the condition identifier of the rule condition to which the content to be filtered matches;

The second content to be filtered is a "User-Agent" field, which records the condition identifier of the rule condition to which the content to be filtered matches;

The third content to be filtered is the "Content-Type" field, which records the condition identifier of the rule condition to which the content to be filtered matches. Note that the last character of the regular expression is ".", indicating any;

Then, by using the filter matching data set to match the recorded condition identifiers against the filtering rules, the filtering policy to execute is obtained.

In this way, if multiple filtering rules need to be matched, it is only necessary to match each condition identifier once in order, rather than matching the filtering rules one by one, and performance is significantly improved. In addition, D2FA can be used instead of DFA to save memory.
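The identifier-based rule matching in this example can be sketched as below. Several points are assumptions rather than statements of the source: the regular expression `\x87[\x91\x13].` is one plausible encoding of the example rule (domain, then either browser, then any Content-Type), the byte `\x00` is a made-up placeholder for "no rule condition matched", only the first matching condition per field is recorded, and Python's `re` engine stands in for the DFA or D2FA state machine. Named groups are used so that one combined pattern can report which filtering rule matched.

```python
import re

# Condition identifiers from the example; one byte per rule condition.
CONDITIONS = {
    "domain": [(re.compile(r"www\.porn.*\.com"), b"\x87")],
    "user-agent": [(re.compile(r".*Chrome"), b"\x91"),
                   (re.compile(r".*Firefox"), b"\x13")],
    "content-type": [],
}
FIELD_ORDER = ["domain", "user-agent", "content-type"]
NO_MATCH = b"\x00"  # assumed placeholder, not taken from the source

# Filtering rules over identifiers (assumed encoding) and their policies.
FILTER_RULES = [(rb"\x87[\x91\x13].", "redirect")]
FILTER_MATCHER = re.compile(
    b"|".join(b"(?P<r%d>%s)" % (i, body) for i, (body, _) in enumerate(FILTER_RULES)),
    re.DOTALL,
)


def identifier_string(fields):
    """Record, per field and in rule order, the identifier of the matched condition."""
    key = b""
    for name in FIELD_ORDER:
        ident = NO_MATCH
        for pattern, condition_id in CONDITIONS[name]:
            if pattern.fullmatch(fields.get(name, "")):
                ident = condition_id
                break
        key += ident
    return key


def select_policy(fields, default="allow"):
    """Match the identifier string against all filtering rules in one pass."""
    m = FILTER_MATCHER.fullmatch(identifier_string(fields))
    if m is None:
        return default
    return FILTER_RULES[int(m.lastgroup[1:])][1]
```

Adding further filtering rules only extends `FILTER_RULES`; the lookup remains a single match over the identifier string.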

When the number of condition identifiers exceeds 255, so that a single byte can no longer serve as a condition identifier, all rule conditions can be identified by double-byte condition identifiers. For example, the third condition identifier below is 525, that is, hexadecimal 0x020d.

"www\.porn.*\.com" = \x87

".*Chrome" = \x91

".*Firefox" = \x02\x0d

The expression of the filter rule is converted to

" ^A \x00\x87\x00\x91\x02\x0d.. "

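A double-byte encoding could be sketched as follows. The big-endian byte order matches the 0x020d example above; everything else is an assumption for illustration, including the helper names `encode_id` and `rule_pattern` and the use of an alternation of two-byte sequences where the single-byte case could use a character class.

```python
import re


def encode_id(condition_id: int) -> bytes:
    """Encode a condition identifier as two big-endian bytes."""
    return condition_id.to_bytes(2, "big")


def rule_pattern(domain_id, agent_ids):
    """Domain AND (any listed agent) AND any Content-Type, over 2-byte identifiers."""
    agents = b"|".join(re.escape(encode_id(a)) for a in agent_ids)
    # The trailing ".." consumes the two bytes of an arbitrary third identifier.
    body = re.escape(encode_id(domain_id)) + b"(?:" + agents + b").."
    return re.compile(body, re.DOTALL)


# Identifiers from the example: 0x87, 0x91 and 525 (0x020d).
RULE = rule_pattern(0x87, [0x91, 525])
```

A character class cannot express "one of these two-byte units", so the alternation is grouped explicitly and the whole identifier string is matched with `fullmatch` to keep the two-byte alignment.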
Embodiment 3

FIG. 2 is a flowchart of a content filtering method according to Embodiment 3 of the present invention. The above embodiments describe the pre-compilation of the rule conditions and filtering rules input by the user at the initial stage. In actual applications, the user can add, delete, and change rule conditions and filtering rules at any time, where a change is equivalent to a deletion followed by an addition. This embodiment optimizes the operation of adding a rule condition, and the content filtering method further performs the following operations:

Step 210: When a newly added rule condition is obtained, extract a keyword from the newly added rule condition;

Step 220: Search or create a corresponding group according to a keyword extracted from the newly added rule condition, and recompile the group matching data set.

Specifically, this step may first search the existing groups for the keyword. If the keyword is not found, a new group is created for it and the group matching data set is recompiled. If the keyword is found, the group matching data set does not need to be recompiled.

Step 230: Pre-compile the exact matching data set of the rule conditions of the corresponding group according to the newly added rule condition;

In this step, the exact matching data set is recompiled for the group concerned, whether it is an existing group or a new one. Data sets implemented by different algorithms may be compiled in different ways. For example, if DFA is used to compile all the rule conditions in a group into one state machine, the entire DFA state machine must be recompiled. If the group uses one-by-one single-pattern matching, only the new rule condition needs to be compiled and added to the matching chain.

Step 240: Assign a condition identifier to the newly added rule condition, and recompile the filter matching data set.

The technical solution of this embodiment enables the user to add new rule conditions flexibly. A newly added rule condition requires updating at most the group matching data set, the filter matching data set, and one exact matching data set. If the new rule condition does not produce a new keyword, the group matching data set does not need to be updated. Compared with the prior art, it is not necessary to rebuild all the pre-compiled data sets.
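The incremental update of steps 210-240 can be illustrated with the sketch below. It assumes the one-by-one single-pattern variant of the exact matching data set (a list of compiled expressions per group), a hypothetical "longest literal run" keyword policy, and sequential integers as condition identifiers; the class name `RuleBase` and its attributes are invented for the example, and recompiling the filter matching data set is left out.

```python
import re


def extract_keyword(rule):
    """Hypothetical policy: longest literal run of 3+ alphanumerics, or None."""
    literals = re.findall(r"[A-Za-z0-9]{3,}", rule)
    return max(literals, key=len) if literals else None


class RuleBase:
    def __init__(self):
        self.groups = {}            # keyword -> exact matching data set (match chain)
        self.group_matcher = None   # group matching data set over all keywords
        self.condition_ids = {}     # rule condition -> condition identifier
        self.group_recompiles = 0   # how often the group matching data set was rebuilt

    def _recompile_group_matcher(self):
        self.group_matcher = re.compile("|".join(map(re.escape, self.groups)))
        self.group_recompiles += 1

    def add_condition(self, rule):
        """Add one rule condition; return True if a new group had to be created."""
        keyword = extract_keyword(rule)                      # step 210
        if keyword is None:
            raise ValueError("no keyword could be extracted")
        is_new_group = keyword not in self.groups            # step 220
        # Step 230: single-pattern chain, so only the new condition is compiled.
        self.groups.setdefault(keyword, []).append(re.compile(rule))
        if is_new_group:
            self._recompile_group_matcher()
        self.condition_ids[rule] = len(self.condition_ids) + 1   # step 240
        return is_new_group
```

In this sketch, adding a second rule condition that shares the keyword "sina" touches only that group's match chain and leaves the group matching data set unchanged.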

Embodiment 4

FIG. 3 is a flowchart of a content filtering method according to Embodiment 4 of the present invention. This embodiment further optimizes the operation of deleting a rule condition on the basis of the above embodiments. The content filtering method further includes the following steps:

Step 310: According to an input rule condition deletion instruction, determine the rule condition to be deleted or the condition identifier corresponding to the rule condition to be deleted, and extract a keyword from the rule condition to be deleted;

Step 320: Update the group matching data set according to the keyword extracted from the rule condition to be deleted.

Step 330: If a rule condition is to be deleted, recompile the exact matching data set of the rule conditions of the group corresponding to the keyword extracted from the rule condition to be deleted, so as to delete that rule condition.

Certainly, if no rule condition remains in the group corresponding to the keyword, the exact matching data set of the group is deleted, the keyword is deleted, and the group matching data set is recompiled;

Step 340: If the condition identifier corresponding to the rule condition to be deleted needs to be deleted, recompile the filter matching data set to delete that condition identifier.

Similar to the third embodiment, this embodiment can flexibly delete the rule conditions without adjusting all the pre-compiled data sets.

Adding, deleting, and changing filtering rules is similar to the corresponding operations on rule conditions: the filter matching data set can be recompiled according to a newly added filtering rule or a filtering rule deletion instruction, so as to add or delete the filtering rule.
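The bookkeeping for a deletion can be sketched in a few lines. The function below only reports which pre-compiled data sets would need recompiling; the name `delete_condition`, the string labels it returns, and the representation of groups as a dictionary of rule condition lists are assumptions for illustration.

```python
def delete_condition(groups, keyword, rule, drop_identifier=False):
    """Remove `rule` from the group of `keyword`.

    Returns the set of data sets to recompile: "exact:<keyword>" for the
    group's exact matching data set, "group" for the group matching data
    set, and "filter" for the filter matching data set.
    """
    conditions = groups[keyword]
    conditions.remove(rule)
    if conditions:
        to_recompile = {"exact:" + keyword}
    else:
        # No rule condition is left: drop the group and its keyword.
        del groups[keyword]
        to_recompile = {"group"}
    if drop_identifier:
        to_recompile.add("filter")
    return to_recompile
```

Only the affected group is touched; the exact matching data sets of all other groups stay as compiled.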

Embodiment 5

FIG. 4 is a flowchart of a content filtering method according to Embodiment 5 of the present invention. The content filtering methods provided by the foregoing embodiments all extract keywords, and the quality of keyword extraction directly affects the performance of subsequent group matching and exact matching, as well as the amount of memory required by the content filtering rule base. The operation of extracting keywords from one or more input rule conditions may be implemented in various ways, for example, by the following steps:

Step 410: Divide the input rule condition into fields according to a preset division policy;

Step 420: Screen the divided fields based on a preset screening policy to obtain the keyword of the rule condition.

The operation of screening the divided fields based on the preset screening policy to obtain the keyword of the rule condition preferably performs the following process:

From the divided fields, delete the fields that match a field in the blacklist;

According to the recorded number of false hits of each field, delete the fields whose number of false hits is higher than the hit threshold;

For each rule condition, among the remaining candidate fields of the rule condition, select the field whose keyword group contains the fewest rule conditions as the keyword of the rule condition.

However, those skilled in the art can understand that the above items can also be performed independently or in other orders. Other screening policies can be added, such as selecting fields that match fields in the whitelist as keywords.

In practical applications, multiple screening policies can be set according to requirements, and their execution order is not limited. The divided fields can be screened in multiple rounds to obtain the fields that carry the core content of the rules. Those skilled in the art can understand that the keyword screening policy is not limited to the above items. The basis for determining the preferred screening policy is as follows: the more false hits a keyword has, or the higher its false-hit rate, the lower the actual matching performance; the more rule conditions a group contains, the more memory is occupied. Therefore, the keyword extraction strategy should try to balance matching performance against memory usage.

In addition to being set statically, the blacklist, the whitelist, and the number of false hits can be updated through dynamic statistics. For example, after the exact matching of the rule conditions is performed on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword, the method further includes:

When content to be filtered that matched a keyword does not match the corresponding rule condition by using the exact matching data set, updating the number of false hits of the keyword;

Adding keywords whose number of false hits is above the set threshold to the blacklist.

By performing dynamic statistics based on the actual matching results, the blacklist, the whitelist, and the number of false hits can be kept up to date, which improves the accuracy of the keyword extraction strategy and thereby the matching performance of content filtering. Preferably, the keyword extraction, grouping, and pre-compilation operations are re-executed on the existing rule conditions at a set period, based on the current number of false hits, the blacklist, and so on, to optimize the pre-compiled data sets and obtain better matching performance.
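The dynamic statistics described above can be illustrated with a minimal sketch. The class name `KeywordStats`, its fields, and the default threshold value are assumptions made for illustration and are not taken from the embodiment; the sketch only shows a false-hit counter that moves a keyword onto the blacklist once its count rises above a set threshold.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordStats:
    """Hypothetical container for the dynamically maintained statistics."""
    threshold: int = 4                      # assumed false-hit threshold
    false_hits: dict = field(default_factory=dict)
    blacklist: set = field(default_factory=set)

    def record(self, keyword: str, exact_matched: bool) -> None:
        """Update statistics after exact matching was tried for `keyword`."""
        if exact_matched:
            return  # the keyword hit was genuine, nothing to count
        # The keyword matched but no rule condition in its group did:
        # this is a false hit.
        self.false_hits[keyword] = self.false_hits.get(keyword, 0) + 1
        if self.false_hits[keyword] > self.threshold:
            self.blacklist.add(keyword)


stats = KeywordStats(threshold=2)
for _ in range(3):
    stats.record("goog", exact_matched=False)
stats.record("huaw", exact_matched=True)
```

A blacklisted keyword would only stop being used after the periodic re-extraction, grouping, and pre-compilation described above; that step is not part of this sketch.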

The following describes the extraction operation of the keyword in detail by way of example. FIG. 5 is a flowchart of an applicable example of Embodiment 5 of the present invention.

First, a keyword dynamic statistics table is maintained in the system, as shown in Table 1. The number of false hits in the table can be refreshed while the content filtering method is running, for example at a set period, or in real time according to a set trigger condition.

Table 1

Keyword | Number of false hits | Number of rule conditions grouped by this keyword | Blacklisted
huaw | 1 | 2 | No
goog | 5 | 1 | No
sina | 2 | 1 | No
yaho | 1 | 1 | No
micr | 9 | 2 | No
news | 0 | 3 | No
msdn | 1 | 1 | No
www | | | Yes
com | | | Yes

As described above, in the content filtering process, when content to be filtered matches a certain keyword but does not match the corresponding rule condition by using the exact matching data set, the keyword has been hit falsely, and the number of false hits of the keyword is incremented by 1.

Blacklists and whitelists can be statically configured. Alternatively, keywords whose number of false hits is above the set threshold can be added to the blacklist, or keywords whose number of false hits is below the set threshold can be added to the whitelist. In practical applications, either the number of false hits or the false-hit rate can be used as the criterion. The keyword dynamic maintenance table is updated in real time as keywords are extracted or deleted and as content filtering is performed.

Step 501: Obtain the rule conditions entered online, in string form, by a user such as the device administrator; for example, the following rule conditions are input, where a rule condition may include a wildcard *, a character value range [x-y], and the like:

1. www.huawei*.com

2. www[0-3].huawei.com

3. *google.com/news

4. www.sina[0-9].com

5. www.yahoo*.com/news

6. *.microsoft.*

7. www.msdn.microsoft*/news

8. www.[a-z][a-z][a-z].com.cn (bad rule condition)

First, the rule conditions are converted into regular expressions, for example by converting "." to "\." and "*" to ".*".

Step 502: Perform field division on the input rule conditions according to a preset division policy, the purpose being to group the rule conditions according to keywords;

For example, the fields are divided at preset separators such as ".", "[", "]", or spaces, and the number of characters of a field can be limited, for example by keeping only the leading characters up to a set length, such as at most 4 characters. The above rule conditions are then divided into the fields www, huaw, com, goog, sina, yaho, micr, msdn, and news.
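The regular-expression conversion and the field division of Step 502 can be sketched as follows. This is only an illustration under stated assumptions: the function names `to_regex` and `split_fields` are hypothetical, "*" and "/" are treated as additional separators (covered only by "etc." in the text), and a bracketed character range is dropped as a whole rather than split at "[" and "]".

```python
import re


def to_regex(rule: str) -> str:
    """Convert a rule condition with '*' wildcards and [x-y] ranges to a regex."""
    parts = []
    for ch in rule:
        if ch == "*":
            parts.append(".*")          # wildcard: any run of characters
        elif ch in "[]-":
            parts.append(ch)            # keep character ranges as written
        else:
            parts.append(re.escape(ch)) # e.g. "." becomes "\."
    return "".join(parts)


def split_fields(rule: str, max_len: int = 4) -> list:
    """Divide a rule condition into fields truncated to `max_len` characters."""
    # Assumption: a character range such as [0-3] carries no keyword text.
    without_ranges = re.sub(r"\[[^\]]*\]", " ", rule)
    fields = []
    for part in re.split(r"[.\s*/]+", without_ranges):
        token = part[:max_len]
        if token and token not in fields:
            fields.append(token)
    return fields


fields_rule_2 = split_fields("www[0-3].huawei.com")
fields_rule_7 = split_fields("www.msdn.microsoft*/news")
```

Under these assumptions, rule condition 2 yields the fields www, huaw, and com, and rule condition 7 yields www, msdn, micr, and news, in line with the field list given above.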

Step 503: Delete the field in the blacklist based on the keyword dynamic maintenance table shown in Table 1;

That is, the www and com fields are deleted. The fields in the blacklist are usually too common to be useful for filtering.

Step 504: From the fields remaining after the blacklisted fields are deleted, delete the fields whose number of false hits is higher than the hit threshold, according to the recorded number of false hits of each field;

If the hit threshold is set to 4, goog and micr are deleted, and huaw, sina, yaho, msdn, and news are the fields that remain;

Step 505: For the filtered fields, identify the number of rule conditions corresponding to each field, and select, for each rule condition, the field whose keyword group contains the fewest rule conditions among the candidate fields of that rule condition as the keyword of the rule condition;

The candidate fields remaining for each rule condition after the filtering of Step 504 are:

1. huaw

2. huaw

3. news

4. sina

5. yaho, news

6. No keywords

7. msdn, news

8. No keywords

After the screening of Step 505, for rule condition 5, the yaho group contains 1 rule condition, fewer than the news group, so rule condition 5 selects yaho as its keyword. Similarly, rule condition 7 selects msdn as its keyword. The number of rule conditions per keyword group in Table 1 is determined by the keywords of the rule conditions and is updated in real time.

If only one field remains for a rule condition at any step before the screening of Step 505, that field can be selected directly as the keyword. Rule conditions from which no keyword can be extracted are bad rule conditions, and the user needs to be notified of them.
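Steps 503 to 505 can be sketched with the values of Table 1 and a hit threshold of 4. The constant and function names below are hypothetical, and returning `None` for a rule condition without a keyword is an assumption of this sketch; it stands for the bad rule condition case that is reported to the user.

```python
# Values copied from Table 1 of the example.
FALSE_HITS = {"huaw": 1, "goog": 5, "sina": 2, "yaho": 1,
              "micr": 9, "news": 0, "msdn": 1}
GROUP_SIZE = {"huaw": 2, "goog": 1, "sina": 1, "yaho": 1,
              "micr": 2, "news": 3, "msdn": 1}
BLACKLIST = {"www", "com"}
HIT_THRESHOLD = 4


def select_keyword(fields):
    """Return the keyword of one rule condition, or None for a bad rule."""
    # Step 503: drop blacklisted fields.
    candidates = [f for f in fields if f not in BLACKLIST]
    # Step 504: drop fields with more false hits than the threshold.
    candidates = [f for f in candidates
                  if FALSE_HITS.get(f, 0) <= HIT_THRESHOLD]
    if not candidates:
        return None
    # Step 505: prefer the field whose group holds the fewest rule conditions.
    return min(candidates, key=lambda f: GROUP_SIZE.get(f, 0))


keyword_rule_5 = select_keyword(["www", "yaho", "com", "news"])
keyword_rule_7 = select_keyword(["www", "msdn", "micr", "news"])
keyword_rule_6 = select_keyword(["micr"])
```

With these inputs the sketch selects yaho for rule condition 5 and msdn for rule condition 7, and finds no keyword for rule condition 6, matching the example.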

In the technical solutions of the above embodiments, the rule conditions are grouped according to the keywords, and the exact matching data sets pre-compiled after the grouping can use different compiling algorithms. Pre-compiling the exact matching data set for the rule conditions of the group corresponding to each extracted keyword may specifically include:

For a group in which the number of rule conditions is less than the pre-configured threshold, the exact matching data set for the group's rule conditions is pre-compiled using an NFA, DFA, or compressed DFA regular expression matching algorithm (an implementation of the NFA regular expression algorithm is, for example, PCRE (Perl Compatible Regular Expressions)), or using a single-pattern string matching algorithm, such as the BM (Boyer-Moore) matching algorithm. In this step, after it is identified that the number of rule conditions is less than the pre-configured threshold, it may further be determined whether any regular-expression elements, such as wildcards or character ranges, occur in the rule conditions; if so, NFA, DFA, or compressed DFA is used; otherwise the BM matching algorithm is used;

For a group in which the number of rule conditions is equal to or greater than the pre-configured threshold, a DFA or compressed DFA regular expression matching algorithm is used to pre-compile all rule conditions of the group into one exact matching data set, such as a DFA or D2FA state machine. The pre-configured threshold can be set to 8, for example, so as to exploit the performance advantage of D2FA multi-pattern matching over matching the rule conditions one by one with a single-pattern algorithm. Alternatively, if memory footprint is the priority, the NFA regular expression matching algorithm can always be used regardless of the number of rule conditions, pre-compiling the rule conditions into exact matching structures one by one;

For a group that includes rule conditions with set complex definition parameters, the NFA or compressed DFA regular expression matching algorithm is used to pre-compile the exact matching data set for the group's rule conditions. A rule condition with set complex definition parameters may be a rule condition whose defining parameters are judged, based on experience, to reach a certain degree of complexity. If such a rule condition were compiled into a DFA state machine, the number of states would increase sharply and occupy a large amount of memory; examples are floating rule conditions in which wildcards with "*", "?", or "+" are repeated multiple times. Floating means that the position of the expected pattern string is not fixed.

For example, when the rule conditions in the above example are grouped according to the selected keywords and the pre-configured threshold for a group is set to 2, the grouping and the exact matching data set used by each group can be as shown in Table 2 below:

Table 2

Of course, in practical applications, the algorithms used for each group are not limited to those shown in Table 2. As shown in FIG. 12, other pre-compilation algorithms can also be selected for different groups.
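The per-group choice of compiling algorithm described in this embodiment can be sketched as a simple decision function. The sketch rests on several assumptions that the embodiment leaves open: the function names are hypothetical, a rule condition counts as "complex" when it contains two or more of the wildcards "*", "?", and "+", and the complex case is checked before the size threshold. It is not intended to reproduce Table 2.

```python
WILDCARDS = "*?+"


def is_complex(rule: str) -> bool:
    """Assumed criterion: two or more repeated wildcards in one rule."""
    return sum(rule.count(ch) for ch in WILDCARDS) >= 2


def has_regex_elements(rule: str) -> bool:
    """True if the rule contains a wildcard or a character range."""
    return any(ch in rule for ch in WILDCARDS + "[")


def choose_algorithm(rules, threshold=8):
    """Pick a family of compiling algorithms for one group of rule conditions."""
    if any(is_complex(r) for r in rules):
        return "NFA or compressed DFA"
    if len(rules) >= threshold:
        return "DFA or compressed DFA"
    if any(has_regex_elements(r) for r in rules):
        return "NFA, DFA, or compressed DFA"
    return "single-pattern string matching (e.g. BM)"


huaw_choice = choose_algorithm(
    ["www.huawei*.com", "www[0-3].huawei.com"], threshold=2)
bad_group_choice = choose_algorithm(["*.microsoft.*"], threshold=2)
```

The function returns only the family of algorithms; which member of a family is finally used remains a deployment decision, as the text notes.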

Embodiment 6

FIG. 6 is a schematic structural diagram of a content filtering apparatus according to Embodiment 6 of the present invention. The content filtering apparatus may be integrated into a device that performs content filtering, such as an enterprise gateway, and is used to perform the content filtering method provided by the present invention. The content filtering apparatus specifically includes a content obtaining module 610, a content filtering module 620, and a policy implementation module 630. The content obtaining module 610 is configured to obtain the content to be filtered. The content filtering module 620 specifically includes a keyword extracting unit 621, a grouping and compiling unit 622, a rule condition compiling unit 623, a group matching unit 624, and a rule condition matching unit 625. The keyword extracting unit 621 is configured to extract keywords from each of the one or more input rule conditions. The grouping and compiling unit 622 is configured to divide the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keyword, and to pre-compile the group matching data set for the extracted keywords. The rule condition compiling unit 623 is configured to pre-compile, for each extracted keyword, an exact matching data set for the rule conditions of the group corresponding to that keyword. The group matching unit 624 is configured to perform keyword matching on the content to be filtered by using the group matching data set to obtain the matched keyword. The rule condition matching unit 625 is configured to perform exact matching of the rule conditions on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword. The policy implementation module 630 is configured to execute, according to the matching result of the exact matching, the filtering policy corresponding to that result.

The above technical solution pre-filters the content to be filtered by keyword grouping and then performs exact matching, which effectively balances memory usage against matching performance and accuracy, and provides an optimized content filtering scheme.

Based on the foregoing technical solution, the content filtering module 620 may further include a filtering rule compiling unit 626, and the policy implementation module 630 includes a filtering rule matching unit 631 and a policy implementation unit 632. The filtering rule compiling unit 626 is configured to allocate a unique condition identifier to each of the one or more rule conditions and to pre-compile the filter matching data set for the filtering rules, where a filtering rule is formed by combining one or more rule conditions, and the condition identifiers of those rule conditions are used as characters to express the filtering rule. The filtering rule matching unit 631 is configured to use the filter matching data set to match the filtering rules against the condition identifiers, taken as characters, of the rule conditions that the content to be filtered exactly matched; those rule conditions are obtained by performing exact matching of the rule conditions on the content to be filtered. The policy implementation unit 632 is configured to execute, according to the matching result of the filtering rules, the filtering policy corresponding to that result.

By using the conditional identifier to represent the rule condition and further compiling the filter rule in the form of a regular expression, a filter match can be achieved to obtain a match result.
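As a minimal sketch of this idea, each rule condition can be given a single-character identifier and a filtering rule can be written as a regular expression over those characters. Everything named below is hypothetical: the condition names, the example filtering rule, and the assumption that matched conditions arrive in a fixed field order.

```python
import re
import string


def assign_condition_ids(conditions):
    """Give each rule condition a unique single-character identifier."""
    if len(conditions) > len(string.ascii_letters):
        raise ValueError("this sketch supports at most 52 rule conditions")
    return {cond: string.ascii_letters[i] for i, cond in enumerate(conditions)}


# Hypothetical rule conditions on two preset fields (URL and request method).
CONDITION_IDS = assign_condition_ids(
    ["url-huawei", "method-get", "method-post"])

# Hypothetical filtering rule: the URL condition combined with either method
# condition, expressed as a regular expression over condition identifiers.
FILTER_RULE = re.compile(
    CONDITION_IDS["url-huawei"]
    + "[" + CONDITION_IDS["method-get"] + CONDITION_IDS["method-post"] + "]"
)


def filter_rule_matches(matched_conditions):
    """Match the filtering rule against the identifiers of matched conditions."""
    text = "".join(CONDITION_IDS[c] for c in matched_conditions)
    return FILTER_RULE.fullmatch(text) is not None


both_fields = filter_rule_matches(["url-huawei", "method-get"])
method_only = filter_rule_matches(["method-get"])
```

Here the compiled `FILTER_RULE` plays the role of the filter matching data set; a real rule base would compile many filtering rules together.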

Preferably, the rule condition compiling unit 623 is further configured to: when it is recognized that no keyword can be extracted from an input rule condition, put the rule condition into the to-be-prompted group, pre-compile an exact matching data set for the rule conditions of the to-be-prompted group, and issue a bad rule condition prompt to the user.

Correspondingly, the rule condition matching unit is further configured to: when the content to be filtered does not match any keyword, use the exact matching data set corresponding to the rule conditions of the to-be-prompted group to perform exact matching of the rule conditions on that content.

The above technical solution can ensure an exact match for all the content to be filtered, and can prompt the user to optimize the rule conditions to meet the pre-filtered grouping requirements.

Embodiment 7

FIG. 7 is a schematic structural diagram of a content filtering apparatus according to Embodiment 7 of the present invention. This embodiment is based on the foregoing embodiment, and the keyword extracting unit 621 preferably includes a field dividing subunit 621a and a field filtering subunit 621b. The field dividing subunit 621a is configured to perform field division on the input rule condition according to a preset division policy; the field filtering subunit 621b is configured to filter the divided fields based on a preset screening policy to obtain the keyword of the rule condition. The field filtering subunit is specifically configured to: delete, from the divided fields, the fields that match fields in the blacklist; delete, according to the recorded number of false hits of each field, the fields whose number of false hits is higher than the hit threshold; and, for each rule condition, select the field whose keyword group contains the fewest rule conditions among the candidate fields of the rule condition as the keyword of the rule condition. However, those skilled in the art can understand that the foregoing items can also be executed independently or in other orders. Other screening policies can be added, such as selecting fields that match fields in the whitelist as keywords.

To ensure the accuracy of the screening policy, the content filtering module may further include a statistical update unit, which specifically includes a false-hit counting subunit and a blacklist updating subunit. The false-hit counting subunit is configured to update the number of false hits of a keyword when content to be filtered that matched the keyword does not match the corresponding rule condition by using the exact matching data set; the blacklist updating subunit is configured to add keywords whose number of false hits is above the set threshold to the blacklist.

The keyword extraction policy determines the quality of keyword extraction, which directly affects the pre-filtering efficiency. The technical solution in this embodiment can dynamically update the data used by the keyword screening policy according to the actual content filtering situation, so that the extracted keywords better reflect the needs of content filtering.

On the basis of the above technical solutions, different matching algorithms can be used for different groups according to actual conditions, that is, the rule condition compiling unit specifically includes:

a first compiling subunit, configured to pre-compile, for a group in which the number of rule conditions is less than a pre-configured threshold, the exact matching data set for the group's rule conditions using an NFA, DFA, or compressed DFA regular expression matching algorithm, or using a single-pattern string matching algorithm;

a second compiling sub-unit, configured to pre-compile an exact matching data set for the set of rule conditions using a DFA or a compressed DFA regular expression matching algorithm for a group of rule conditions having a number equal to or greater than a pre-configured threshold;

A third compiling sub-unit is configured to pre-compile the exact matching data set for the set of rule conditions using a NFA or compressed DFA regular expression matching algorithm for the grouping comprising rule conditions having a set complex definition parameter.

Embodiment 8

FIG. 8 is a schematic structural diagram of a content filtering apparatus according to Embodiment 8 of the present invention. This embodiment is based on the foregoing embodiment. The improvement is that the content obtaining module 610 may specifically include a protocol identifying unit 611 and a protocol parsing unit 612. The protocol identifying unit 611 is configured to perform protocol identification on the received data packet by using a deep packet identification technology. The protocol parsing unit 612 is configured to perform field parsing on the data packet to obtain at least one preset field, and to use each preset field separately as content to be filtered, so that the subsequent group matching, exact matching, and filter matching operations are performed on each field. The filtering rule is formed by combining one or more rule conditions that correspond to one or more preset fields.
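The field parsing performed by the protocol parsing unit can be illustrated for one text protocol. The sketch below assumes that protocol identification has already classified the payload as an HTTP request; the function name `parse_http_request` and the choice of the three preset fields are assumptions for illustration, and deep packet identification itself is not shown.

```python
def parse_http_request(raw: str) -> dict:
    """Split an HTTP request head into preset fields to be filtered."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    method, target, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # Each entry is handed to the filtering stages as separate content.
    return {
        "method": method,
        "url": headers.get("host", "") + target,
        "user-agent": headers.get("user-agent", ""),
    }


request = (
    "GET /news HTTP/1.1\r\n"
    "Host: www.huawei.com\r\n"
    "User-Agent: demo\r\n"
    "\r\n"
)
preset_fields = parse_http_request(request)
```

Each value in `preset_fields` would then go through group matching and exact matching against the rule conditions defined for that field, and the resulting condition identifiers would be combined in filter matching.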

The content filtering apparatus provided by the embodiment of the present invention may perform the content filtering method provided by any embodiment of the present invention, and has a corresponding functional module structure.

Embodiment 9

Embodiment 9 of the present invention describes the content filtering method in detail by way of a preferred example. The content filtering method provided by the embodiment of the present invention operates on a text-based application layer protocol, and a rule condition may apply to any field in the protocol, such as the URL address, the request method, or a certain header field. This embodiment uses the URL address field as an example for description. However, those skilled in the art can understand that the pre-compiled data sets and the matching and filtering of other fields can be handled by the same scheme.

FIG. 9 is a schematic diagram of a network architecture applicable to Embodiment 9 of the present invention, where the network includes a local area network (LAN) network element, a wide area network (WAN) network element, a router (Router), a switch (Switch), and so on. The user terminal is connected from the LAN to the WAN via a switch and a router. An application control node is deployed between the LAN and the WAN to implement content filtering. It should be understood that the application control node has the function of the content filtering apparatus in the embodiments of the present invention. In different implementations, the application control node may be a network element such as an enterprise router, a Gateway GPRS Support Node (GGSN for short) network element device, an Internet gateway device, or a wireless controller device.

The content filtering apparatus is configured as described in Embodiment 7 or 8 and specifically performs the content filtering method provided by the embodiments of the present invention. The method mainly includes a pre-compilation process and a filtering process.

FIG. 10 is a schematic diagram of the keyword extraction process in the content filtering method according to Embodiment 9 of the present invention. Based on the screening policies, the first step divides (parses) the rule condition into fields; the second step filters the divided fields against the blacklist; the third step filters the fields according to the number of false hits; and the fourth step selects the keyword according to the policy of choosing the field whose group has the fewest rule conditions. Finally, msdn is selected as the keyword of the rule condition.

FIG. 11 is a schematic diagram of a filtering process performed in a content filtering method according to Embodiment 9 of the present invention, and FIG. 11 illustrates a rule condition pre-compilation phase and a rule condition matching filtering phase.

In the rule condition precompilation phase, the rule conditions entered are as follows:

1: www.huawei*.com

2: www[0-3].huawei.com

3: *google.com/news

4: www.sina[0-9].com

5: www.yahoo*.com/news

6: *.microsoft.*

7: www.msdn.microsoft*/news

8: www.[a-z][a-z][a-z].com.cn

According to the foregoing screening strategy, a keyword is selected for each rule condition, and, as shown in FIG. 11, the group matching data set is compiled as an AC state machine. The rule conditions are then grouped by keyword, as shown in FIG. 11: the first and second rule conditions form one group, the others are grouped by their respective keywords, and the 6th and 8th rule conditions, which have no keyword, are placed in the bad rule condition group. The exact matching data set of each group is pre-compiled with the algorithm chosen for that group.
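To illustrate what compiling the group matching data set as an AC (Aho-Corasick) state machine involves, the following is a compact textbook-style sketch. The function names and the list-based state layout are choices made for this illustration, not details of the embodiment.

```python
from collections import deque


def build_ac(keywords):
    """Build goto, failure, and output tables for a set of keywords."""
    goto, fail, out = [{}], [0], [set()]
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(kw)
    # Breadth-first pass: compute failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
    return goto, fail, out


def find_keywords(ac, text):
    """Return every keyword that occurs in `text`, in a single pass."""
    goto, fail, out = ac
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


group_matcher = build_ac(["huaw", "news", "sina", "yaho", "msdn"])
hits = find_keywords(group_matcher, "www.huawei.com/news")
```

The text is scanned once regardless of the number of keywords, which is why a multi-pattern automaton suits the pre-filtering stage.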

In the rule condition matching phase, the content to be filtered is obtained and sent to the content filtering module; the matching data sets have been configured in advance and are kept in memory by the compiling process. As shown in FIG. 11, the content to be filtered is the website address www.huawei.com/news. The content filtering module first uses the group matching data set to perform keyword matching, for example by running multi-pattern matching on the content in the AC state machine. This pre-filtering with the group matching data set yields the matched keyword huaw.

Then, the exact matching data set of the group corresponding to the keyword is used to check whether a rule condition can be matched; here the result is that the matching succeeds.

Then, the condition identifier of the matched rule condition can be used as a character and matched against the filter matching data set. The matching result is either success or failure, and the packet is processed according to the default release policy configured for the entire device. For example, the policy may be of two types, whitelist (release on a successful match) and blacklist (filter on a successful match), which determine whether the packet is sent to the policy implementation module for further processing.
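The two-stage matching of this example can be sketched end to end. Several points are assumptions of the sketch rather than statements of the embodiment: the regular expressions are hand-converted from the eight rule conditions, a plain substring test stands in for the AC state machine, exact matching is anchored only at the start of the content (`re.match`), and the bad rule condition group is consulted only when no keyword matched.

```python
import re

# keyword -> {rule number: regular expression}, per the grouping of FIG. 11.
GROUPS = {
    "huaw": {1: r"www\.huawei.*\.com", 2: r"www[0-3]\.huawei\.com"},
    "news": {3: r".*google\.com/news"},
    "sina": {4: r"www\.sina[0-9]\.com"},
    "yaho": {5: r"www\.yahoo.*\.com/news"},
    "msdn": {7: r"www\.msdn\.microsoft.*/news"},
}
BAD_GROUP = {6: r".*\.microsoft\..*", 8: r"www\.[a-z][a-z][a-z]\.com\.cn"}


def compile_group(rules):
    """Pre-compile one group's exact matching data set."""
    return {num: re.compile(rx) for num, rx in rules.items()}


EXACT = {kw: compile_group(rules) for kw, rules in GROUPS.items()}
EXACT_BAD = compile_group(BAD_GROUP)


def filter_content(content):
    """Return (matched rule numbers, falsely hit keywords) for `content`."""
    # Stage 1: group matching (substring test as a stand-in for AC).
    hit_keywords = [kw for kw in EXACT if kw in content]
    groups = [(kw, EXACT[kw]) for kw in hit_keywords] or [(None, EXACT_BAD)]
    matched, false_hits = [], []
    # Stage 2: exact matching, only inside the groups selected by stage 1.
    for kw, group in groups:
        nums = [n for n, rx in group.items() if rx.match(content)]
        matched.extend(nums)
        if kw is not None and not nums:
            false_hits.append(kw)
    return matched, false_hits


result = filter_content("www.huawei.com/news")
```

Under these assumptions the address www.huawei.com/news matches rule condition 1 through the huaw group. The sketch also reports news as a falsely hit keyword, because its substring test finds news in the address while no rule condition in that group matches; this is the kind of event that the false-hit statistics count.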

The content filtering solution provided by the embodiments of the present invention has several advantages and balances memory usage against matching performance. The solution supports complex rule conditions, such as regular expressions, and supports multi-dimensional content filtering and matching: not only URL addresses but also the content of any configurable header field can be filtered. Matching performance is improved by pre-filtering and by dynamically collecting falsely hit keywords. Keywords that degrade performance are collected dynamically and added to the blacklist, and the content filtering rule base is adjusted periodically, that is, the keyword extraction, grouping, and pre-compilation process is repeated periodically, so as to adapt to the target operating environment and reach the best performance balance.

The embodiment of the present invention further provides a computer system. As shown in FIG. 13, the computer system includes at least one processor 131 and a memory 132; the memory 132 is used to store instructions; the processor 131 is coupled to the memory 132, and the processor 131 is configured to execute the instructions stored in the memory 132 to perform the content filtering method provided by any of the embodiments of the present invention.

Specifically, the processor 131 can be configured to execute the instructions stored in the memory 132 to perform the following process:

Extract keywords from one or more rule conditions entered;

Get the content to be filtered;

Performing a filtering policy corresponding to the matching result according to the matching result of the exact matching.

In the above content filtering method flow, preferably, the processor 131 can be configured to execute the instructions stored in the memory 132, and further execute the following process:

Assign a unique condition identifier to each of the one or more rule conditions, and pre-compile a filter matching data set for the filtering rule, wherein the filtering rule is formed by combining the one or more rule conditions, and the condition identifiers of the one or more rule conditions are used as characters to express the filtering rule;

Performing a filtering policy corresponding to the matching result according to the matching result of the exact matching includes:

Using the filter matching data set to match the filtering rules against the condition identifiers, taken as characters, of the rule conditions that the content to be filtered exactly matched, where those rule conditions are obtained by performing exact matching of the rule conditions on the content to be filtered; and executing, according to the matching result of the filtering rules, the filtering policy corresponding to that result.

In the above content filtering method flow, preferably, the processor 131 can be configured to execute the instructions stored in the memory 132, and further execute the following process:

When a newly added rule condition is obtained, a keyword is extracted from the newly added rule condition; the corresponding group is searched for or created according to the keyword extracted from the newly added rule condition, and the group matching data set is recompiled;

Pre-compiling an exact matching data set of rule conditions of the corresponding group according to the newly added rule condition;

Assign a condition ID to the new rule condition and recompile the filter match data set.

In the above content filtering method flow, preferably, the processor 131 can be configured to execute the instructions stored in the memory 132, and further execute the following process:

Determining, according to an input rule condition deletion instruction, the rule condition to be deleted or the condition identifier corresponding to the rule condition to be deleted, and extracting a keyword from the rule condition to be deleted;

Updating the group matching data set according to the keyword extracted from the rule condition to be deleted; if the rule condition itself is to be deleted, recompiling the exact matching data set for the rule conditions of the group corresponding to that keyword, so as to delete the rule condition;

If the condition identifier corresponding to the rule to be deleted is to be deleted, the filter matching data set is recompiled to delete the condition identifier corresponding to the rule to be deleted.
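The addition and deletion flows above can be sketched with a small rule base in which only the affected group is recompiled. The class `RuleBase` and its attributes are hypothetical; the keyword is passed in directly instead of being extracted, a sorted tuple of keywords stands in for the compiled group matching data set, and recompiling the filter matching data set is omitted.

```python
import re


class RuleBase:
    """Hypothetical rule base that recompiles one group at a time."""

    def __init__(self):
        self.groups = {}    # keyword -> list of (condition id, regex source)
        self.exact = {}     # keyword -> compiled exact matching data set
        self.keywords = ()  # stand-in for the group matching data set
        self.next_id = 0

    def _recompile(self, keyword):
        self.exact[keyword] = [
            (cid, re.compile(rx)) for cid, rx in self.groups[keyword]]
        self.keywords = tuple(sorted(self.groups))

    def add_rule(self, keyword, regex):
        """Find or create the keyword's group and recompile only that group."""
        cid = self.next_id
        self.next_id += 1
        self.groups.setdefault(keyword, []).append((cid, regex))
        self._recompile(keyword)
        return cid

    def delete_rule(self, keyword, cid):
        """Delete one rule condition; drop the group if it becomes empty."""
        remaining = [(c, rx) for c, rx in self.groups[keyword] if c != cid]
        if remaining:
            self.groups[keyword] = remaining
            self._recompile(keyword)
        else:
            del self.groups[keyword]
            del self.exact[keyword]
            self.keywords = tuple(sorted(self.groups))


base = RuleBase()
first = base.add_rule("huaw", r"www\.huawei.*\.com")
second = base.add_rule("huaw", r"www[0-3]\.huawei\.com")
base.add_rule("sina", r"www\.sina[0-9]\.com")
sina_compiled = base.exact["sina"]
base.delete_rule("huaw", first)
```

After the deletion, the compiled data set of the sina group is the same object as before, which reflects the point that the other pre-compiled data sets need not be adjusted.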

In the above content filtering method flow, preferably, the processor 131 can be configured to execute the instructions stored in the memory 132, and extracting the keywords from the one or more input rule conditions specifically includes the following process:

For the input rule conditions, the fields are divided according to the preset division strategy;

The divided fields are filtered based on a preset screening policy to obtain keywords of the rule conditions.

In the above content filtering method flow, preferably, the processor 131 can be configured to execute the instructions stored in the memory 132, and filtering the divided fields based on a preset screening policy to obtain the keyword of the rule condition specifically includes the following process:

Deleting, from the divided fields, the fields that match fields in the blacklist; deleting, according to the recorded number of false hits of each field, the fields whose number of false hits is higher than the hit threshold;

In the above content filtering method flow, preferably, the processor 131 can be configured to execute the instructions stored in the memory 132, and, after the exact matching of the rule conditions is performed on the content to be filtered by using the exact matching data set of the rule conditions of the group corresponding to the matched keyword, the following process is further performed:

When content to be filtered that matched a keyword does not match the corresponding rule condition by using the exact matching data set, the number of false hits of the keyword is updated;

Keywords whose number of false hits is above the set threshold are added to the blacklist.

In the above content filtering method flow, preferably, the processor 131 can be configured to execute the instructions stored in the memory 132, and pre-compiling the exact matching data set for the rule conditions of the group corresponding to each extracted keyword specifically includes the following process:

For a group in which the number of rule conditions is less than the pre-configured threshold, pre-compiling the exact matching data set for the group's rule conditions using a non-deterministic finite automaton, deterministic finite automaton, or compressed deterministic finite automaton regular expression matching algorithm, or using a single-pattern string matching algorithm;

For a group in which the number of rule conditions is equal to or greater than the pre-configured threshold, pre-compiling the exact matching data set for the group's rule conditions using a deterministic finite automaton or compressed deterministic finite automaton regular expression matching algorithm;

For a group that includes rule conditions with set complex definition parameters, pre-compiling the exact matching data set for the group's rule conditions using a non-deterministic finite automaton or compressed deterministic finite automaton regular expression matching algorithm.

In the above content filtering method flow, preferably, the processor 131 can be configured to execute the instructions stored in the memory 132, and obtaining the content to be filtered specifically includes the following process:

Performing protocol identification on the received data packet by using a deep packet identification technology;

Performing field parsing on the data packet to obtain at least one preset field, and using each preset field as the content to be filtered, respectively, to perform subsequent group matching, exact matching, and filtering matching operations, respectively. The filtering rule is a combination of one or more rule conditions, and the filtering rule is composed of one or more rule conditions corresponding to one or more preset fields.

In the above content filtering method flow, preferably, the processor 131 can be configured to execute the instructions stored in the memory 132, and further performs the following process:

In the above content filtering method flow, preferably, the processor 131 can be configured to execute the instructions stored in the memory 132, and after keyword matching is performed on the content to be filtered using the group matching data set, the following process is also performed:

When the content to be filtered does not match any keyword, the exact matching data set corresponding to the rule conditions of the group to be prompted is used to perform exact matching of the rule conditions on the content to be filtered that did not match a keyword.
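The two-stage matching with this fallback can be sketched as follows. The sketch is an assumption-laden illustration: `precompile`, `match`, and `PROMPT_GROUP` are hypothetical names, a plain substring test stands in for the group matching data set, and Python's `re` module stands in for the per-group exact matching data sets.

```python
import re
from collections import defaultdict

# Hypothetical key for the group to be prompted (rule conditions without a keyword).
PROMPT_GROUP = None


def precompile(rule_keywords: dict) -> dict:
    """Group rule conditions by keyword and compile each group.

    rule_keywords maps a regular expression rule condition to its extracted
    keyword, or to None when no keyword could be extracted.
    """
    groups = defaultdict(list)
    for pattern, keyword in rule_keywords.items():
        groups[keyword].append(re.compile(pattern))
    return dict(groups)


def match(content: str, groups: dict) -> list:
    """Return the rule conditions that the content exactly matches."""
    # Stage 1: keyword matching (stand-in for the group matching data set).
    hit = [k for k in groups if k is not PROMPT_GROUP and k in content]
    # Fallback: no keyword matched, so only the group to be prompted is tried.
    candidates = hit if hit else [PROMPT_GROUP]
    # Stage 2: exact matching, restricted to the selected groups.
    matched = []
    for keyword in candidates:
        for rx in groups.get(keyword, []):
            if rx.search(content):
                matched.append(rx.pattern)
    return matched
```

Because only the groups selected in the first stage are evaluated, content that hits no keyword costs one keyword scan plus the rule conditions of the group to be prompted, which is why keeping that group small matters.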

In the above content filtering method flow, preferably, the processor 131 can be configured to execute the instructions stored in the memory 132, and extracting the keywords from the input one or more rule conditions specifically includes the following process:

The keywords are extracted, according to a set period, from the one or more rule conditions that have been entered.

The embodiment of the present invention further provides a computer system. As shown in FIG. 14, the computer system includes a processor 141, a memory 142, and a matching filter 143. The memory 142 is used to store instructions; the matching filter 143 is configured with each data set, such as the group matching data set, the exact matching data sets, and the filter matching data set; the processor 141 is coupled to the memory 142 and the matching filter 143. The processor 141 is configured to execute the instructions stored in the memory 142 to perform the pre-compilation process in the content filtering method provided by the embodiment of the present invention, and the processor 141 is further configured to invoke the matching filter 143 to perform the content filtering process in the content filtering method provided by the embodiment of the present invention.

Preferably, the matching filter can be implemented by hardware, or by a combination of hardware and software. For example, it can be a Field Programmable Gate Array (FPGA). Specifically, the internal memory of the FPGA chip or an external memory stores the various data sets, such as the group matching data set, the exact matching data set of each group, and the filter matching data set, and the matching logic of each matching unit is also implemented by the FPGA chip. Using the various data sets, the FPGA performs content matching on the application protocol data, and outputs the keyword matching result to the exact matching data set, or outputs the exact matching result to the corresponding filtering policy. Optionally, the protocol identification and field parsing operations that precede content filtering can also be implemented by the FPGA.

The computer system provided by the foregoing embodiment of the present invention can be configured as various network elements that apply content filtering technologies, such as an enterprise router, a Gateway GPRS Support Node (GGSN) network element device, an Internet gateway device, and a wireless controller device.

In the process of the processor executing the instructions of the memory and calling the matching filter, in particular, the processor can be configured to execute the instructions in the memory to:

Extract keywords from one or more rule conditions entered;

The processor may be further configured to invoke the matching filter to perform the following operations: acquiring the content to be filtered;

Executing a filtering policy corresponding to the matching result according to the matching result of the exact matching. Optionally, the processor is further configured to execute instructions in the memory to implement the following operations:

Assigning a unique condition identifier to each of the one or more rule conditions, and pre-compiling the filter matching data set for the filtering rule, where the filtering rule is formed by combining one or more rule conditions, and the filtering rule is expressed by using the condition identifiers of the one or more rule conditions as characters;

The processor may be further configured to invoke the matching filter so that executing a filtering policy corresponding to the matching result according to the matching result of the exact matching includes: using the filter matching data set, taking the condition identifier of the rule condition exactly matched by the content to be filtered as a character, and performing matching of the filtering rule on the character, where the rule condition exactly matched by the content to be filtered is obtained by performing exact matching of the rule conditions on the content to be filtered;

Executing a filtering policy corresponding to the matching result according to the matching result of the filtering rule. Optionally, the processor can be further configured to execute instructions in the memory and also perform the following operations:

Updating the group matching data set according to the keyword extracted from the rule condition to be deleted; if the rule condition to be deleted needs to be deleted, recompiling the exact matching data set for the rule conditions of the group corresponding to the keyword extracted from the rule condition to be deleted, so as to delete the rule condition to be deleted;

If the condition identifier corresponding to the rule condition to be deleted needs to be deleted, recompiling the filter matching data set to delete the condition identifier corresponding to the rule condition to be deleted.

The filter matching data set is recompiled according to the newly added filtering rule or filtering rule deletion instruction to add or delete a filtering rule.
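The use of condition identifiers as characters, together with adding and deleting filtering rules, can be sketched as follows. The class `FilterRuleSet` and its methods are hypothetical, the identifiers are single Unicode characters starting at an arbitrary code point, and treating a filtering rule as the conjunction of its rule conditions is an assumption, since a filtering rule may combine rule conditions in other ways.

```python
class FilterRuleSet:
    """Toy filter matching data set keyed by single-character condition identifiers."""

    def __init__(self):
        self._ids = {}      # rule condition -> single-character identifier
        self._rules = {}    # filtering rule name -> string of identifiers
        self._next = 0x100  # hypothetical first code point used for identifiers

    def condition_id(self, condition: str) -> str:
        """Assign a unique identifier to a rule condition on first use."""
        if condition not in self._ids:
            self._ids[condition] = chr(self._next)
            self._next += 1
        return self._ids[condition]

    def add_rule(self, name: str, conditions: list) -> None:
        """Express a filtering rule as a string of condition identifiers."""
        ids = sorted(self.condition_id(c) for c in conditions)
        self._rules[name] = "".join(ids)

    def delete_rule(self, name: str) -> None:
        """Remove a filtering rule from the data set."""
        self._rules.pop(name, None)

    def match(self, matched_conditions: list) -> list:
        """Return the filtering rules whose conditions were all exactly matched."""
        seen = {self._ids[c] for c in matched_conditions if c in self._ids}
        return sorted(n for n, chars in self._rules.items() if set(chars) <= seen)
```

Here adding or deleting a filtering rule only updates a dictionary; in a compiled implementation the same operations would trigger recompilation of the filter matching data set.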

Optionally, the processor is configurable to execute instructions in the memory to implement the following operations, respectively: extracting keywords from the input one or more rule conditions includes:

Filtering the divided fields based on a preset filtering policy to obtain the keywords of the rule conditions includes:

Optionally, the processor is configured to execute the instructions in the memory so that, after exact matching of the rule conditions is performed on the content to be filtered using the exact matching data set of the rule conditions of the group corresponding to the matched keyword, the following is further included:

Optionally, the processor is configured to execute the instructions in the memory so that precompiling the exact matching data set for the rule conditions of the group corresponding to each of the extracted keywords includes:

For a group whose number of rule conditions is less than the pre-configured threshold, precompiling the exact matching data set for the rule conditions of the group using an NFA, DFA, or compressed DFA regular expression matching algorithm, or using a single-pattern string matching algorithm;

For a group whose number of rule conditions is equal to or greater than the pre-configured threshold, precompiling the exact matching data set for the rule conditions of the group using a DFA or compressed DFA regular expression matching algorithm;

For a group that includes rule conditions with set complex definition parameters, precompiling the exact matching data set for the rule conditions of the group using an NFA or compressed DFA regular expression matching algorithm.

Optionally, the processor can be further configured to execute an instruction in the memory or to call a matching filter to:

Obtaining the content to be filtered includes:

Performing protocol identification on the received data packet using deep packet identification technology;

Optionally, the processor is further configurable to execute instructions in the memory to:

Optionally, the processor is further configured to call the matching filter to perform the following operation: after keyword matching is performed on the content to be filtered using the group matching data set, when the content to be filtered does not match any keyword, using the exact matching data set corresponding to the rule conditions of the group to be prompted to perform exact matching of the rule conditions on the content to be filtered that did not match a keyword.

Optionally, the processor is configured to execute instructions in the memory so that extracting the keywords from the input one or more rule conditions includes: extracting keywords, according to a set period, from the one or more rule conditions that have been entered.

It will be understood by those skilled in the art that all or part of the steps of the above method embodiments may be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer readable storage medium. When the program is executed, the steps of the foregoing method embodiments are performed; and the foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

It should be noted that the above embodiments are merely illustrative of the technical solutions of the present invention, and are not intended to be limiting; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A content filtering method, characterized by including:

Extract keywords from one or more entered rule conditions;

Divide the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keywords, and precompile the group matching data set for the extracted keywords;

Precompile an exact matching data set for the rule conditions corresponding to the grouping of each keyword in the extracted keywords;

Get content to be filtered;

Using the group matching data set, perform keyword matching on the content to be filtered to obtain the matched keywords;

Utilize the exact matching data set of the rule conditions of the group corresponding to the matched keywords to perform an exact match of the rule conditions on the content to be filtered;

Execute a filtering policy corresponding to the matching result according to the exact matching result.

2. The content filtering method according to claim 1, further comprising: assigning a unique condition identifier to each of the one or more rule conditions, and precompiling a filter matching data set for the filtering rule, wherein the filtering rule is composed of the one or more rule conditions, and the condition identifiers of the one or more rule conditions are used as characters to express the filtering rule;

In this case, executing the filtering policy corresponding to the matching result according to the matching result of the exact matching includes:

Using the filter matching data set, taking the condition identifier of the rule condition exactly matched by the content to be filtered as a character, and performing matching of the filtering rule on the character, wherein the rule condition exactly matched by the content to be filtered is obtained by performing exact matching of the rule conditions on the content to be filtered; and executing the filtering policy corresponding to the matching result according to the matching result of the filtering rule.

3. The content filtering method according to claim 2, further comprising: when a new rule condition is obtained, extracting a keyword from the new rule condition; finding or creating a corresponding group for the new rule condition according to the keyword extracted from the new rule condition, and recompiling the group matching data set;

Precompile the exact matching data set of the rule conditions of the corresponding group based on the newly added rule condition;

Assign a condition identifier to the newly added rule condition, and recompile the filter matching data set.

4. The content filtering method according to claim 2, further comprising: according to an input rule condition deletion instruction, determining the rule condition to be deleted or the condition identifier corresponding to the rule condition to be deleted, and extracting a keyword from the rule condition to be deleted;

Update the group matching data set according to the keyword extracted from the rule condition to be deleted; if the rule condition to be deleted needs to be deleted, recompile the exact matching data set for the rule conditions of the group corresponding to the keyword extracted from the rule condition to be deleted, so as to delete the rule condition to be deleted;

If the condition identifier corresponding to the rule condition to be deleted needs to be deleted, the filter matching data set is recompiled to delete the condition identifier corresponding to the rule condition to be deleted.

5. The content filtering method according to any one of claims 1 to 4, characterized in that: extracting keywords from one or more input rule conditions includes:

The divided fields are filtered based on the preset filtering strategy to obtain the keywords of the rule conditions.

6. The content filtering method according to claim 5, characterized in that filtering the divided fields based on a preset filtering strategy to obtain the keywords of the rule conditions includes: deleting, from the divided fields, the fields that are consistent with fields in the blacklist; and deleting, according to the recorded number of false hits of each field, the fields whose number of false hits is higher than the hit threshold;

For each rule condition, select, among the fields of the rule condition remaining after filtering, the field whose corresponding group contains the smallest number of rule conditions as the keyword for the rule condition.

7. The content filtering method according to claim 6, characterized in that, after the exact matching data set of the rule conditions of the group corresponding to the matched keyword is used to perform exact matching of the rule conditions on the content to be filtered that matched the keyword, the method further includes:

When the content to be filtered that matches a keyword does not match the corresponding rule condition using the exact matching data set, update the record of the number of false hits for the keyword;

Add keywords whose number of false hits is higher than the set threshold to the blacklist.

8. The content filtering method according to any one of claims 1 to 4, characterized in that precompiling the exact matching data set for the rule conditions of the group corresponding to each of the extracted keywords includes:

For a group whose number of rule conditions is less than the preconfigured threshold, a nondeterministic finite state automaton, a deterministic finite state automaton, or a compressed deterministic finite state automaton regular expression matching algorithm is used to precompile the exact matching data set for the group of rule conditions, or a single-pattern string matching algorithm is used to precompile the exact matching data set;

For a group whose number of rule conditions is equal to or greater than the preconfigured threshold value, the deterministic finite state automaton or compressed deterministic finite state automaton regular expression matching algorithm is used to precompile the exact matching data set for this group of rule conditions;

For groups that include rule conditions with complex definition parameters, a nondeterministic finite state automaton or a compressed deterministic finite state automaton regular expression matching algorithm is used to precompile an exact matching data set for the group of rule conditions.

9. The content filtering method according to any one of claims 2 to 4, characterized in that said obtaining the content to be filtered includes:

Use deep packet identification technology to perform protocol identification on received data packets;

Based on the identified protocol, field parsing is performed on the data packet to obtain at least one preset field, and each preset field is used as content to be filtered in order to perform subsequent group matching, exact matching and filter matching operations respectively, wherein the filtering rule is composed of one or more rule conditions corresponding to one or more preset fields.

10. The content filtering method according to any one of claims 1 to 4, further comprising:

When it is recognized that no keyword can be extracted from an input rule condition, the rule condition is put into the group to be prompted, an exact matching data set is pre-compiled for the rule conditions of the group to be prompted, and a bad rule condition prompt is issued to the user.

11. The content filtering method according to claim 10, characterized in that, after using the group matching data set to perform keyword matching on the content to be filtered, the method further includes: when the content to be filtered does not match any keyword, using the exact matching data set corresponding to the rule conditions of the group to be prompted to perform exact matching of the rule conditions on the content to be filtered that did not match a keyword.

12. The content filtering method according to any one of claims 1 to 4, characterized in that extracting keywords from one or more input rule conditions includes:

According to the set period, keywords are extracted from one or more entered rule conditions.

13. A content filtering device, characterized by including a content acquisition module, a content filtering module and a policy implementation module, wherein,

The content acquisition module is used to acquire content to be filtered;

The content filtering module includes:

The keyword extraction unit is used to extract keywords from one or more input rule conditions;

A group compilation unit, used to divide the one or more rule conditions into one or more groups according to the extracted keywords, so that the rule conditions in the same group have the same keywords, and to precompile the group matching data set for the extracted keywords;

A rule condition compilation unit, configured to pre-compile an exact matching data set for rule conditions corresponding to groups of each keyword in the extracted keywords;

A group matching unit, configured to use the group matching data set to perform keyword matching on the content to be filtered to obtain the matched keywords;

The rule condition matching unit is used to use the exact matching data set of the rule conditions of the corresponding group of the matched keywords to accurately match the rule conditions for the content to be filtered;

The policy implementation module is configured to execute a filtering policy corresponding to the matching result according to the exact matching result.

14. The content filtering device according to claim 13, characterized in that:

The content filtering module also includes: a filtering rule compilation unit, used to assign a unique condition identifier to each of the one or more rule conditions, and to pre-compile the filter matching data set for the filtering rule, wherein the filtering rule consists of a combination of one or more rule conditions, and the condition identifiers of the one or more rule conditions are used as characters to express the filtering rule;

The policy implementation module includes:

The filtering rule matching unit is used to take, using the filter matching data set, the condition identifier of the rule condition exactly matched by the content to be filtered as a character, and to perform matching of the filtering rule on the character, wherein the rule condition exactly matched by the content to be filtered is obtained by performing exact matching of the rule conditions on the content to be filtered;

A policy implementation unit, configured to execute a filtering policy corresponding to the matching result according to the matching result of the filtering rule.

15. The content filtering device according to claim 13 or 14, characterized in that the rule condition compilation unit is also used to, when it is recognized that no keyword can be extracted from an input rule condition, put the rule condition into the group to be prompted, precompile an exact matching data set for the rule conditions of the group to be prompted, and issue a bad rule condition prompt to the user.

16. The content filtering device according to claim 15, wherein the rule condition matching unit is also configured to, when the content to be filtered does not match any keyword, use the exact matching data set corresponding to the rule conditions of the group to be prompted to perform exact matching of the rule conditions on the content to be filtered that did not match a keyword.

17. The content filtering device according to claim 13 or 14, characterized in that the keyword extraction unit includes:

The field division subunit is used to divide the input rule conditions into fields according to the preset division strategy;

The field filtering subunit is used to filter the divided fields based on the preset filtering strategy to obtain the keywords of the rule conditions.

18. The content filtering device according to claim 17, wherein the field filtering subunit is specifically used for:

From the divided fields, delete fields that are consistent with the fields in the blacklist; delete fields whose number of false hits is higher than the hit threshold according to the recorded number of field false hits;

19. The content filtering device according to claim 18, wherein the content filtering module further includes a statistics update unit, and the statistics update unit includes:

The false hit count subunit is used to update the record of the false hit count of the keyword when the content to be filtered that matches the keyword does not match the corresponding rule condition using the exact matching data set;

The blacklist update subunit is used to add keywords whose number of false hits is higher than the set threshold to the blacklist.

20. The content filtering device according to claim 13 or 14, characterized in that the rule condition compilation unit includes:

The first compilation subunit is used to, for a group whose number of rule conditions is less than the preconfigured threshold, precompile the exact matching data set for the group of rule conditions using a non-deterministic finite state automaton, a deterministic finite state automaton, or a compressed deterministic finite state automaton regular expression matching algorithm, or precompile the exact matching data set using a single-pattern string matching algorithm;

The second compilation subunit is used to, for a group whose number of rule conditions is equal to or greater than the preconfigured threshold, precompile the exact matching data set for the group of rule conditions using a deterministic finite state automaton or a compressed deterministic finite state automaton regular expression matching algorithm;

The third compilation subunit is used to, for a group that includes rule conditions with complex definition parameters, precompile the exact matching data set for the group of rule conditions using a non-deterministic finite state automaton or a compressed deterministic finite state automaton regular expression matching algorithm.

21. The content filtering device according to claim 13 or 14, characterized in that the content acquisition module includes:

The protocol identification unit is used to perform protocol identification on the received data packets using deep packet identification technology;

The protocol parsing unit is used to perform field parsing on the data packet based on the identified protocol to obtain at least one preset field, and to use each preset field as content to be filtered, so as to perform the subsequent group matching, exact matching and filter matching operations respectively, wherein the filtering rule is composed of one or more rule conditions corresponding to one or more preset fields.