Embodiment
In the keyword filtering technique, each keyword configured in the filtering device consists of at least two key words; for example, one key word may be a website URL and another may be an identifier of an illegal video. If the URL of the webpage to be visited, extracted from the user's network access request, does not match all key words of a keyword, the network access request is not subjected to filtering such as blocking or discarding. For example, if the URL of the webpage to be visited matches only the website URL in a keyword and not the illegal-video identifier in that keyword, the keyword is considered unmatched and the network access request is not filtered. This avoids the situation in which a user cannot access the normal webpages of a website.
However, how to match all the keywords in the keyword table efficiently remains a problem to be solved.
The main implementation principle of the technical solution of the embodiments of the present invention, its specific implementations, and the beneficial effects it can achieve are explained in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the main implementation flow of the embodiment of the present invention is as follows:
Step 10: match the URL of the webpage to be visited against the key words in the malicious keyword table and record the matching result, which includes the matched key words and their corresponding serial numbers.
The malicious keyword table sequentially stores the key words contained in each keyword, together with the serial number of each key word within the keyword to which it belongs; each keyword contains at least two key words.
Step 20: select benchmark key words from the matching result; the serial number of a benchmark key word identifies it as the first key word of the keyword to which it belongs.
For each selected benchmark key word, perform the following steps 30 to 50:
Step 30: obtain from the matching result the group of key words that starts with the benchmark key word, in which the serial number of each key word is consecutive with that of the previous key word and their storage locations are adjacent.
Step 40: determine whether the number of key words in this group equals the number of key words contained in the keyword to which the benchmark key word belongs; if the numbers are equal, proceed to step 50; otherwise, process the next benchmark key word.
Step 50: confirm that the URL of the webpage to be visited contains a keyword in the malicious keyword table and that the webpage is a malicious webpage.
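Steps 20 to 50 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: the table contents, the Entry type, and the function names are hypothetical, and the matching result is assumed to be given as the storage positions of the matched key words.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    key_word: str  # one component of a keyword
    serial: int    # serial number inside its keyword (0 = first key word)
    keyword: str   # the keyword this key word belongs to

# Hypothetical table; the list index plays the role of the storage position.
TABLE = [
    Entry("bad", 0, "bad.example"),
    Entry("example", 1, "bad.example"),
    Entry("x", 0, "x.y.z"),
    Entry("y", 1, "x.y.z"),
    Entry("z", 2, "x.y.z"),
]

def keyword_size(table, keyword):
    """Number of key words that the given keyword contains."""
    return sum(1 for entry in table if entry.keyword == keyword)

def matched_keywords(table, matched_positions):
    """Return the keywords fully covered by the matched storage positions."""
    positions = sorted(set(matched_positions))
    found = []
    for i, pos in enumerate(positions):
        if table[pos].serial != 0:       # step 20: benchmark key words only
            continue
        group = [pos]                    # step 30: grow the group
        for nxt in positions[i + 1:]:
            prev = group[-1]
            adjacent = nxt == prev + 1
            consecutive = table[nxt].serial == table[prev].serial + 1
            if adjacent and consecutive:
                group.append(nxt)
            else:
                break
        # step 40: compare the group size with the size of the keyword
        if len(group) == keyword_size(table, table[pos].keyword):
            found.append(table[pos].keyword)  # step 50
    return found

print(matched_keywords(TABLE, [0, 1, 2, 4]))  # ['bad.example']
```

In the sample call, the group starting at "x" stops after one key word because position 4 is not adjacent to position 2, so only "bad.example" is reported.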
Optionally, in order to facilitate updating the malicious keyword table or to support sociological research, the probability with which each keyword in the malicious keyword table is matched needs to be counted, in particular to obtain targeted statistics; or, for the purpose of classifying webpages, the keyword that best matches a malicious webpage needs to be obtained. However, while steps 30 to 50 are executed in a loop, the URL of the webpage to be visited may turn out to contain at least two keywords in the malicious keyword table. Therefore, after step 50 is performed, the method further includes:
recording the keyword to which the benchmark key word belongs, thereby obtaining all keywords contained in the URL of the webpage to be visited;
selecting, from all keywords contained in the URL of the webpage to be visited, the keyword that contains the largest number of key words as the keyword that best matches the URL of the webpage to be visited. For example, if the URL of the webpage to be visited contains two keywords, keyword A with 3 key words and keyword B with 2 key words, then keyword A is the keyword that best matches the URL of the webpage to be visited.
Further, when the keyword containing the largest number of key words is selected as the best match, at least two keywords may contain the same number of key words, and that number may be the largest among all keywords contained in the URL of the webpage to be visited. The selection is then performed as follows:
If exactly one keyword contains the largest number of key words, that keyword is taken as the keyword that best matches the URL of the webpage to be visited;
If at least two keywords contain the largest number of key words, then, in order of increasing distance from the start position of the URL of the webpage to be visited, the keyword closest to the start position is selected from among them as the keyword that best matches the URL of the webpage to be visited.
For example, the URL of the webpage to be visited contains 3 keywords, keyword A, keyword B and keyword C, where keyword A and keyword C each contain 3 key words and keyword B contains 2 key words. If the URL of the webpage to be visited is "... (irrelevant words) + keyword A + ...... (irrelevant words) + keyword B + keyword C", then keyword A is taken as the keyword that best matches the URL of the webpage to be visited.
The purpose of selecting in order of increasing distance from the start position of the URL of the webpage to be visited is to give every keyword an equal probability of being selected among the keywords that contain the same number of key words. Therefore, other selection modes may also be used, for example random selection or selection in order from far to near; they are not enumerated here one by one.
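The selection rule above (largest number of key words first, then nearest to the start of the URL) can be expressed compactly. The following Python sketch assumes that each confirmed keyword is recorded together with its key word count and the character offset at which it begins in the URL; the Hit type, the offsets, and the function name are hypothetical.

```python
from typing import NamedTuple, Optional

class Hit(NamedTuple):
    keyword: str
    key_word_count: int  # number of key words the keyword contains
    offset: int          # assumed: where the keyword starts in the URL

def select_best(hits: list) -> Optional[str]:
    """Pick the keyword with the most key words; break ties by nearest offset."""
    if not hits:
        return None
    best = min(hits, key=lambda h: (-h.key_word_count, h.offset))
    return best.keyword

hits = [Hit("A", 3, 10), Hit("B", 2, 30), Hit("C", 3, 40)]
print(select_best(hits))  # A
```

Swapping the tie-break for random selection or for the farthest offset only requires changing the second element of the sort key.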
An embodiment is described in detail below to explain and illustrate the main implementation principle of the method of the present invention, in accordance with the inventive principle described above.
The malicious keyword table configured in advance in the filtering device sequentially stores the key words contained in each keyword and the serial number of each key word within the keyword to which it belongs. The filtering device may be a device such as an access gateway or a router.
The malicious keyword table may be generated as follows: keywords entered by the operator's administrators are received and sorted according to a predefined rule such as input order or number of characters; after the sorted keywords are stored, each keyword is divided into at least two key words, either by dictionary-based word segmentation or manually; the serial number of each key word within its keyword is recorded, and the storage sequence number of each key word is stored.
Fig. 2 shows an example of the malicious keyword table provided in this embodiment. There are 3 keywords in this embodiment: keyword A is "baidu.com", which contains two key words, "baidu" and "com"; keyword B is "123.google.com.456", which contains four key words, "123", "google", "com" and "456"; keyword C is "ns*df.ad_gd", which contains two key words, "ns*df" and "ad_gd". The serial number of each key word within its keyword and the storage sequence number of each key word are shown in Fig. 2.
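The table described above can be pictured as a flat list in which the list position is the storage sequence number. The Python sketch below builds such a list from keywords that have already been sorted and split; the function name and the tuple layout are assumptions made for illustration, and the storage sequence numbers of "ns*df" and "ad_gd" are inferred rather than stated in the text.

```python
def build_table(split_keywords):
    """Build rows of (storage number, key word, serial number, keyword).

    split_keywords: list of lists, each inner list holding the key words
    of one keyword in order.
    """
    table = []
    for key_words in split_keywords:
        keyword = ".".join(key_words)  # assumed: key words were split at dots
        for serial, key_word in enumerate(key_words):
            table.append((len(table), key_word, serial, keyword))
    return table

TABLE = build_table([
    ["baidu", "com"],                 # keyword A
    ["123", "google", "com", "456"],  # keyword B
    ["ns*df", "ad_gd"],               # keyword C
])

for row in TABLE:
    print(row)
```

Note that "com" appears twice, at storage sequence numbers 1 and 4, with serial numbers 1 and 2 respectively; this is why a matched key word can carry more than one storage sequence number.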
Fig. 3 is a flowchart of the malicious webpage identification method provided by the embodiment of the present invention; the method performs identification based on the malicious keyword table shown in Fig. 2.
Step 301: obtain the user's network access request.
The network access request may be a HyperText Transfer Protocol (HTTP) request, a Wireless Application Protocol (WAP) request, or the like.
The network access request may be obtained as follows: the packets flowing through a router or switch are intercepted, and network access requests are identified by parsing the protocols layer by layer; the application-layer packet header of a network access request carries a predefined signature.
Step 302: extract the URL of the webpage to be visited from the network access request.
For example, protocol parsing is performed on the HTTP request, and the URL of the webpage to be visited is obtained from the payload of the packet. In this embodiment, the URL of the webpage to be visited is
www.baidu.com?/v=123456
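As a rough illustration of step 302, the following Python sketch rebuilds a URL from a raw HTTP request by joining the Host header with the request target. The function name extract_url and the sample request are hypothetical, and the sketch assumes an already reassembled request whose target is in origin form (a path, not an absolute URL); the embodiment does not specify how the parsing is done.

```python
def extract_url(raw_request: bytes) -> str:
    """Return host + request target from a raw HTTP/1.x request head."""
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    # Request line: METHOD SP request-target SP HTTP-version
    _method, target, _version = lines[0].split(" ", 2)
    host = ""
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip()
            break
    return host + target

sample = b"GET /?v=123456 HTTP/1.1\r\nHost: www.baidu.com\r\n\r\n"
print(extract_url(sample))  # www.baidu.com/?v=123456
```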
Step 303: match the URL of the webpage to be visited against the key words in the malicious keyword table shown in Fig. 2, and record the matching result.
The matching result includes the matched key words and their corresponding serial numbers; the malicious keyword table sequentially stores the key words contained in each keyword and the serial number of each key word within the keyword to which it belongs. The matching may use the SBOM (Set Backward Oracle Matching) algorithm, a K-trie matching algorithm, or an AC state machine. Steps 3031 to 3035 shown in Fig. 4 briefly describe the matching process using the AC state machine.
The AC (Aho-Corasick) state machine algorithm is a multi-pattern matching algorithm that introduces the concept of a finite state machine. A finite state machine is a representation of all possible states of a system and of the state transitions that the system accepts. Processing starts from the initial state; each time an input event is accepted, the machine moves from the current state to the appropriate next state according to that event. The algorithm defines three functions, the goto function, the failure function and the output function, and the matching result is obtained through these three functions. The pattern matching process thus becomes a process of state transitions.
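For orientation, the goto, failure and output functions can be seen in the following Python sketch of the standard character-level Aho-Corasick automaton. It is the textbook construction, not necessarily the word-level state construction of steps 3031 to 3034, and the function names are hypothetical. The patterns are the key words of the example table in storage order, so a pattern index doubles as a storage sequence number.

```python
from collections import deque

def build_automaton(patterns):
    goto = [{}]    # goto[state][char] gives the next state
    fail = [0]     # failure function: state to fall back to on a mismatch
    output = [[]]  # output[state]: indices of the patterns ending here
    for idx, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(idx)
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:                     # breadth-first, so fallbacks are ready
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output

def search(text, patterns):
    """Return (start offset, pattern index) for every occurrence in text."""
    goto, fail, output = build_automaton(patterns)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in output[state]:
            hits.append((pos - len(patterns[idx]) + 1, idx))
    return hits

KEY_WORDS = ["baidu", "com", "123", "google", "com", "456", "ns*df", "ad_gd"]
hits = search("www.baidu.com?/v=123456", KEY_WORDS)
print(sorted({idx for _, idx in hits}))  # [0, 1, 2, 4, 5]
```

Because "com" is stored twice, a single occurrence of "com" in the URL reports both storage sequence numbers 1 and 4.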
Step 3031: construct all possible states of the AC state machine from the key words in the malicious keyword table shown in Fig. 2. The initial state is the first key word in the malicious keyword table, and the other possible states of the AC state machine are combinations of the first key word with at least one subsequent key word of the table, taken in order (not necessarily consecutively). For example, the initial state is baidu (0), and the possible states include state S1: baidu (0)/com (1), baidu (0)/com (1)/123 (2), ..., and state Sn: baidu (0)/123 (2)/google (3), and so on.
Step 3032: obtain the first word in the URL to be visited that is identical to the initial state of the AC state machine.
In this embodiment, the first word in the URL of the webpage to be visited, www.baidu.com?/v=123456, that is identical to a key word in the malicious keyword table is baidu.
Step 3033: input into the AC state machine, in sequence, each word that follows the word identical to the initial state in the URL to be visited. Each time a word is input, the AC state machine moves to the next state, that is, one transition occurs. When the last transition has occurred, record each key word in the key word combination corresponding to the current state, together with the storage sequence number of each key word.
In this embodiment, after the last transition, the key words in the key word combination corresponding to the current state and their storage sequence numbers are: baidu (0), com (1, 4), 123 (2), 456 (5).
Step 3034: sort the key words in the key word combination corresponding to the current state by storage sequence number, and obtain the matching result by looking up, in the malicious keyword table, the correspondence between the storage sequence number and the serial number of each key word; that is, the sorted key words together with their serial numbers form the matching result.
For example, sorted by storage sequence number, the result is baidu (0)/com (1)/123 (2)/com (4)/456 (5). Looking up the malicious keyword table shown in Fig. 2 gives the matching result, written as key word (serial number): baidu (0)/com (1)/123 (0)/com (2)/456 (3).
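The lookup in step 3034 amounts to sorting the recorded storage sequence numbers and translating each one into a (key word, serial number) pair. A minimal Python sketch follows; the dictionary is a hand-written stand-in for the table of Fig. 2 (the entries for storage sequence numbers 6 and 7 are inferred), and the function name is hypothetical.

```python
# storage sequence number -> (key word, serial number within its keyword)
BY_STORAGE = {
    0: ("baidu", 0), 1: ("com", 1),
    2: ("123", 0), 3: ("google", 1), 4: ("com", 2), 5: ("456", 3),
    6: ("ns*df", 0), 7: ("ad_gd", 1),
}

def to_matching_result(storage_numbers):
    """Sort by storage sequence number, then look up the serial numbers."""
    return [BY_STORAGE[s] for s in sorted(storage_numbers)]

# baidu (0), com (1, 4), 123 (2), 456 (5) as recorded in step 3033
result = to_matching_result([0, 1, 4, 2, 5])
print("/".join(f"{word} ({serial})" for word, serial in result))
# baidu (0)/com (1)/123 (0)/com (2)/456 (3)
```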
Step 304: select benchmark key words from the matching result; the serial number of a benchmark key word identifies it as the first key word of the keyword to which it belongs.
Step 305: determine whether the number of selected benchmark key words is 0. If there is at least one benchmark key word, perform the following steps 306 to 309 for each selected benchmark key word; if the number of benchmark key words is 0, that is, the matching result contains no benchmark key word, proceed to step 310.
In this embodiment, the key words whose serial number is 0 are selected as benchmark key words; the benchmark key words are baidu and 123.
Step 306: obtain from the matching result the group of key words that starts with the benchmark key word, in which the serial number of each key word is consecutive with that of the previous key word.
The group of key words in the matching result that starts with benchmark key word baidu is baidu (0)/com (1).
Because the serial number of 123 (0) is not consecutive with that of the next key word in the matching result, com (2), the group of key words that starts with benchmark key word 123 is 123 (0).
Step 307: determine whether the number of key words in this group equals the number of key words contained in the keyword to which the benchmark key word belongs; if so, proceed to step 308; if not, return to step 305 and process the next benchmark key word.
In this embodiment, the group of key words starting with benchmark key word baidu (0), namely baidu (0)/com (1), contains 2 key words. Referring to Fig. 2, the keyword baidu.com, to which baidu (0) belongs, contains 2 key words. The number of key words in the group starting with benchmark key word baidu therefore equals the number of key words contained in the keyword to which baidu (0) belongs.
In this embodiment, the group of key words starting with benchmark key word 123 (0), namely 123 (0), contains 1 key word. Referring to Fig. 2, the keyword 123.google.com.456, to which 123 (0) belongs, contains 4 key words. The number of key words in the group starting with benchmark key word 123 therefore differs from the number of key words contained in the keyword to which 123 (0) belongs.
Step 308: confirm that the URL of the webpage to be visited contains a keyword in the malicious keyword table and that the webpage is a malicious webpage.
In this embodiment, it is confirmed that the URL of the webpage to be visited contains the keyword baidu.com, to which baidu (0) belongs.
Optionally, after step 308, the method further includes:
Step 309: process the network access request, for example by filtering, discarding or rejecting it.
Step 310: confirm that the URL of the webpage to be visited does not contain any keyword in the malicious keyword table and that the webpage is a normal webpage. That is, if the matching result contains no benchmark key word, the webpage to be visited can be confirmed directly to be a normal webpage, which speeds up processing.
Optionally, if the identification method shown in Fig. 3 is used only to filter network access requests that contain a malicious keyword, then in steps 305 to 307 processing can end as soon as the URL of the webpage to be visited is confirmed for the first time to contain a keyword in the malicious keyword table, that is, as soon as the judgment in step 307 succeeds once; the network access request is then filtered. If the judgment in step 307 has not succeeded by the time the last benchmark key word has been processed, the webpage to be visited is confirmed to be a normal webpage. If, besides filtering network access requests that contain a malicious keyword, the method is also used to update the malicious keyword table or to compile sociological statistics, steps 305 to 308 can be repeated until every benchmark key word in the matching result has been processed.
If the judgment in step 307 has not succeeded for any benchmark key word in the matching result, proceed to step 310.
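The two modes of use described above (stop at the first confirmed keyword, or process every benchmark key word) map naturally onto a lazy generator. The Python sketch below is one possible arrangement under that assumption, using a hand-written stand-in for the table of Fig. 2 and hypothetical function names.

```python
from collections import Counter

# (key word, serial number, keyword); list index = storage sequence number
TABLE = [
    ("baidu", 0, "baidu.com"), ("com", 1, "baidu.com"),
    ("123", 0, "123.google.com.456"), ("google", 1, "123.google.com.456"),
    ("com", 2, "123.google.com.456"), ("456", 3, "123.google.com.456"),
    ("ns*df", 0, "ns*df.ad_gd"), ("ad_gd", 1, "ns*df.ad_gd"),
]
SIZES = Counter(keyword for _, _, keyword in TABLE)

def iter_matched(storage_numbers):
    """Lazily yield each keyword whose key words were all matched."""
    present = set(storage_numbers)
    for start in sorted(present):
        if TABLE[start][1] != 0:  # not a benchmark key word
            continue
        n = 1                     # size of the group starting at `start`
        while start + n in present and TABLE[start + n][1] == n:
            n += 1
        if n == SIZES[TABLE[start][2]]:
            yield TABLE[start][2]

def is_malicious(storage_numbers):
    """Filtering only: stop at the first confirmed keyword."""
    return next(iter_matched(storage_numbers), None) is not None

matched = [0, 1, 2, 4, 5]          # result of the example in step 3033
print(is_malicious(matched))       # True
print(list(iter_matched(matched))) # ['baidu.com'] (statistics mode)
```

With a generator, the filtering-only path never examines the remaining benchmark key words, while the statistics path simply consumes the whole sequence.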
In the malicious webpage identification scheme provided by the embodiment of the present invention, a malicious keyword table is first constructed that contains the key words of each keyword and the serial number of each key word within the keyword to which it belongs, where each keyword consists of at least two key words. The URL of the webpage to be visited is matched against the key words in the malicious keyword table to obtain a matching result that includes the matched key words and their serial numbers. Based on the matching result, it is identified whether the URL of the webpage to be visited contains a malicious keyword. This provides an efficient implementation that achieves the goal of keyword filtering.
Correspondingly, the embodiment of the present invention also provides a device for identifying malicious webpages. As shown in Fig. 5, the device includes a matching unit 501, a selection unit 502, an acquiring unit 503, a judging unit 504 and a first confirmation unit 505, as follows:
The matching unit 501 is configured to match the URL of the webpage to be visited against the key words in the malicious keyword table and to record the matching result. The malicious keyword table sequentially stores the at least two key words contained in each keyword and the serial number of each key word within the keyword to which it belongs; the matching result includes the matched key words and their serial numbers;
The selection unit 502 is configured to select benchmark key words from the matching result; the serial number of a benchmark key word identifies it as the first key word of the keyword to which it belongs;
The acquiring unit 503 is configured to, if the number of benchmark key words selected by the selection unit 502 is at least one, obtain from the matching result, for each selected benchmark key word, the group of key words that starts with the benchmark key word, in which the serial number of each key word is consecutive with that of the previous key word;
The judging unit 504 is configured to determine whether the number of key words in this group equals the number of key words contained in the keyword to which the benchmark key word belongs;
The first confirmation unit 505 is configured to confirm, if the judgment result of the judging unit 504 is yes, that the URL of the webpage to be visited contains the keyword to which the benchmark key word belongs and is a malicious webpage URL.
Optionally, in order to obtain targeted statistics when counting the probability with which each keyword in the malicious keyword table is matched, or to obtain the keyword that best matches a malicious webpage when classifying webpages, the device shown in Fig. 5 further includes:
A statistics unit 506, configured to record the keyword to which the benchmark key word belongs after the first confirmation unit 505 confirms that the URL of the webpage to be visited contains that keyword, thereby obtaining all keywords contained in the URL of the webpage to be visited;
A selection unit 507, configured to select, from all keywords contained in the URL of the webpage to be visited as obtained by the statistics unit 506, the keyword containing the largest number of key words as the keyword that best matches the URL of the webpage to be visited.
Optionally, in order to speed up processing and improve processing efficiency, the device shown in Fig. 5 further includes:
A second confirmation unit 508, configured to confirm, if the number of benchmark key words selected by the selection unit 502 is 0, that the URL of the webpage to be visited does not contain any keyword in the malicious keyword table and is a normal webpage URL.
Those of ordinary skill in the art will understand that all or part of the steps of the method in the above embodiments may be carried out by a program instructing the relevant hardware, and that the program may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disc.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.