CN102339320B - Malicious web recognition method and device - Google Patents

Info

Publication number
CN102339320B
Authority
CN
China
Prior art keywords
key word
keyword
visited
benchmark
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201110345080
Other languages
Chinese (zh)
Other versions
CN102339320A (en)
Inventor
蒋柏梁
付凯
陈超
李亚文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Digital Technologies Chengdu Co Ltd
Original Assignee
Huawei Symantec Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Symantec Technologies Co Ltd filed Critical Huawei Symantec Technologies Co Ltd
Priority to CN 201110345080 priority Critical patent/CN102339320B/en
Publication of CN102339320A publication Critical patent/CN102339320A/en
Application granted granted Critical
Publication of CN102339320B publication Critical patent/CN102339320B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a method and a device for identifying malicious web pages. They solve the prior-art problem that, when the web pages provided by one website include both normal and malicious pages, the malicious pages cannot be filtered out while the user is still allowed to access the normal ones. The method comprises: matching the uniform resource locator (URL) of the web page to be visited against the key words in a malicious keyword list and recording the matching result; selecting benchmark key words from the matching result, where the serial number of a benchmark key word indicates that it is the first key word of the keyword to which it belongs; and, for each selected benchmark key word: obtaining a group of key words that starts at the benchmark key word; judging whether the number of key words in the group equals the number of key words in the keyword to which the benchmark key word belongs; and if so, confirming that the URL contains that keyword and that the web page to be visited is a malicious web page.

Description

Method and device for identifying malicious web pages
Technical field
The present invention relates to the technical field of computer networks, and in particular to a method and a device for identifying malicious web pages.
Background art
With the rapid development and spread of network technology, Internet resources have become abundant. While they satisfy people's everyday needs for study and entertainment, they have also brought potential problems and harm to society, such as the growing number of violent websites, pornographic websites, and web pages carrying viruses. These problems expose personal computers to considerable risk and also adversely affect the development of teenagers.
To address these problems, uniform resource locator (URL) filtering technology emerged. Shared resources on a network are normally located by URL, so identifying and filtering the URLs of illegal resources in user access requests prevents users from reaching those resources. For example, an operator can configure a policy on the network side to raise an alarm for, or block, certain illegal websites, thereby preventing users from visiting them. This scheme does not work, however, when a single website hosts both legitimate normal pages and illegal malicious pages. The reason is that a web page URL normally consists of the website URL plus the page's identifier within that website. If the website URL is added to a blacklist in order to restrict access to the malicious page URLs, users can no longer reach any link on that website, including the normal page URLs.
A website usually provides thousands of web pages, so web page URLs far outnumber website URLs. Storing malicious web page URLs in a blacklist would therefore take up a large amount of storage space. In addition, after the URL of the web page to be visited is extracted from the user's network access request, matching it against the malicious web page URLs in the blacklist consumes considerable system resources and is inefficient.
To address these problems, the prior art has proposed the idea of keyword filtering. Here a keyword consists of at least two key words, and a URL is considered to match the keyword only when the URL of the web page to be visited matches every key word in that keyword (that is, the URL contains each key word of the keyword). However, the prior art does not provide a concrete scheme for efficient keyword filtering.
Summary of the invention
An embodiment of the invention provides a method for identifying malicious web pages, to solve the problem that, when the web pages provided by one website include both normal and malicious pages, the prior art cannot filter out the malicious pages while still allowing users to visit the normal ones.
Correspondingly, an embodiment of the invention also provides a device for identifying malicious web pages.
The technical solutions provided by the embodiments of the invention are as follows:
A method for identifying malicious web pages, comprising:
matching the URL of the web page to be visited against the key words in a malicious keyword list and recording a matching result, where the malicious keyword list sequentially stores the at least two key words that each keyword comprises, together with the serial number of each key word within the keyword it belongs to, and the matching result comprises the matched key words and their serial numbers;
selecting benchmark key words from the matching result, where the serial number of a benchmark key word identifies it as the first key word of the keyword it belongs to;
if at least one benchmark key word is selected, performing the following steps for each selected benchmark key word:
obtaining, from the matching result, a group of key words that starts at the benchmark key word, where the serial number of each key word in the group is consecutive with that of the preceding key word;
judging whether the number of key words in the group equals the number of key words in the keyword to which the benchmark key word belongs; and if the numbers are equal, confirming that the URL of the web page to be visited contains the keyword to which the benchmark key word belongs and is therefore a malicious web page URL.
A device for identifying malicious web pages, comprising:
a matching unit, configured to match the URL of the web page to be visited against the key words in a malicious keyword list and record a matching result, where the malicious keyword list sequentially stores the at least two key words that each keyword comprises, together with the serial number of each key word within the keyword it belongs to, and the matching result comprises the matched key words and their serial numbers;
a selecting unit, configured to select benchmark key words from the matching result, where the serial number of a benchmark key word identifies it as the first key word of the keyword it belongs to;
an acquiring unit, configured to, if the selecting unit selects at least one benchmark key word, obtain from the matching result, for each selected benchmark key word, a group of key words that starts at the benchmark key word, where the serial number of each key word in the group is consecutive with that of the preceding key word;
a judging unit, configured to judge whether the number of key words in the group equals the number of key words in the keyword to which the benchmark key word belongs;
a first confirmation unit, configured to, if the judging unit's result is yes, confirm that the URL of the web page to be visited contains the keyword to which the benchmark key word belongs and is therefore a malicious web page URL.
In the embodiments of the invention, the URL of the web page to be visited is matched against the key words in a malicious keyword list and the matching result is recorded, where the list sequentially stores the at least two key words that each keyword comprises, together with the serial number of each key word within its keyword, and the matching result comprises the matched key words and their serial numbers. Benchmark key words are selected from the matching result; the serial number of a benchmark key word identifies it as the first key word of its keyword. For each benchmark key word, when the number of key words in the group of consecutively numbered key words that starts at the benchmark key word in the matching result equals the number of key words in the keyword to which the benchmark key word belongs, it is confirmed that the URL contains that keyword and is a malicious web page URL. This provides a way of distinguishing the normal and malicious web pages provided by the same website.
Description of drawings
To describe the technical solutions of the embodiments of the invention or of the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the main implementation principle of the embodiment of the invention;
Fig. 2 is a schematic diagram of an example of the malicious keyword list provided by the embodiment of the invention;
Fig. 3 is a flowchart of the method for identifying malicious web pages provided by the embodiment of the invention;
Fig. 4 is a schematic diagram of key word matching using an AC (Aho-Corasick) state machine in the embodiment of the invention;
Fig. 5 is a schematic structural diagram of the device for identifying malicious web pages provided by the embodiment of the invention.
Detailed description of the embodiments
Keyword filtering means that each keyword configured in the filtering device consists of at least two key words; for example, one key word can be a website URL and another can be the identifier of an illegal video. If the URL of the web page to be visited, extracted from the user's network access request, does not match every key word in a keyword, the network access request does not need to be filtered (for example, blocked or discarded). For instance, if the URL matches only the website URL in the keyword but not the identifier of the illegal video, the keyword is considered not matched and the network access request is not filtered. This avoids the situation in which users cannot reach the normal web page URLs of a website.
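The all-or-nothing rule described above can be illustrated with a minimal Python sketch. The function name, the example host and the video identifier are hypothetical and do not come from the patent; plain substring search stands in for whatever matcher a real filtering device would use.

```python
def matches_keyword(url, key_words):
    """Return True only if every key word of the keyword occurs in the URL."""
    url = url.lower()
    return all(kw.lower() in url for kw in key_words)


# Hypothetical keyword: a website name plus the identifier of an illegal video.
keyword = ["video.example.com", "id=bad123"]

# Same website, but only the first URL contains both key words.
print(matches_keyword("http://video.example.com/watch?id=bad123", keyword))
print(matches_keyword("http://video.example.com/watch?id=ok456", keyword))
```

Only the first request would be filtered; the second reaches the same website but is left alone, which is the behaviour a website-level blacklist cannot provide.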
How to match all the keywords in the keyword list efficiently, however, remains a problem to be solved.
The main implementation principle, the specific embodiments and the beneficial effects of the technical solutions of the embodiments of the invention are explained in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the main implementation flow of the embodiment of the invention is as follows:
Step 10: match the URL of the web page to be visited against the key words in the malicious keyword list and record the matching result, which comprises the matched key words and their corresponding serial numbers.
The malicious keyword list sequentially stores the key words that each keyword comprises, together with the serial number of each key word within the keyword it belongs to; each keyword comprises at least two key words.
Step 20: select benchmark key words from the matching result; the serial number of a benchmark key word identifies it as the first key word of the keyword it belongs to.
For each selected benchmark key word, perform the following steps 30 to 50:
Step 30: obtain, from the matching result, a group of key words that starts at the benchmark key word; within the group, the serial number of each key word is consecutive with that of the preceding key word and their storage positions are adjacent.
Step 40: judge whether the number of key words in the group equals the number of key words in the keyword to which the benchmark key word belongs; if so, go to step 50; otherwise, process the next benchmark key word.
Step 50: confirm that the URL of the web page to be visited contains a keyword in the malicious keyword list and that the page is a malicious web page.
Optionally, to make it easier to update the malicious keyword list or to carry out sociological research, statistics may be needed on how often each keyword in the list is matched, in particular targeted statistics; or, for the purpose of classifying web pages, the keyword that best matches a malicious web page may be needed. However, while steps 30 to 50 are executed in a loop, the URL of the web page to be visited may turn out to contain at least two keywords from the malicious keyword list. Therefore, after step 50, the method further comprises:
recording the keyword to which the benchmark key word belongs, thereby obtaining all the keywords that the URL of the web page to be visited contains;
selecting, from all the keywords that the URL contains, the keyword with the most key words as the keyword that best matches the URL. For example, if the URL contains two keywords, keyword A with 3 key words and keyword B with 2 key words, then keyword A is the best-matching keyword.
Further, when the keyword with the most key words is being selected, at least two keywords contained in the URL may have the same number of key words, and that number may be the largest among all the keywords the URL contains. The selection then proceeds as follows:
if exactly one keyword has the most key words, that keyword is taken as the keyword that best matches the URL of the web page to be visited;
if at least two keywords have the most key words, those keywords are ordered from nearest to farthest from the start position of the URL, and the one nearest to the start position is selected as the best-matching keyword.
For example, suppose the URL of the web page to be visited contains 3 keywords, A, B and C, where keywords A and C each comprise 3 key words and keyword B comprises 2. If the URL is "... (irrelevant words) + keyword A + ... (irrelevant words) + keyword B + keyword C", keyword A is taken as the best-matching keyword.
The purpose of selecting by distance from the start position of the URL, nearest first, is to give every keyword among those with the same number of key words an equal chance of being selected. Other selection methods can therefore also be used, such as random selection or selecting from farthest to nearest; they are not enumerated here.
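The selection rule above (most key words first, ties broken by proximity to the start of the URL) can be sketched in a few lines of Python. The tuple layout and the offsets are assumptions made for illustration; the patent does not specify how the position of a keyword in the URL is recorded.

```python
def select_best_keyword(matched):
    """Pick the best-matching keyword.

    matched: list of (name, key_word_count, start_offset_in_url) tuples,
    a hypothetical record of every keyword found in the URL.
    """
    if not matched:
        return None
    # Largest key word count wins; ties go to the smallest offset,
    # that is, the keyword nearest to the start of the URL.
    return min(matched, key=lambda m: (-m[1], m[2]))[0]


# Keywords A and C both have 3 key words; A appears earlier in the URL.
matched = [("A", 3, 4), ("B", 2, 20), ("C", 3, 31)]
print(select_best_keyword(matched))  # A
```

Replacing the tie-break with a random choice, or with the largest offset, gives the alternative selection methods the text mentions.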
An embodiment is described in detail below to explain and illustrate the main implementation principle of the method according to the above inventive principle.
The malicious keyword list configured in advance in the filtering device sequentially stores the key words that each keyword comprises, together with the serial number of each key word within the keyword it belongs to. The filtering device may be, for example, an access gateway or a router.
The malicious keyword list can be generated as follows: receive the keywords entered by the operator's administrators; sort the keywords according to a predefined rule such as input order or number of characters; after storing the sorted keywords, divide each keyword into at least two key words by dictionary-based word segmentation or by manual division; and record the serial number of each key word within the keyword it belongs to, as well as the storage number of each key word.
Fig. 2 shows an example of the malicious keyword list in this embodiment, which contains 3 keywords. Keyword A is "baidu.com" and comprises two key words, "baidu" and "com". Keyword B is "123.google.com.456" and comprises four key words, "123", "google", "com" and "456". Keyword C is "ns*df.ad_gd" and comprises two key words, "ns*df" and "ad_gd". The order of each key word within its keyword and the storage number of each key word are shown in Fig. 2.
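As a rough illustration of such a list, the Python sketch below builds one from the three example keywords. Splitting on "." is an assumption that happens to reproduce the key words named in the text; the patent itself uses dictionary-based segmentation or manual division. The 0-based storage numbers are inferred from the values quoted later in the embodiment (for example com (1, 4)), since Fig. 2 is not reproduced here. The `Entry` type and `build_table` name are hypothetical.

```python
from collections import namedtuple

# One row of the list: storage number, key word, serial number within
# its keyword, and the index of the keyword it belongs to.
Entry = namedtuple("Entry", "storage key_word serial keyword_id")


def build_table(keywords):
    """Split each keyword on '.' and store its key words sequentially."""
    table = []
    for keyword_id, keyword in enumerate(keywords):
        for serial, key_word in enumerate(keyword.split(".")):
            table.append(Entry(len(table), key_word, serial, keyword_id))
    return table


table = build_table(["baidu.com", "123.google.com.456", "ns*df.ad_gd"])
for entry in table:
    print(entry)
```

Note that "com" appears twice, at storage numbers 1 and 4, once for each keyword that contains it.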
Fig. 3 is a flowchart of the method for identifying malicious web pages provided by the embodiment of the invention; the method performs identification based on the malicious keyword list shown in Fig. 2.
Step 301: obtain the user's network access request.
The network access request may be a Hypertext Transfer Protocol (HTTP) request, a Wireless Application Protocol (WAP) request, or the like.
The network access request can be obtained as follows: intercept the packets flowing through a router or switch and identify network access requests by protocol analysis; the application-layer header of a network access request carries a predefined signature.
Step 302: extract the URL of the web page to be visited from the network access request.
For example, protocol analysis is performed on the HTTP request and the URL of the web page to be visited is obtained from the payload of the packet. In this embodiment, the URL of the web page to be visited is www.baidu.com?/v=123456.
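The patent does not say how the URL is assembled from the request. One plausible reading, shown here as a hedged Python sketch, is to join the Host header with the request target of an HTTP/1.x request. The function name and the sample request are illustrative assumptions, and packet capture and TCP reassembly are out of scope.

```python
def extract_url(raw_request):
    """Return host + request target from a raw HTTP/1.x request, or None."""
    head = raw_request.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    parts = lines[0].split(" ")
    # A request line has the form: METHOD SP target SP HTTP/x.y
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None
    target = parts[1]
    host = ""
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip()
            break
    return host + target


request = "GET /?v=123456 HTTP/1.1\r\nHost: www.baidu.com\r\n\r\n"
print(extract_url(request))  # www.baidu.com/?v=123456
```

Input that does not look like an HTTP request yields `None`, which a filter could treat as "not a network access request".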
Step 303: match the URL of the web page to be visited against the key words in the malicious keyword list shown in Fig. 2 and record the matching result.
The matching result comprises the matched key words and their corresponding serial numbers. The malicious keyword list sequentially stores the key words that each keyword comprises and the serial number of each key word within its keyword. The matching can use the SBOM (Set Backward Oracle Matching) algorithm, a K-trie matching algorithm, or an AC state machine. Steps 3031 to 3035 shown in Fig. 4 briefly describe matching with an AC state machine.
The AC state machine algorithm is a multi-pattern matching algorithm that introduces the concept of a finite state machine. A finite state machine represents all the possible states of a system and the state transitions that the system accepts. Processing starts from the initial state; each time the machine accepts an input event, it moves from the current state to the appropriate next state. The algorithm builds three functions, the goto function, the failure function and the output function, and obtains the matching result through them. Pattern matching thereby becomes a process of state transitions.
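For readers unfamiliar with the three functions, here is a compact character-level Aho-Corasick sketch in Python. It is the textbook construction, not the word-level state machine of steps 3031 to 3034, and the names `build_automaton` and `search` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link of each state
    output = [[]]    # patterns recognised on entering each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:                     # breadth-first over the trie
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(text, automaton):
    """Return (start_offset, pattern) for every occurrence in text."""
    goto, fail, output = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            hits.append((pos - len(pattern) + 1, pattern))
    return hits


key_words = ["baidu", "com", "123", "google", "456", "ns*df", "ad_gd"]
print(search("www.baidu.com?/v=123456", build_automaton(key_words)))
```

The automaton scans the URL once and reports every key word occurrence with its offset, regardless of how many key words the list holds.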
Step 3031: build all the possible states of the AC state machine from the key words in the malicious keyword list shown in Fig. 2. The initial state is the first key word in the list; the other possible states comprise combinations of the first key word with at least one other key word that follows it in order (not necessarily contiguously) in the list. For example, the initial state is baidu (0), and the possible states include state S1: baidu (0)/com (1), baidu (0)/com (1)/123 (2), ..., state Sn: baidu (0)/123 (2)/google (3), ....
Step 3032: obtain the first word in the URL to be visited that is identical to the initial state of the AC state machine.
In this embodiment, the first word in the URL www.baidu.com?/v=123456 that matches a key word in the malicious keyword list is baidu.
Step 3033: feed into the AC state machine, one by one, the words that follow that word in the URL to be visited. Each input word moves the AC state machine to the next state, that is, one transition occurs. When the last transition has occurred, record the key words in the key word combination corresponding to the current state, together with the storage number of each key word.
In this embodiment, after the last transition, the key words in the key word combination corresponding to the current state and their storage numbers are: baidu (0), com (1, 4), 123 (2), 456 (5).
Step 3034: sort the key words in the key word combination corresponding to the current state by storage number, then look up the correspondence between storage number and serial number for each key word in the malicious keyword list to obtain the matching result, that is, the sorted key words and the serial number of each.
For example, sorted by storage number the result is baidu (0)/com (1)/123 (2)/com (4)/456 (5). Looking these up in the malicious keyword list shown in Fig. 2 gives the matching result, written as key word (serial number): baidu (0)/com (1)/123 (0)/com (2)/456 (3).
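The matching result of this example can be reproduced with a short Python sketch. Plain substring search is used here in place of the AC state machine, purely to keep the example small, and the table layout with 0-based storage numbers is an assumption inferred from the values quoted in the text.

```python
# (key_word, serial number within its keyword), listed in storage order 0..7.
TABLE = [
    ("baidu", 0), ("com", 1),                            # keyword A
    ("123", 0), ("google", 1), ("com", 2), ("456", 3),   # keyword B
    ("ns*df", 0), ("ad_gd", 1),                          # keyword C
]


def match_url(url, table=TABLE):
    """Return [(storage, key_word, serial)] for table entries found in url."""
    url = url.lower()
    hits = [(storage, kw, serial)
            for storage, (kw, serial) in enumerate(table)
            if kw in url]
    return sorted(hits)  # sort by storage number, as in step 3034


result = match_url("www.baidu.com?/v=123456")
print("/".join(f"{kw}({serial})" for _, kw, serial in result))
# baidu(0)/com(1)/123(0)/com(2)/456(3)
```

Because "com" is stored twice, a single occurrence in the URL produces two entries in the matching result, one per storage number.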
Step 304: select the benchmark key words from the matching result; the serial number of a benchmark key word identifies it as the first key word of the keyword it belongs to.
Step 305: judge whether the number of selected benchmark key words is 0. If at least one benchmark key word was selected, perform steps 306 to 309 for each selected benchmark key word. If the number is 0, that is, the matching result contains no benchmark key word, go to step 310.
In this embodiment, the key words whose serial number is 0 are selected as benchmark key words; they are baidu and 123.
Step 306: obtain, from the matching result, a group of key words that starts at the benchmark key word; within the group, the serial number of each key word is consecutive with that of the preceding key word.
In the matching result, the group of key words starting at the benchmark key word baidu is baidu (0)/com (1).
Because the serial number of 123 (0) is not consecutive with that of the next key word, com (2), the group of key words starting at the benchmark key word 123 is just 123 (0).
Step 307: judge whether the number of key words in the group equals the number of key words in the keyword to which the benchmark key word belongs. If so, go to step 308; if not, return to step 305 and process the next benchmark key word.
In this embodiment, the group baidu (0)/com (1) starting at the benchmark key word baidu (0) contains 2 key words. According to Fig. 2, the keyword "baidu.com" to which baidu (0) belongs also comprises 2 key words, so the two numbers are equal.
The group 123 (0) starting at the benchmark key word 123 (0) contains 1 key word. According to Fig. 2, the keyword "123.google.com.456" to which 123 (0) belongs comprises 4 key words, so the two numbers differ.
Step 308: confirm that the URL of the web page to be visited contains a keyword in the malicious keyword list and that the page is a malicious web page.
In this embodiment, it is confirmed that the URL of the web page to be visited contains the keyword "baidu.com" to which baidu (0) belongs.
Optionally, after step 308, the method further comprises:
Step 309: filter, discard or reject the network access request, or process it in a similar way.
Step 310: confirm that the URL of the web page to be visited contains no keyword in the malicious keyword list and that the page is a normal web page. In other words, if the matching result contains no benchmark key word, the page can be confirmed as normal directly, which speeds up processing.
Optionally, if the identification method shown in Fig. 3 is used only to filter network access requests that contain malicious keywords, then in steps 305 to 307 it is enough to confirm once that the URL of the web page to be visited contains a keyword in the malicious keyword list; that is, as soon as the check in step 307 succeeds once, processing can end and the network access request can be filtered. If the check in step 307 has not succeeded for any benchmark key word by the time the last one has been processed, the web page to be visited is confirmed as a normal web page. If the method shown in Fig. 3 is used not only to filter network access requests containing malicious keywords but also to update the malicious keyword list, compile sociological statistics and so on, steps 305 to 308 can be repeated until every benchmark key word in the matching result has been processed.
If the check in step 307 has not succeeded for any benchmark key word once every benchmark key word in the matching result has been processed, go to step 310.
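Steps 304 to 310 can be sketched in Python as follows. The table layout, the owner labels "A" to "C" and the `stop_at_first` flag are illustrative assumptions. Because key words are stored sequentially, the sketch tests "adjacent storage number and same owning keyword", which implies consecutive serial numbers.

```python
# Storage order 0..7: (key_word, serial number, owning keyword).
TABLE = [
    ("baidu", 0, "A"), ("com", 1, "A"),
    ("123", 0, "B"), ("google", 1, "B"), ("com", 2, "B"), ("456", 3, "B"),
    ("ns*df", 0, "C"), ("ad_gd", 1, "C"),
]
SIZE = {}
for _, _, owner in TABLE:
    SIZE[owner] = SIZE.get(owner, 0) + 1  # key words per keyword


def identify(matched_storage, stop_at_first=True):
    """Return the keywords fully contained in the URL.

    matched_storage: sorted storage numbers of the matched key words.
    An empty result means the page is treated as normal (step 310).
    """
    found = []
    for i, storage in enumerate(matched_storage):
        _, serial, owner = TABLE[storage]
        if serial != 0:          # not a benchmark key word (step 304)
            continue
        count, prev = 1, storage
        for nxt in matched_storage[i + 1:]:   # grow the group (step 306)
            if nxt != prev + 1 or TABLE[nxt][2] != owner:
                break
            count, prev = count + 1, nxt
        if count == SIZE[owner]:              # compare counts (step 307)
            found.append(owner)               # malicious (step 308)
            if stop_at_first:                 # filter-only mode
                break
    return found


print(identify([0, 1, 2, 4, 5]))  # ['A']
```

With `stop_at_first=False` the loop visits every benchmark key word, which corresponds to the statistics-gathering mode described above.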
In the scheme for identifying malicious web pages provided by the embodiment of the invention, a malicious keyword list is first built that contains the key words of each keyword and the serial number of each key word within its keyword, where each keyword consists of at least two key words. The URL of the web page to be visited is matched against the key words in the list, yielding a matching result that comprises the matched key words and their serial numbers. Based on the matching result, it is determined whether the URL contains a malicious keyword. This provides an efficient way of achieving keyword filtering.
Correspondingly, an embodiment of the invention also provides a device for identifying malicious web pages. As shown in Fig. 5, the device comprises a matching unit 501, a selecting unit 502, an acquiring unit 503, a judging unit 504 and a first confirmation unit 505, as follows:
the matching unit 501 is configured to match the URL of the web page to be visited against the key words in a malicious keyword list and record a matching result, where the malicious keyword list sequentially stores the at least two key words that each keyword comprises, together with the serial number of each key word within the keyword it belongs to, and the matching result comprises the matched key words and their serial numbers;
the selecting unit 502 is configured to select benchmark key words from the matching result, where the serial number of a benchmark key word identifies it as the first key word of the keyword it belongs to;
the acquiring unit 503 is configured to, if the selecting unit 502 selects at least one benchmark key word, obtain from the matching result, for each selected benchmark key word, a group of key words that starts at the benchmark key word, where the serial number of each key word in the group is consecutive with that of the preceding key word;
the judging unit 504 is configured to judge whether the number of key words in the group equals the number of key words in the keyword to which the benchmark key word belongs;
the first confirmation unit 505 is configured to, if the result of the judging unit 504 is yes, confirm that the URL of the web page to be visited contains the keyword to which the benchmark key word belongs and is therefore a malicious web page URL.
Optionally, to obtain targeted statistics on how often each keyword in the malicious keyword list is matched, or to obtain the keyword that best matches a malicious web page when classifying web pages, the device shown in Fig. 5 further comprises:
a statistics unit 506, configured to record, after the first confirmation unit 505 confirms that the URL of the web page to be visited contains the keyword to which the benchmark key word belongs, that keyword, thereby obtaining all the keywords that the URL contains;
a selecting unit 507, configured to select, from all the keywords contained in the URL as obtained by the statistics unit 506, the keyword with the most key words as the keyword that best matches the URL of the web page to be visited.
Optionally, to speed up processing and improve efficiency, the device shown in Fig. 5 further comprises:
a second confirmation unit 508, configured to, if the selecting unit 502 selects no benchmark key word, confirm that the URL of the web page to be visited contains no keyword in the malicious keyword list and is a normal web page URL.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disc.
Obviously, a person skilled in the art can make various changes and variations to the invention without departing from its spirit and scope. If these changes and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to cover them as well.

Claims (8)

1. A method for recognizing a malicious web page, characterized by comprising:
Matching the uniform resource locator (URL) of a web page to be visited against the key words in a malicious keyword list and recording a matching result, wherein the malicious keyword list sequentially stores at least two keywords and, for each key word that each keyword comprises, the serial number of that key word within the keyword to which it belongs; and the matching result comprises the matched key words and their serial numbers;
Selecting benchmark key words from the matching result, wherein the serial number of a benchmark key word identifies it as the first key word in the keyword to which it belongs;
If the number of benchmark key words is at least one, performing the following steps for each selected benchmark key word:
Obtaining from the matching result a group of key words that starts with the benchmark key word, wherein the serial number of each key word in the group is consecutive with that of the previous key word and their storage positions are adjacent;
Judging whether the number of key words in the group is equal to the number of key words comprised by the keyword to which the benchmark key word belongs; and if the numbers are equal, confirming that the URL of the web page to be visited contains the keyword to which the benchmark key word belongs and that the web page is a malicious web page.
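The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `Entry`, `build_list`, `match`, `contained_keywords` and `is_malicious` are hypothetical, each keyword is modeled as a list of key words, and plain substring search stands in for whatever matching engine is actually used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    word: str        # one key word
    keyword_id: int  # index of the keyword this key word belongs to
    serial: int      # 1-based serial number within that keyword
    slot: int        # storage position in the flattened list

def build_list(keywords):
    """Store the keywords sequentially, one entry per key word (assumed layout)."""
    entries = []
    for kid, words in enumerate(keywords):
        for serial, word in enumerate(words, start=1):
            entries.append(Entry(word, kid, serial, len(entries)))
    return entries

def match(url, entries):
    """Stand-in matcher: substring search, result ordered by storage position."""
    return [e for e in entries if e.word in url]

def contained_keywords(url, keywords):
    """Return the ids of all keywords whose key words are all matched in the URL."""
    result = match(url, build_list(keywords))
    hits = []
    for i, e in enumerate(result):
        if e.serial != 1:
            continue  # only a first key word can be a benchmark key word
        run = 1
        # Extend the group while serial numbers are consecutive
        # and storage positions are adjacent.
        while (i + run < len(result)
               and result[i + run].serial == result[i + run - 1].serial + 1
               and result[i + run].slot == result[i + run - 1].slot + 1):
            run += 1
        if run == len(keywords[e.keyword_id]):
            hits.append(e.keyword_id)
    return hits

def is_malicious(url, keywords):
    return bool(contained_keywords(url, keywords))
```

With the illustrative list `[["free", "casino", "bonus"], ["crack", "keygen"]]`, a URL containing only `free` and `bonus` yields a group of length 1 for the benchmark key word `free`, which differs from 3, so the keyword is not reported.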
2. The method of claim 1, characterized in that matching the URL of the web page to be visited against the key words in the malicious keyword list and recording the matching result comprises:
Constructing all possible states of an Aho-Corasick (AC) state machine from the key words in the malicious keyword list, wherein the initial state is the first key word in the malicious keyword list, and the other possible states comprise combinations of the first key word with at least one other key word that follows it in order in the malicious keyword list;
Obtaining the first word in the URL to be visited that is identical to the initial state of the AC state machine;
Inputting into the AC state machine, in turn, each word in the URL to be visited that follows the word identical to the initial state; when the AC state machine makes its last transition, recording each key word in the key word combination corresponding to the current state, together with the storage sequence number of each key word;
Sorting the key words in the key word combination corresponding to the current state by storage sequence number, and, by looking up the correspondence between the storage sequence number and the serial number of each key word in the malicious keyword list, obtaining the sorted key words and their serial numbers as the matching result.
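For orientation, the following is a sketch of textbook character-level Aho-Corasick matching in Python, not the state layout described in claim 2. The names `build_automaton`, `match_url` and `match_result` are hypothetical, and the parallel list `serials` is an assumed way of holding the correspondence between storage sequence numbers and serial numbers.

```python
from collections import deque

def build_automaton(words):
    """Build goto, failure and output tables for the given key words."""
    goto, fail, out = [{}], [0], [[]]
    for idx, w in enumerate(words):          # idx acts as the storage sequence number
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(idx)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit matches ending at the suffix state
    return goto, fail, out

def match_url(url, words):
    """Return the sorted storage sequence numbers of all key words found in url."""
    goto, fail, out = build_automaton(words)
    s, found = 0, set()
    for ch in url:
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        found.update(out[s])
    return sorted(found)

def match_result(url, words, serials):
    """Pair each matched key word with its serial number, in storage order."""
    return [(words[i], serials[i]) for i in match_url(url, words)]
```

Sorting by storage sequence number before looking up serial numbers mirrors the last step of the claim, so that consecutive serial numbers in the result correspond to adjacent storage positions.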
3. The method of claim 1 or 2, characterized in that, after confirming that the URL of the web page to be visited contains the keyword to which the benchmark key word belongs, the method further comprises:
Recording the keyword to which the benchmark key word belongs, thereby obtaining all keywords contained in the URL of the web page to be visited;
Selecting, from all the keywords contained in the URL of the web page to be visited, the keyword comprising the largest number of key words as the keyword that best matches the URL of the web page to be visited.
4. The method of claim 3, characterized in that selecting the keyword comprising the largest number of key words as the keyword that best matches the URL of the web page to be visited comprises:
If exactly one keyword comprises the largest number of key words, taking that keyword as the keyword that best matches the URL of the web page to be visited;
If more than one keyword comprises the largest number of key words, selecting one of them as the keyword that best matches the URL of the web page to be visited, in order of distance from the start position of the URL, from nearest to farthest.
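The selection rule of claims 3 and 4 can be sketched as follows. This is an illustrative reading rather than the patented procedure: the function name `best_match` is hypothetical, the distance of a keyword from the start of the URL is assumed to be the position of the first occurrence of its first key word, and the nearest candidate is assumed to win a tie.

```python
def best_match(url, contained):
    """Pick the best-matching keyword among those already confirmed in the URL.

    contained: list of keywords, each a list of key words that all occur in url.
    """
    if not contained:
        return None
    most = max(len(k) for k in contained)
    tied = [k for k in contained if len(k) == most]
    # Tie-break (assumption): the keyword whose first key word appears
    # closest to the start of the URL is selected.
    return min(tied, key=lambda k: url.find(k[0]))
```

For example, if `["free", "casino"]` and `["crack", "keygen"]` are both contained in a URL and `crack` appears earlier than `free`, the sketch selects `["crack", "keygen"]`.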
5. The method of claim 1, characterized in that, if the number of benchmark key words is 0, it is confirmed that the URL of the web page to be visited contains no keyword from the malicious keyword list and is a normal web page URL.
6. A device for recognizing a malicious web page, characterized by comprising:
Matching unit, configured to match the URL of a web page to be visited against the key words in a malicious keyword list and record a matching result, wherein the malicious keyword list sequentially stores at least two keywords and, for each key word that each keyword comprises, the serial number of that key word within the keyword to which it belongs; and the matching result comprises the matched key words and their serial numbers;
Selection unit, configured to select benchmark key words from the matching result, wherein the serial number of a benchmark key word identifies it as the first key word in the keyword to which it belongs;
Acquiring unit, configured to obtain from the matching result, if the number of benchmark key words selected by the selection unit is at least one, for each selected benchmark key word, a group of key words that starts with the benchmark key word, wherein the serial number of each key word in the group is consecutive with that of the previous key word;
Judging unit, configured to judge whether the number of key words in the group is equal to the number of key words comprised by the keyword to which the benchmark key word belongs;
First confirmation unit, configured to confirm, if the judgment result of the judging unit is yes, that the URL of the web page to be visited contains the keyword to which the benchmark key word belongs and is a malicious web page URL.
7. The device of claim 6, characterized by further comprising:
Statistic unit, configured to record, after the first confirmation unit confirms that the URL of the web page to be visited contains the keyword to which the benchmark key word belongs, that keyword, thereby obtaining all keywords contained in the URL of the web page to be visited;
Selection unit, configured to select, from all the keywords contained in the URL of the web page to be visited that were obtained by the statistic unit, the keyword comprising the largest number of key words as the keyword that best matches the URL of the web page to be visited.
8. The device of claim 6 or 7, characterized by further comprising:
Second confirmation unit, configured to confirm, if the number of benchmark key words selected by the selection unit is 0, that the URL of the web page to be visited contains no keyword from the malicious keyword list and is a normal web page URL.
CN 201110345080 2011-11-04 2011-11-04 Malicious web recognition method and device Expired - Fee Related CN102339320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110345080 CN102339320B (en) 2011-11-04 2011-11-04 Malicious web recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110345080 CN102339320B (en) 2011-11-04 2011-11-04 Malicious web recognition method and device

Publications (2)

Publication Number Publication Date
CN102339320A CN102339320A (en) 2012-02-01
CN102339320B true CN102339320B (en) 2013-08-28

Family

ID=45515053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110345080 Expired - Fee Related CN102339320B (en) 2011-11-04 2011-11-04 Malicious web recognition method and device

Country Status (1)

Country Link
CN (1) CN102339320B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103428183B (en) * 2012-05-23 2017-02-08 北京新媒传信科技有限公司 Method and device for identifying malicious website
CN102801709B (en) * 2012-06-28 2015-03-04 北京奇虎科技有限公司 Phishing website identification system and method
CN102855320A (en) * 2012-09-04 2013-01-02 珠海市君天电子科技有限公司 Method and device for collecting keyword related URL (uniform resource locator) by search engine
CN103368957B (en) * 2013-07-04 2017-03-15 北京奇虎科技有限公司 Method and system that web page access behavior is processed, client, server
CN103455758A (en) * 2013-08-22 2013-12-18 北京奇虎科技有限公司 Method and device for identifying malicious website
CN103475642A (en) * 2013-08-22 2013-12-25 北京奇虎科技有限公司 Malicious forum identification method and malicious forum identification device
CN103501306B (en) * 2013-10-23 2016-09-14 腾讯科技(武汉)有限公司 A kind of network address knows method for distinguishing, server and system
WO2015087835A1 (en) * 2013-12-10 2015-06-18 日本電信電話株式会社 Url matching device, url matching method, and url matching program
CN103955449B (en) * 2014-04-21 2018-03-06 安一恒通(北京)科技有限公司 The method and apparatus for positioning target sample
CN104158828B (en) * 2014-09-05 2018-05-18 北京奇虎科技有限公司 The method and system of suspicious fishing webpage are identified based on cloud content rule base
CN104486098A (en) * 2014-11-26 2015-04-01 中国建设银行股份有限公司 Access fault monitoring method and device
CN104540024B (en) * 2014-12-18 2019-03-08 网宿科技股份有限公司 Video terminal and its method and system for limiting video playing
CN105138912A (en) * 2015-09-25 2015-12-09 北京奇虎科技有限公司 Method and device for generating phishing website detection rules automatically
CN105938475A (en) * 2015-12-28 2016-09-14 杭州迪普科技有限公司 Keyword filtering method and device
CN107526967B (en) * 2017-07-05 2020-06-02 阿里巴巴集团控股有限公司 Risk address identification method and device and electronic equipment
CN109241462B (en) * 2018-08-13 2021-12-14 中国联合网络通信集团有限公司 Webpage black word processing method, device, equipment and storage medium
CN111061972B (en) * 2019-12-25 2023-05-16 武汉绿色网络信息服务有限责任公司 AC searching optimization method and device for URL path matching

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101196898A (en) * 2007-08-21 2008-06-11 新百丽鞋业(深圳)有限公司 Method for applying phrase index technology into internet search engine

Also Published As

Publication number Publication date
CN102339320A (en) 2012-02-01

Similar Documents

Publication Publication Date Title
CN102339320B (en) Malicious web recognition method and device
CN102857493B (en) Content filtering method and device
CN103530365B (en) Obtain the method and system of the download link of resource
CN103116638B (en) Webpage screening method and device thereof
CN107526807A (en) Information recommendation method and device
CN104933056A (en) Uniform resource locator (URL) de-duplication method and device
CN103581909B (en) The localization method of a kind of doubtful mobile phone Malware and device thereof
CN104516982A (en) Method and system for extracting Web information based on Nutch
Sukumar et al. Review on modern Data Preprocessing techniques in Web usage mining (WUM)
CN103077250B (en) A kind of capturing webpage contents method and device
CN109104421B (en) Website content tampering detection method, device, equipment and readable storage medium
CN104537070A (en) Method and equipment for mining tourism destination attractions
CN109359263B (en) User behavior feature extraction method and system
CN110737821B (en) Similar event query method, device, storage medium and terminal equipment
CN103455758A (en) Method and device for identifying malicious website
CN106033428B (en) The selection method of uniform resource locator and the selection device of uniform resource locator
CN107483381A (en) The monitoring method and device of interlock account
CN107832333B (en) Method and system for constructing user network data fingerprint based on distributed processing and DPI data
CN104967698B (en) A kind of method and apparatus crawling network data
CN103123640A (en) Method and device for searching novel
CN105677921A (en) Method and system for acquiring Internet public opinion data
CN109756467A (en) A kind of recognition methods of fishing website and device
CN106411951A (en) Network attack behavior detection method and device
CN105989019B (en) A kind of method and device for cleaning data
CN105630983A (en) Resource obtaining and optimizing device and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent for invention or patent application
CB02 Change of applicant information

Address after: High tech Park No. 88 University of Electronic Science and technology of Sichuan province in 611721 Chengdu city high tech Zone West Park area Qingshui River Tianchen Road No. 5 building D

Applicant after: Huawei Symantec Technologies Co., Ltd.

Address before: High tech Park No. 88 University of Electronic Science and technology of Sichuan province in 611721 Chengdu city high tech Zone West Park area Qingshui River Tianchen Road No. 5 building D

Applicant before: Chengdu Huawei Symantec Technologies Co., Ltd.

COR Change of bibliographic data

Free format text: CORRECT: APPLICANT; FROM: CHENGDU HUAWEI SYMANTEC TECHNOLOGIES CO., LTD. TO: HUAWEI DIGITAL TECHNOLOGY (CHENGDU) CO., LTD.

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130828

Termination date: 20201104
