CN105956472B

CN105956472B - Method and system for identifying whether a webpage contains malicious content

Info

Publication number: CN105956472B
Application number: CN201610313359.3A
Authority: CN
Inventors: 李唱; 康靖; 陈虎
Original assignee: Baoli Nine Chapters (beijing) Data Technology Co Ltd
Current assignee: Quantum innovation (Beijing) Information Technology Co., Ltd
Priority date: 2016-05-12
Filing date: 2016-05-12
Publication date: 2019-10-18
Anticipated expiration: 2036-05-12
Also published as: CN105956472A

Abstract

The invention discloses methods for identifying whether a webpage contains malicious content. One of the identification methods comprises the steps of: parsing the URL of a webpage to be identified to extract URL features from the URL and generate a first feature set; generating a first feature vector from the first feature set; and processing the first feature vector with a first feature model and outputting a first result that characterizes whether the webpage to be identified contains malicious content. The invention also discloses three other identification methods and the corresponding systems for identifying whether a webpage contains malicious content.

Description

Method and system for identifying whether a webpage contains malicious content

Technical field

The present invention relates to the technical field of network security, and in particular to a method and system for identifying whether a webpage contains malicious content.

Background art

With the development of the Internet, web-based applications have become increasingly popular. People can check bank accounts, shop online, and so on through a browser, and the web provides a convenient and efficient mode of interaction. The accompanying problem is that attacks by malicious websites multiply year by year. Such websites disguise their identity through a series of technical means to win the user's trust and then seek illegal gains, and users suffer huge economic losses under these attacks. How to identify malicious content in webpages and guard against malicious websites has therefore become an important research topic in the field of network security.

Existing techniques for guarding against malicious websites mainly take the URL of a suspicious webpage and query it against a blacklist database. However, because phishing websites are constantly updated, this approach has a low detection rate for malicious websites such as phishing websites and lags behind them. Other approaches scan the webpage content for malicious keywords, or extract basic features of the webpage image and compute the similarity between the suspicious webpage and the genuine webpage to judge whether the suspicious webpage is an imitation. Each of these methods has its own limitations, which leads to a high misjudgment rate.

Summary of the invention

To this end, the present invention provides methods and systems for identifying whether a webpage contains malicious content, in an effort to solve or at least alleviate at least one of the problems above.

According to one aspect of the invention, there is provided a method for identifying whether a webpage contains malicious content, comprising the steps of: parsing the URL of a webpage to be identified to extract URL features from the URL and generate a first feature set; generating a first feature vector from the first feature set; and processing the first feature vector with a first feature model and outputting a first result that characterizes whether the webpage to be identified contains malicious content.

The identification method according to the invention further comprises a preprocessing step: extracting the URL of the webpage to be identified and judging whether it matches a URL in a pre-stored database; if the URL of the webpage to be identified is in a first pre-stored database, judging that the webpage contains malicious content; and if the URL of the webpage to be identified is in a second pre-stored database, judging that the webpage does not contain malicious content.

According to another aspect of the invention, there is provided a method for identifying whether a webpage contains malicious content, comprising the steps of: crawling the content of a webpage to be identified and segmenting the crawled content into a word sequence; constructing a second feature vector whose dimension is a first predetermined number according to whether the feature words of a second feature set are present in the word sequence, where the second feature set stores the first predetermined number of feature words in advance; and processing the second feature vector with a second feature model and outputting a second result that characterizes whether the webpage to be identified contains malicious content.

According to yet another aspect of the invention, there is provided a method for identifying whether a webpage contains malicious content, comprising the steps of: extracting first identity information of a webpage to be identified from its URL; extracting all external links of the webpage to be identified; determining second identity information of the webpage to be identified from the external links; and comparing the first identity information with the second identity information and outputting a third result that characterizes whether the webpage to be identified contains malicious content.

According to yet another aspect of the invention, there is provided a method for identifying whether a webpage contains malicious content, comprising the steps of: executing the identification method described above to output the first result; executing the identification method described above to output the second result; executing the identification method described above to output the third result; applying a weighting algorithm to the first, second, and third results to obtain a final result; if the final result is greater than a threshold, determining that the webpage to be identified contains malicious content; and if the final result is not greater than the threshold, determining that the webpage to be identified does not contain malicious content.

Correspondingly, the invention also provides four systems for identifying whether a webpage contains malicious content, corresponding respectively to the four identification methods above.

Based on the description above, this scheme aims to provide an efficient and widely applicable way of identifying malicious webpages, which includes the following identification methods:

First, the URL of the webpage to be identified is filtered against a blacklist and a whitelist;

Then, the URL of the webpage to be identified is parsed to extract a first feature set, the first feature set is processed with a machine learning model, and a first result is output to characterize whether the webpage to be identified contains malicious content;

Meanwhile, a second feature vector is extracted from the content of the webpage to be identified, the second feature vector is processed with a machine learning model, and a second result is output to characterize whether the webpage to be identified contains malicious content;

Alternatively, the identity information of the webpage to be identified and of the webpages its external links point to is analyzed to judge whether the webpage to be identified is a suspected imitation, and a third result is output to characterize whether the webpage to be identified contains malicious content;

Finally, a weighting operation can also be applied to the first, second, and third results to reach a more comprehensive judgment.

In this way, on top of the traditional blacklist/whitelist identification method, the scheme combines machine learning models with an imitation-suspicion identification method and considers both URL features and webpage content. This addresses the lag of blacklist/whitelist identification, gives the scheme some ability to detect unknown malicious websites, and saves human resources because webpages are identified automatically. Moreover, the identification methods above can be flexibly selected and combined according to the needs of the application scenario, so that whether a webpage contains malicious content can be identified quickly and accurately.

Brief description of the drawings

To accomplish the foregoing and related purposes, certain illustrative aspects are described herein in conjunction with the following description and the drawings. These aspects indicate the various ways in which the principles disclosed herein can be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the disclosure will become more apparent from the following detailed description read in conjunction with the drawings. Throughout the disclosure, identical reference numerals generally refer to identical components or elements.

Fig. 1 shows a flow chart of a method 100 for identifying whether a webpage contains malicious content according to one embodiment of the invention;

Fig. 2 shows a flow chart of a method 200 for identifying whether a webpage contains malicious content according to another embodiment of the invention;

Fig. 3 shows a flow chart of a method 300 for identifying whether a webpage contains malicious content according to yet another embodiment of the invention;

Fig. 4 shows a flow chart of a method 400 for identifying whether a webpage contains malicious content according to yet another embodiment of the invention;

Fig. 5 shows a schematic diagram of a system 500 for identifying whether a webpage contains malicious content according to one embodiment of the invention;

Fig. 6 shows a schematic diagram of a system 600 for identifying whether a webpage contains malicious content according to another embodiment of the invention;

Fig. 7 shows a schematic diagram of a system 700 for identifying whether a webpage contains malicious content according to yet another embodiment of the invention; and

Fig. 8 shows a schematic diagram of a system 800 for identifying whether a webpage contains malicious content according to yet another embodiment of the invention.

Detailed description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope fully conveyed to those skilled in the art.

Fig. 1 shows a flow chart of a method 100 for identifying whether a webpage contains malicious content according to one embodiment of the invention.

According to one embodiment of the invention, to improve the efficiency of identifying malicious webpages, the input webpage to be identified is first preprocessed: it is filtered against a blacklist and a whitelist so that easily identified webpages are screened out. Specifically, the URL of the webpage to be identified is extracted and checked against the URLs in the pre-stored databases (i.e. the blacklist and the whitelist). If the URL is in the first pre-stored database (the blacklist), the webpage is judged to contain malicious content; if the URL is in the second pre-stored database (the whitelist), the webpage is judged not to contain malicious content. Webpages that match neither list proceed to step S110 for further analysis.

The code logic of blacklist/whitelist filtering is shown as follows, where whitelist denotes the white list and blacklist denotes the blacklist:

The preprocessing step first screens out easily identified webpages with a simple check, and only then is the remaining webpage analyzed. The preprocessing step can be combined with any of the other identification methods; the invention is not limited in this regard.
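The filtering logic described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the listing from the original document: the example URLs, the in-memory sets, and the function name `prefilter` are hypothetical, and a real deployment would query the pre-stored databases instead.

```python
from typing import Optional

# Hypothetical contents; in practice these would be the first (blacklist)
# and second (whitelist) pre-stored databases.
blacklist = {"http://phish.example.com/login"}
whitelist = {"http://www.baidu.com/"}


def prefilter(url: str) -> Optional[int]:
    """Return 1 if the URL is blacklisted, 0 if whitelisted, None otherwise.

    None means the URL matched neither list and must be analyzed further
    (step S110 onward).
    """
    if url in blacklist:
        return 1  # judged to contain malicious content
    if url in whitelist:
        return 0  # judged not to contain malicious content
    return None   # undecided: continue with the later steps


print(prefilter("http://phish.example.com/login"))
print(prefilter("http://unknown.example.org/"))
```

Returning `None` for the undecided case keeps the prefilter separate from the later classifiers, which matches the description of the step as a screen that can be combined with any of the other methods.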

In step S110, the URL of the webpage to be identified is parsed to extract URL features from the URL and generate the first feature set.

Each segment of a URL conveys specific information to the client and the server, and the URL of a webpage can be decomposed into several major parts as follows:

The individual elements, such as the protocol, the host, and the path, are not described in detail here. Take the following URL as an example:

http://www.baidu.com/path/index.hrml?q=adf

Parsing yields:

protocol: http

host: www.baidu.com

path: path/index.hrml?q=adf

pathname: path/index.hrml

query: ?q=adf
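The decomposition above can be reproduced with Python's standard `urllib.parse` module. The sketch below is one possible way to obtain the same five fields; the function name `decompose_url` and the dictionary keys are chosen here for illustration, and the leading slash is stripped and the `?` re-attached only to match the field layout shown above.

```python
from urllib.parse import urlsplit


def decompose_url(url: str) -> dict:
    """Split a URL into the protocol/host/path/pathname/query fields."""
    parts = urlsplit(url)
    pathname = parts.path.lstrip("/")                   # "path/index.hrml"
    query = "?" + parts.query if parts.query else ""    # "?q=adf"
    return {
        "protocol": parts.scheme,
        "host": parts.hostname or "",
        "path": pathname + query,  # pathname followed by the query string
        "pathname": pathname,
        "query": query,
    }


fields = decompose_url("http://www.baidu.com/path/index.hrml?q=adf")
for name, value in fields.items():
    print(f"{name}: {value}")
```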

URL features are then extracted to generate the first feature set.

According to an embodiment of the invention, 18 structural features and 7 lexical features of the URL are extracted in total, as follows (F_i denotes the i-th feature):

F₁: url_len, the length of the URL; the URLs of malicious webpages are usually very long;

F₂: http_n, the number of times the http protocol appears. Links to webpages containing malicious content, such as phishing links, often use the http protocol several times to change where the link leads and steer the user to the prepared phishing website. For example, http://www.taobao.com/url?q=http://www.59adfadss123.com appears to lead to the Taobao home page, but clicking it actually redirects the user to the phishing website that follows. A link that uses the http protocol several times is therefore likely to be a phishing link;

F₃: tld_inht, whether the top-level domain is legal, where 1 indicates legal and 0 indicates illegal;

F₄: is_ip, whether the link contains an IP address. A link that contains an IP address is likely to be a phishing link, whereas legitimate links rarely contain one. Likewise, 1 indicates yes and 0 indicates no;

F₅ and F₆ give the number of specified characters in the URL:

F₅: url_n_percent, the number of '%' characters in the link. A URL containing '%' usually uses Unicode encoding, for example,

http://www.taobao.com@%77%77%77%2E%70%68%69%73%68%2E%63%6F%6D;

F₆: url_n_token, the number of separators such as '_', '-', '&', '#', and '?' in the link;

F₇: host_len, the length of the host string;

F₈ and F₉ give the number of specified characters in the host string:

F₈: host_n_dot, the number of dot separators in the host string;

F₉: host_n_token, the number of separators such as '_', '-', '&', '#', and '?' in the host string;

F₁₀: host_max_len, the length of the longest substring obtained when the host string is split on dots. For example, www.taobao.1242.59adfadss123.com splits into "www", "taobao", "1242", "59adfadss123", and "com", so F₁₀ = 12;

F₁₁ and F₁₂ give the number of specified characters in the path:

F₁₁: path_n_dot, the number of dot separators in the path;

F₁₂: path_n_token, the number of separators such as '_', '-', '&', '#', and '?' in the path;

F₁₃: pathname_len, the length of the pathname;

F₁₄ and F₁₅ give the number of specified characters in the pathname:

F₁₄: pathname_n_dot, the number of dot separators in the pathname;

F₁₅: pathname_n_token, the number of separators such as '_', '-', '&', '#', and '?' in the pathname;

F₁₆: pathname_max_len, the length of the longest substring obtained when the pathname is split on '/', analogous to F₁₀;

F₁₇: n_subdir, the path depth, measured by the '/' characters in the pathname; malicious links usually deepen the path to confuse the user;

F₁₈: query_len, the length of the query field;

F₁₉ to F₂₅: whether the URL contains the strings "secure", "account", "webscr", "login", "signin", "banking", and "confirm", respectively; malicious links usually contain these strings.

This embodiment gives only one example of the first feature set. The first feature set may include at least one of the URL features above, and other URL features may also be extracted; the invention is not limited in this regard.
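A subset of the features listed above can be computed as in the following sketch. It is an illustration only, under several assumptions that go beyond the text: `http_n` counts occurrences of the substring "http" (so "https" is counted too), `is_ip` checks only whether the host is a dotted IPv4 address, keyword matching is case-insensitive, and the `has_<word>` key names are hypothetical labels for F₁₉ to F₂₅.

```python
import re
from urllib.parse import urlsplit

TOKENS = "_-&#?"  # separator characters named in the feature list
KEYWORDS = ["secure", "account", "webscr", "login",
            "signin", "banking", "confirm"]


def count_chars(text: str, chars: str) -> int:
    """Count how many characters of `text` belong to `chars`."""
    return sum(text.count(c) for c in chars)


def extract_url_features(url: str) -> dict:
    """Compute an illustrative subset of the F1..F25 URL features."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    pathname = parts.path.lstrip("/")
    lowered = url.lower()
    features = {
        "url_len": len(url),                                   # F1
        "http_n": lowered.count("http"),                       # F2
        "is_ip": int(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)
                     is not None),                             # F4 (host only)
        "url_n_percent": url.count("%"),                       # F5
        "url_n_token": count_chars(url, TOKENS),               # F6
        "host_len": len(host),                                 # F7
        "host_n_dot": host.count("."),                         # F8
        "host_max_len": max(len(p) for p in host.split(".")),  # F10
        "n_subdir": pathname.count("/"),                       # F17
        "query_len": len(parts.query),                         # F18
    }
    for word in KEYWORDS:                                      # F19..F25
        features["has_" + word] = int(word in lowered)
    return features


example = extract_url_features("http://www.taobao.1242.59adfadss123.com/")
print(example["host_max_len"])
```

The remaining features (for example F₃, which needs a list of valid top-level domains) would follow the same pattern but depend on data the text does not specify.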

Then, in step S120, the first feature vector is generated from the first feature set.

a) Each feature in the first feature set is first quantized to obtain a feature value, and all feature values are assembled into one feature vector. Taking the 25 URL features above as an example, consider the following URL:

http://www.dyfdzx.com/js/?app=com-d3&Ref=http://jebvahnus.battle.net/d3/en/index

Extracting features F₁ to F₂₅ yields the feature values, which form a 25-dimensional feature vector.

b) The feature value of every dimension of this feature vector is then normalized to generate the first feature vector.

According to one embodiment of the invention, the feature value of every dimension of the feature vector is normalized to the interval [-1, 1] as follows:

$$F_i' = \frac{F_i - \bar{F}_i}{F_{i,\max} - F_{i,\min}}$$

where F_i is the feature value of the i-th dimension, $\bar{F}_i$ is the average of the i-th dimension feature values, F_{i,max} is the maximum of the i-th dimension feature values, and F_{i,min} is the minimum of the i-th dimension feature values.
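The normalization step can be illustrated with the sketch below. It assumes mean normalization, that is, each value minus the per-dimension average, divided by the per-dimension maximum minus minimum, which is the form consistent with the four quantities defined above and with the target interval [-1, 1]. The function name and the handling of a constant dimension (mapped to 0.0 to avoid dividing by zero) are assumptions made for this example.

```python
def mean_normalize(samples):
    """Normalize each dimension to [-1, 1] by (value - mean) / (max - min).

    `samples` is a list of equal-length feature vectors. The statistics
    are computed per dimension over all samples.
    """
    dims = len(samples[0])
    stats = []
    for i in range(dims):
        column = [row[i] for row in samples]
        mean = sum(column) / len(column)
        span = max(column) - min(column)
        stats.append((mean, span))
    return [
        [
            (row[i] - stats[i][0]) / stats[i][1] if stats[i][1] else 0.0
            for i in range(dims)
        ]
        for row in samples
    ]


raw = [[10, 1], [20, 1], [30, 1]]
print(mean_normalize(raw))
```

In practice the mean, maximum, and minimum would be computed on the training samples and reused unchanged when normalizing a webpage to be identified.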

Accordingly, after normalization the feature vector generated in step a) becomes:

Then, in step S130, the first feature vector obtained in step S120 is processed with the first feature model, and a first result is output to characterize whether the webpage to be identified contains malicious content.

According to an embodiment of the invention, the first feature vector is classified with the support vector machine (SVM) algorithm, and 0 or 1 is output as the first result. Specifically, a first result of 1 indicates that the webpage to be identified contains malicious content, and a first result of 0 indicates that it does not.

The support vector machine (SVM) is a machine learning method based on statistical learning theory. Its core idea is to find a hyperplane that separates the training data while keeping the margin on both sides of the hyperplane as large as possible. In other words, the SVM algorithm improves the generalization ability of the learner by seeking the minimum structural risk, minimizing both the empirical risk and the confidence interval, so that good statistical regularities can be obtained even when the number of samples is small. In theory it is a binary classifier, but it can be extended to a multi-class classifier. It should be noted that the characteristic models trained in the present invention (for example, the first feature model) are not restricted to this algorithm.
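To make the maximum-margin idea concrete, the following is a small pure-Python sketch of a linear SVM trained by subgradient descent on the regularized hinge loss. It is an illustration only: the text does not specify the kernel, the solver, or any hyperparameters, so the linear form, the learning rate, the regularization strength, the epoch count, and the two-dimensional toy data are all assumptions. A production system would more likely rely on an established SVM library.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Train a linear SVM; labels are 1 (malicious) or 0 (benign)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in zip(samples, labels):
            y = 1.0 if label == 1 else -1.0
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            # L2 regularization shrinks the weights on every step.
            weights = [w * (1.0 - lr * reg) for w in weights]
            # Hinge loss: update only when the sample is inside the margin.
            if y * score < 1.0:
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def predict(model, x):
    """Return 1 if x falls on the malicious side of the hyperplane, else 0."""
    weights, bias = model
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


# Toy, already-normalized two-dimensional vectors (hypothetical data).
train_x = [[1.0, 1.0], [0.8, 0.9], [-1.0, -1.0], [-0.9, -0.7]]
train_y = [1, 1, 0, 0]
model = train_linear_svm(train_x, train_y)
print([predict(model, x) for x in train_x])
```

The 0/1 output of `predict` corresponds to the first result described above.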

For example, for webpage A to be identified, the URL is:

http://ssol.iitk.ac.in/wp-content/onlineinformationnabaustralia/Informationsecureonline/login.php?NAB82515Reset-Online-Account7137

Its URL features are extracted to generate the feature vector:

The first feature vector is then obtained through normalization:

Inputting this first feature vector into the first feature model outputs a first result of 1, indicating that webpage A contains malicious content.

As another example, for webpage B to be identified, the URL is:

http://www.annyway.com/annyway/MMSC.84+M5d637b1e38d.0.html

Its URL features are extracted to generate the feature vector:

The first feature vector is then obtained through normalization:

Inputting this first feature vector into the first feature model outputs a first result of 0, indicating that webpage B does not contain malicious content.

According to an implementation of the invention, the identification method 100 further includes the step of training the first feature model:

(1) The URLs of a large number of webpages already labeled as not containing malicious content and as containing malicious content are chosen as sample data, and the operation of step S110 is executed on the sample data to obtain the first feature set of the sample data.

(2) As in step S120, the corresponding first feature vectors are generated from the first feature set of the sample data and used as training parameters.

(3) The training parameters from step (2) are used to train a machine learning algorithm (the support vector machine algorithm), yielding the original classification learning model SVM-Model, i.e. the first feature model.

According to an embodiment of the invention, to cope with the variability of malicious website attacks, the identification method 100 further includes the step of updating the first feature model online: the sample data are updated at predetermined times, steps (1) and (2) above are executed to generate the first feature vectors of the new sample data, and the updated first feature vectors are used for training, producing a new first feature model that replaces the old one.

Furthermore, because malicious links change frequently, the scheme can also update the algorithm that generates the first feature vector, for example by adding new URL features, deleting existing URL features, changing the dimension of the first feature vector, and so on.

In summary, the identification method 100 parses the URL of the webpage to be identified to extract the first feature set and inputs the corresponding first feature vector into the first feature model to obtain the feature space to which the webpage belongs. It then judges whether this feature space belongs to the feature space of webpages containing malicious content, and if so outputs 1 to indicate that the webpage contains malicious content. Method 100 requires neither manual inspection of URLs nor hand-written rules, which saves manpower. In addition, because the first feature model is updated regularly to track the variability of malicious websites, the method also mitigates the lag of existing identification methods.

Fig. 2 shows a flow chart of a method 200 for identifying whether a webpage contains malicious content according to another embodiment of the invention. As shown in Fig. 2, the identification method 200 includes the following steps:

In step S210, the content of the webpage to be identified is crawled, and the crawled content is segmented into a word sequence.

According to one embodiment of the invention, the webpage content is crawled with the scrapy framework, and the crawled content is then segmented with MMSEG to obtain a word sequence. MMSEG is a common dictionary-based segmentation algorithm for Chinese; it is simple and intuitive, not complicated to implement, and fast. Briefly, the algorithm consists of a "matching algorithm" and "disambiguation rules". The matching algorithm determines how the sentence to be segmented is matched against the words stored in the dictionary. The disambiguation rules decide which segmentation to use when a sentence can be split in more than one way. For example, the phrase "facility and service" (设施和服务) can be split as 设施/和服/务 ("facility / kimono / affairs") or as 设施/和/服务 ("facility / and / service"); choosing between the two results is the job of the disambiguation rules. MMSEG defines two matching algorithms: simple maximum matching and complex maximum matching. It defines four disambiguation rules: maximum matching (corresponding to the two matching algorithms above), largest average word length, smallest variance of word lengths, and largest sum of degree of morphemic freedom of one-character words (take the natural logarithm of the frequency of every one-character word in the phrase, add the values, and choose the phrase with the largest sum).
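The simple maximum matching part of MMSEG can be sketched as below. This is not a full MMSEG implementation: it omits complex maximum matching and the disambiguation rules, and the toy dictionary, the `max_word_len` limit, and the Chinese form of the example phrase (设施和服务, "facility and service") are assumptions made for illustration.

```python
def simple_max_match(text, dictionary, max_word_len=4):
    """Greedy forward segmentation: always take the longest dictionary word.

    A single character is accepted as a fallback when nothing longer
    matches, so the loop always advances.
    """
    words = []
    pos = 0
    while pos < len(text):
        longest = min(max_word_len, len(text) - pos)
        for size in range(longest, 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


toy_dictionary = {"设施", "和服", "服务", "和", "务"}
segments = simple_max_match("设施和服务", toy_dictionary)
print("/".join(segments))
```

On this input, greedy matching picks the "kimono" reading of the ambiguous phrase, which illustrates why MMSEG adds complex maximum matching and disambiguation rules on top of the simple matcher.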

Then, in step S220, a second feature vector whose dimension is a first predetermined number is constructed according to whether the feature words of the second feature set are present in the word sequence, where the second feature set stores the first predetermined number of feature words in advance.

First, according to one embodiment of the invention, the second feature set is generated as follows: the content of preset webpages is obtained and segmented into a word sequence; for each word in the word sequence, a second feature value characterizing the importance of the word is computed; and the first predetermined number (for example, 500) of words are chosen as feature words in descending order of second feature value to form the second feature set.

The second feature value is defined as the distance between the probability distribution of whether a webpage contains malicious content given that a certain word appears and the overall probability distribution of whether a webpage contains malicious content, i.e. the expected cross entropy (ECE) of the word. In general, the larger the expected cross entropy of a word w, the stronger its ability to distinguish samples. The expected cross entropy is computed as follows:

$$ECE(w) = P(phish \mid w)\,\log\frac{P(phish \mid w)}{P(phish)} + P(nophish \mid w)\,\log\frac{P(nophish \mid w)}{P(nophish)}$$

where P(phish | w) is the probability that the webpage is a phishing webpage given that word w appears, P(phish) is the probability of a phishing webpage, P(nophish | w) is the probability that the webpage is not a phishing webpage given that word w appears, and P(nophish) is the probability of a non-phishing webpage.
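The selection of feature words can be sketched as follows. The sketch assumes that the expected cross entropy of a word is the sum, over the two classes, of P(class | w) · log(P(class | w) / P(class)), which is one form consistent with the four probabilities defined above; the probabilities are estimated by simple counting over labeled documents. The function names and the toy documents are hypothetical.

```python
import math


def expected_cross_entropy(docs, labels, word):
    """Score `word` by how far P(class | word) departs from P(class).

    `docs` is a list of word sets, `labels` holds 1 (phishing) or
    0 (not phishing) for each document.
    """
    with_word = [label for doc, label in zip(docs, labels) if word in doc]
    if not with_word:
        return 0.0
    score = 0.0
    for cls in (1, 0):
        prior = labels.count(cls) / len(labels)
        posterior = with_word.count(cls) / len(with_word)
        if prior > 0 and posterior > 0:  # treat 0 * log(0) as 0
            score += posterior * math.log(posterior / prior)
    return score


def select_feature_words(docs, labels, count):
    """Keep the `count` words with the highest expected cross entropy."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary,
                    key=lambda w: expected_cross_entropy(docs, labels, w),
                    reverse=True)
    return ranked[:count]


toy_docs = [{"login", "the"}, {"login"}, {"news", "the"}, {"weather"}]
toy_labels = [1, 1, 0, 0]
print(expected_cross_entropy(toy_docs, toy_labels, "login"))
print(expected_cross_entropy(toy_docs, toy_labels, "the"))
```

A word that appears equally often in both classes scores zero, while a word seen only in one class scores highest, so the ranking favors words that separate the two kinds of webpage.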

Next, constructing the second feature vector according to whether the feature words are present in the word sequence includes the following steps:

1. For each feature word in the second feature set, in order, search the word sequence for that word:

If the word is present in the word sequence, the value at the word's position in the second feature set is set to 1;

If the word is not present in the word sequence, the value at the word's position in the second feature set is set to 0.

2. The second feature vector, whose dimension is the first predetermined number, is generated from the values assigned at the feature word positions. For example, if N words are chosen as feature words (according to an embodiment of the invention, N is generally between 450 and 550), the second feature vector can be expressed as:
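The two steps above amount to a binary presence vector over the feature words, which the following sketch illustrates. The function name and the three example feature words are hypothetical; a real second feature set would hold the first predetermined number of words.

```python
def build_second_feature_vector(word_sequence, feature_words):
    """Return one 0/1 entry per feature word, in feature-set order."""
    present = set(word_sequence)  # membership test in constant time
    return [1 if word in present else 0 for word in feature_words]


feature_words = ["login", "password", "bank"]
word_sequence = ["please", "login", "to", "your", "bank"]
print(build_second_feature_vector(word_sequence, feature_words))
```

The vector length always equals the number of feature words, regardless of how long the webpage content is.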

Then, in step S230, the second feature vector generated in step S220 is processed with the second feature model, and a second result is output to characterize whether the webpage to be identified contains malicious content. According to an embodiment of the invention, a second result of 1 indicates that the webpage to be identified contains malicious content, and a second result of 0 indicates that it does not.

As in identification method 100, the identification method 200 also includes the step of training the second feature model:

(1) The content of a large number of webpages already labeled as containing malicious content and as not containing malicious content is chosen as sample data and, as in step S210, the crawled webpage content is segmented into word sequences.

(2) The operation of step S220 is executed with the feature words of the second feature set to generate the second feature vectors of the sample webpage content, which are used as training parameters.

(3) The training parameters from step (2) are used to train a machine learning algorithm (the support vector machine method), yielding the original classification learning model SVM-Model, i.e. the second feature model.

Similarly, the identification method 200 further includes the step of updating the second feature model online: the sample data above are updated at predetermined times and training steps (2) and (3) are repeated to generate a new second feature model that replaces the original one.

In summary, the identification method 200 differs from traditional keyword scanning of webpage content, which merely applies a weighted score to each keyword. Instead, it vectorizes the crawled webpage content and then classifies the webpage automatically with a machine learning algorithm, which improves the accuracy of webpage identification.

In general, the topology of a malicious website is simple, and the domain names of its external links are inconsistent with its own domain name. On this basis, the invention provides another method for identifying whether a webpage contains malicious content. As shown in Fig. 3, the identification method 300 mainly judges whether the webpage contains malicious content from the number of external links of the webpage to be identified and from the webpage identity.

The method 300 starts at step S310, in which the first identity information of the webpage to be identified is extracted from its URL. Specifically, the URL of the webpage to be identified is first parsed to obtain its domain name, and this domain name is then used as the first identity information of the webpage. For example, suppose the URL of the webpage to be identified is:

http://likersgames.netne.net/

Parsing the URL gives the domain name netne.net, so the first identity information of the webpage to be identified is netne.net.

Then, in step S320, all external links of the webpage to be identified are extracted.

Informally, an external link is a link that leads into one's own website from another website. The HTML page can be fetched via the URL and all of its external links extracted; the invention places no restriction on how the external links are extracted.

Then, in step S330, the second identity information of the webpage to be identified is determined from all the extracted external links. According to one embodiment of the invention, the number of occurrences of each external link of the webpage is counted, and the domain name of the external link that occurs most often is used as the second identity information of the webpage. Taking the URL from step S310 as the example again, the extracted external links and their counts are:

000webhost.com: 16

serviceuptime.com: 1

hosting24.com: 5

The second identity information of the webpage to be identified is therefore 000webhost.com.

In step S340, the first identity information (obtained in step S310) is compared with the second identity information (obtained in step S330), and a third result is output to characterize whether the webpage to be identified contains malicious content.

For the URL above, the first identity information (netne.net) and the second identity information (000webhost.com) do not match, so the third result is 1, indicating that the webpage to be identified contains malicious content. Conversely, if the second identity information matches the first identity information, the third result is 0, indicating that the webpage to be identified does not contain malicious content.

As another example, suppose the URL of the webpage to be identified is:

http://www.baidu.com

Parsing the URL yields the first identity information: baidu.com.

The external links it contains and their counts are:

bdstatic.com:5

hao123.com:2

baidu.com:27

The second identity information is therefore baidu.com.

The second identity information is identical to the first identity information, so the third result is output as 0 and the webpage to be identified is judged not to contain malicious content.
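The comparison in step S340 reduces to an equality check on the two domain names. A minimal sketch, with a hypothetical function name and with case-insensitive comparison as an added assumption:

```python
def third_result(first_identity: str, second_identity: str) -> int:
    """Return 1 when the two identities differ (suspected imitation), else 0."""
    return 0 if first_identity.lower() == second_identity.lower() else 1


print(third_result("netne.net", "000webhost.com"))  # 1
print(third_result("baidu.com", "baidu.com"))       # 0
```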

In summary, recognition methods 100, 200 and 300 describe three ways of identifying malicious webpages (webpages containing malicious content). Recognition method 100 parses the URL of a webpage, extracts URL features and classifies the webpage with a machine learning model. Recognition method 200 crawls the webpage content, vectorizes it according to preset feature words and classifies the webpage with a machine learning model. Recognition method 300 analyzes the identity of the webpage in order to identify malicious webpages suspected of imitation. These three methods identify whether a webpage contains malicious content from different angles. According to one embodiment of the present invention, the three recognition methods can be combined to analyze comprehensively whether the webpage to be identified contains malicious content; this is recognition method 400.

The flowchart of recognition method 400 is shown in Fig. 4. As described above, recognition method 400 builds on traditional blacklist and whitelist filtering and jointly considers the URL features and content features of a webpage. It also takes into account the imitation and camouflage techniques commonly used by malicious websites, analyzing webpage identity to identify malicious webpages suspected of imitation. In implementation, machine learning models are used to classify webpages. This addresses the lag inherent in traditional recognition methods and gives the method some ability to detect unknown malicious webpages, improving recognition accuracy.

Specifically, the steps of recognition method 400 are as follows:

In step S410, recognition method 100 shown in Fig. 1 is executed to output the first result.

In step S420, recognition method 200 shown in Fig. 2 is executed to output the second result.

In step S430, recognition method 300 shown in Fig. 3 is executed to output the third result.

Then, in step S440, a weighting algorithm is applied to the first result, the second result and the third result to obtain a final result, which is judged as follows:

If the final result is greater than a threshold (0.5 in the present embodiment), the webpage to be identified is determined to contain malicious content;

If the final result is not greater than the threshold, the webpage to be identified is determined not to contain malicious content.

According to one embodiment of the present invention, a simple weighting algorithm may be applied to the first result (r₁), the second result (r₂) and the third result (r₃) to obtain the final result (r):

r = w₁×r₁ + w₂×r₂ + w₃×r₃

where w₁, w₂ and w₃ are the weights of the first result, the second result and the third result respectively and, according to one embodiment of the present invention, take the values 0.4, 0.4 and 0.2.
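The weighted decision of step S440 can be sketched as below, using the weights and threshold stated in the text. The function and constant names are hypothetical.

```python
WEIGHTS = (0.4, 0.4, 0.2)  # w1, w2, w3 from the embodiment
THRESHOLD = 0.5            # threshold from the embodiment


def final_result(r1: int, r2: int, r3: int) -> float:
    """Weighted sum r = w1*r1 + w2*r2 + w3*r3 of the three 0/1 results."""
    w1, w2, w3 = WEIGHTS
    return w1 * r1 + w2 * r2 + w3 * r3


def is_malicious(r1: int, r2: int, r3: int) -> bool:
    """True when the final result is strictly greater than the threshold."""
    return final_result(r1, r2, r3) > THRESHOLD
```

With these particular values, a page is flagged exactly when at least two of the three methods output 1: any single result contributes at most 0.4, while any pair contributes at least 0.6.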

Correspondingly, Fig. 5 to Fig. 8 show recognition systems according to embodiments of the present invention that implement the four recognition methods above; each is described in turn below.

Fig. 5 is a schematic diagram of a system 500 for identifying whether a webpage contains malicious content according to one embodiment of the present invention. The system 500 includes at least a URL extractor 510, a first feature extractor 520 and a first recognition unit 530.

According to one implementation, the system 500 further includes a judgment filter unit 540, adapted to judge whether the URL of the webpage to be identified matches a URL in a pre-stored database:

If the URL of the webpage to be identified is in the first pre-stored database (that is, the blacklist), the webpage to be identified is judged to contain malicious content; and

If the URL of the webpage to be identified is in the second pre-stored database (that is, the whitelist), the webpage to be identified is judged not to contain malicious content.

A URL that is not recognized by either the blacklist or the whitelist is sent to the URL extractor 510.
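The behavior of the judgment filter unit 540 can be sketched as a lookup that either decides immediately or defers to the later stages. The function name `prefilter`, the use of in-memory sets and exact-string URL matching are all assumptions; the text only says the URL is compared with those in the pre-stored databases.

```python
from typing import Optional, Set


def prefilter(url: str, blacklist: Set[str], whitelist: Set[str]) -> Optional[int]:
    """Return 1 (malicious) or 0 (not malicious) for listed URLs.

    Returns None when the URL is on neither list, meaning it should be
    passed on to the URL extractor for feature-based recognition.
    """
    if url in blacklist:
        return 1
    if url in whitelist:
        return 0
    return None
```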

The URL extractor 510 is adapted to parse the URL of the webpage to be identified.

The first feature extractor 520 is adapted to extract URL features from the parsed URL to generate a first feature set. According to one embodiment of the present invention, the first feature set includes one or more of the following: URL length, number of times the http protocol is used, whether the top-level domain is legal, whether an IP address is included, number of designated characters in the URL, host string length, number of designated characters in the host string, length of the longest string in the host string, number of designated characters in the path, path name length, number of designated characters in the path name, length of the longest string in the path name, path depth, query parameter field length, and whether the URL contains a designated string. For a detailed description of each feature, refer to the description based on Fig. 1.
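A few of the listed URL features can be computed as in the sketch below. It covers only a subset of the feature set, the names are hypothetical, and `SPECIAL_CHARS` is a placeholder: the actual set of designated characters is defined in the description based on Fig. 1, not here.

```python
import ipaddress
from urllib.parse import urlparse

SPECIAL_CHARS = "-_@%"  # hypothetical stand-in for the designated characters


def host_is_ip(host: str) -> bool:
    """True when the host part is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> dict:
    """Compute a subset of the URL features named in the text."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    segments = [s for s in parsed.path.split("/") if s]
    return {
        "url_length": len(url),
        "http_count": url.lower().count("http"),
        "host_is_ip": int(host_is_ip(host)),
        "special_chars_in_url": sum(url.count(c) for c in SPECIAL_CHARS),
        "host_length": len(host),
        "path_depth": len(segments),
        "query_length": len(parsed.query),
    }
```

Each value is already numeric, so the dictionary values can be placed in a fixed order to form the feature vector that is normalized next.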

The first feature extractor 520 is further adapted to generate a first feature vector according to the first feature set. According to one embodiment of the present invention, the first feature extractor 520 includes a numericalization subunit 522 and a normalization subunit 524.

The numericalization subunit 522 is adapted to convert each feature in the first feature set into a numeric feature value and to form the feature values into a feature vector.

The normalization subunit 524 is adapted to normalize the feature value of each dimension of the numericalized feature vector to generate the first feature vector. For example, the normalization subunit 524 is configured to normalize the feature value of each dimension of the feature vector to the range [-1, 1].
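One common way to map each dimension into [-1, 1] is linear min-max scaling over the sample set. The sketch below uses that technique as an assumption; the exact normalization formula of the embodiment does not appear in this passage, and the function name is hypothetical.

```python
def normalize_columns(vectors):
    """Linearly scale each dimension of the vectors into [-1, 1].

    Assumption: min-max scaling, x' = 2 * (x - min) / (max - min) - 1,
    computed per dimension over all given vectors.
    """
    if not vectors:
        return []
    dims = len(vectors[0])
    lows = [min(v[i] for v in vectors) for i in range(dims)]
    highs = [max(v[i] for v in vectors) for i in range(dims)]
    scaled = []
    for v in vectors:
        row = []
        for x, lo, hi in zip(v, lows, highs):
            # A constant dimension carries no information; map it to 0.
            row.append(0.0 if hi == lo else 2 * (x - lo) / (hi - lo) - 1)
        scaled.append(row)
    return scaled
```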

The first recognition unit 530 is adapted to process the first feature vector using a first feature model and to output a first result indicating whether the webpage to be identified contains malicious content. If the output first result is 1, the webpage to be identified contains malicious content; if the output first result is 0, the webpage to be identified does not contain malicious content.

According to an embodiment of the present invention, the system 500 is further configured to perform the operation of training the first feature model.

In this case, the URL extractor 510 is further adapted to extract, as sample data, the URLs of a large number of webpages that have been labeled as not containing malicious content or as containing malicious content. The first feature extractor 520 is further adapted to form a first feature set from these URLs and to generate the corresponding first feature vectors from the first feature set as training parameters. In addition, the system 500 further includes a first training unit 550 coupled to the first feature extractor 520, adapted to train on the training parameters extracted by the first feature extractor 520 using a machine learning algorithm (for example, the support vector machine method, SVM) to obtain the first feature model.

In the present embodiment, in order to cope with the variability of malicious website attacks, the system 500 may further include a first updating unit 560, adapted to update the sample data within a predetermined time, generate the first feature vectors of the new sample data and input the updated first feature vectors into the first feature model for training, so as to update the first feature model regularly.

Furthermore, the first updating unit 560 is further adapted to change the dimension of the first feature vector by adding or deleting features in the first feature set, so as to generate a new first feature vector.

Fig. 6 is a schematic diagram of a system 600 for identifying whether a webpage contains malicious content according to another embodiment of the present invention. The system 600 includes at least a page analyzer 610, a second feature extractor 620 and a second recognition unit 630.

The page analyzer 610 is adapted to crawl the content of the webpage to be identified and to perform word segmentation on the crawled content to obtain a word sequence. According to one implementation, the page analyzer 610 includes a segmenter adapted to perform word segmentation on the webpage content using a dictionary-based segmentation algorithm. The segmentation algorithm may be the MMSEG algorithm, which comprises a dictionary, two matching algorithms and four disambiguation rules.
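To illustrate what dictionary-based segmentation means, the sketch below implements plain forward maximum matching. It is deliberately simplified: it is not the full MMSEG algorithm, since it omits MMSEG's chunk-based complex matching and its four disambiguation rules. The function name, the toy dictionary and the `max_len` parameter are hypothetical.

```python
def forward_max_match(text: str, dictionary: set, max_len: int = 4) -> list:
    """Segment text by repeatedly taking the longest dictionary word.

    Falls back to a single character when no dictionary word matches,
    so the loop always advances.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary: "free", "claim", "prize".
vocab = {"免费", "领取", "奖品"}
print(forward_max_match("免费领取奖品", vocab))  # ['免费', '领取', '奖品']
```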

The page analyzer 610 is further adapted to obtain the content of preset webpages and to perform word segmentation on the obtained content to obtain a word sequence.

The second feature extractor 620 is adapted to construct, according to whether the feature words of a second feature set are present in the word sequence, a second feature vector whose dimension is a first predetermined number (for example, a first predetermined number chosen between 450 and 550), wherein the first predetermined number of feature words are pre-stored in the second feature set.

According to this implementation, the second feature extractor 620 further includes a matching subunit 622. The matching subunit 622 is adapted, for each feature word in the second feature set, to search in order whether that feature word is present in the word sequence:

If a feature word is matched in the word sequence, the value at the position corresponding to that feature word in the second feature set is set to 1;

If a feature word is not matched in the word sequence, the value at the position corresponding to that feature word in the second feature set is set to 0.

The second feature extractor 620 is further adapted to generate, from the values assigned at the positions of the feature words, the second feature vector whose dimension is the first predetermined number.
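The matching performed by the matching subunit 622 produces a binary presence vector, which can be sketched as follows. The function name and the English feature words are hypothetical; a real second feature set would hold the first predetermined number of words (for example, about 500).

```python
def second_feature_vector(word_sequence, feature_words):
    """Return a 0/1 vector with one entry per feature word, in order.

    An entry is 1 when the feature word occurs in the word sequence.
    """
    present = set(word_sequence)
    return [1 if word in present else 0 for word in feature_words]


features = ["prize", "password", "bank", "free"]
page_words = ["claim", "your", "free", "prize", "free"]
print(second_feature_vector(page_words, features))  # [1, 0, 0, 1]
```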

The system 600 further includes a feature set generation unit 640, adapted, for each word in the word sequence, to compute a second feature value characterizing the importance of the word, and to select the first predetermined number of words as feature words in descending order of second feature value to form the second feature set. The second feature value is defined as the distance between the probability distribution of whether a webpage contains malicious content given that a certain word occurs and the probability distribution of whether a webpage contains malicious content. It can be expressed by the expected cross entropy of the word.
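The feature-word selection can be sketched with the form of expected cross entropy commonly used for text feature selection, CE(w) = P(w) · Σ_c P(c|w) · log(P(c|w) / P(c)), where c ranges over the two classes (contains malicious content or not). Using this particular form is an assumption, since the embodiment's formula does not appear in this passage; the function names and the toy data are hypothetical.

```python
import math


def expected_cross_entropy(word, documents):
    """Expected cross entropy of a word over labeled documents.

    documents: list of (set_of_words, label) pairs with label 0 or 1.
    Assumed form: CE(w) = P(w) * sum_c P(c|w) * log(P(c|w) / P(c)).
    """
    total = len(documents)
    with_word = [label for words, label in documents if word in words]
    if total == 0 or not with_word:
        return 0.0
    p_word = len(with_word) / total
    score = 0.0
    for c in (0, 1):
        p_c = sum(1 for _, label in documents if label == c) / total
        p_c_given_w = with_word.count(c) / len(with_word)
        if p_c > 0 and p_c_given_w > 0:
            score += p_c_given_w * math.log(p_c_given_w / p_c)
    return p_word * score


def select_feature_words(documents, k):
    """Pick the k words with the highest score (ties broken alphabetically)."""
    vocab = set().union(*(words for words, _ in documents))
    ranked = sorted(vocab, key=lambda w: (-expected_cross_entropy(w, documents), w))
    return ranked[:k]
```

A word that appears equally often in both classes scores 0, while a word concentrated in one class scores higher, which is why sorting in descending order keeps the most discriminative words.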

The second recognition unit 630 is adapted to process the second feature vector using a second feature model and to output a second result indicating whether the webpage to be identified contains malicious content. If the output second result is 1, the webpage to be identified contains malicious content; if the output second result is 0, the webpage to be identified does not contain malicious content.

Like system 500, the system 600 is also configured to perform the operation of training the second feature model. In this case, the page analyzer 610 is further adapted to crawl, as sample data, the content of a large number of webpages that have been labeled as not containing malicious content or as containing malicious content. The second feature extractor 620 is further adapted to generate, according to the feature words in the second feature set, the second feature vectors of the webpage content serving as sample data, as training parameters. In addition, the system 600 further includes a second training unit 650, adapted to train on the training parameters using a machine learning algorithm to obtain the second feature model.

Furthermore, in order to cope with the variability of malicious website attacks, the system 600 further includes a second updating unit 660, adapted to update the sample data within a predetermined time and repeat the training step, so as to update the second feature model regularly.

Fig. 7 is a schematic diagram of a system 700 for identifying whether a webpage contains malicious content according to yet another embodiment of the present invention. The system 700 includes a first information acquisition unit 710, a second information acquisition unit 720 and a third recognition unit 730.

The first information acquisition unit 710 is adapted to extract the first identity information of the webpage to be identified from its URL. Specifically, the first information acquisition unit 710 is adapted to parse the URL of the webpage to be identified, obtain the domain name of the webpage and use that domain name as the first identity information of the webpage.

The second information acquisition unit 720 is adapted to extract all external links of the webpage to be identified and to determine the second identity information of the webpage from the external links. According to one implementation, the second information acquisition unit 720 may include a statistics subunit 722, adapted to count the number of occurrences of each extracted external link of the webpage to be identified; the second information acquisition unit 720 is adapted to select the domain name of the most frequent external link as the second identity information. For example, for the URL http://www.baidu.com, the extracted external links include bdstatic.com (5 occurrences) and baidu.com (27 occurrences), so baidu.com is determined to be the second identity information of the URL.

The third recognition unit 730 is adapted to compare the first identity information with the second identity information and to output a third result indicating whether the webpage to be identified contains malicious content. Specifically, if the second identity information does not match the first identity information, the third result is output as 1, indicating that the webpage to be identified contains malicious content; if the second identity information matches the first identity information, the third result is output as 0, indicating that the webpage to be identified does not contain malicious content.

Fig. 8 is a schematic diagram of a system 800 for identifying whether a webpage contains malicious content according to yet another embodiment of the present invention. The system 800 combines the system 500, system 600 and system 700 described above with a weighting unit 810 and a fourth recognition unit 820.

The recognition system 500 is adapted to output the first result;

The recognition system 600 is adapted to output the second result;

The recognition system 700 is adapted to output the third result;

The weighting unit 810 is adapted to apply a weighting algorithm to the first result, the second result and the third result to obtain the final result:

r = w₁×r₁ + w₂×r₂ + w₃×r₃

The fourth recognition unit 820 is adapted to identify the webpage to be identified as containing malicious content if the final result is greater than a threshold (for example, 0.5), and as not containing malicious content if the final result is not greater than the threshold.

The recognition system 800 builds on traditional blacklist and whitelist filtering and jointly considers the URL features and content features of a webpage. It also takes into account the imitation and camouflage commonly used by malicious websites, analyzing webpage identity to identify malicious webpages suspected of imitation. In implementation, machine learning models are used to classify webpages, which both addresses the lag inherent in traditional recognition methods and provides some ability to detect unknown malicious webpages, thereby improving recognition accuracy.

It should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art should understand that the modules, units or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may furthermore be divided into multiple sub-modules.

Those skilled in the art will understand that the modules in the devices of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may furthermore be divided into multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.

A3. The method of A1 or A2, wherein the first feature set includes one or more of the following: URL length, number of times the http protocol is used, whether the top-level domain is legal, whether an IP address is included, number of designated characters in the URL, host string length, number of designated characters in the host string, length of the longest string in the host string, number of designated characters in the path, path name length, number of designated characters in the path name, length of the longest string in the path name, path depth, query parameter field length, and whether the URL contains a designated string.

A4. The method of any one of A1-A3, wherein the step of generating the first feature vector according to the first feature set further comprises: converting each feature in the first feature set into a numeric feature value and forming the feature values into a feature vector; and normalizing the feature value of each dimension of the feature vector to generate the first feature vector.

A5. The method of A4, wherein the normalization step comprises: normalizing the feature value of each dimension of the feature vector to the range [-1, 1].

A6. The method of any one of A1-A5, further comprising the step of training the first feature model: selecting, as sample data, the URLs of a large number of webpages that have been labeled as not containing malicious content or as containing malicious content, and forming a first feature set from these URLs; generating the corresponding first feature vectors from the first feature set of the sample data as training parameters; and training on the training parameters using a machine learning algorithm to obtain the first feature model.

A7. The method of A6, further comprising the steps of: updating the sample data within a predetermined time and generating the first feature vectors of the new sample data; and inputting the updated first feature vectors into the first feature model for training, so as to update the first feature model regularly.

A8. The method of A7, wherein the step of generating the first feature vectors of the new sample data further comprises: changing the dimension of the first feature vector by adding or deleting features in the first feature set.

A9. The method of any one of A1-A8, wherein the step of outputting the first result to indicate whether the webpage to be identified contains malicious content comprises: outputting a first result of 1 to indicate that the webpage to be identified contains malicious content; and outputting a first result of 0 to indicate that the webpage to be identified does not contain malicious content.

A10. The method of any one of A6-A9, wherein the machine learning algorithm is the support vector machine method.

B13. The method of B11 or B12, wherein the step of constructing the second feature vector according to whether feature words are present in the word sequence comprises: for each feature word in the second feature set, searching in order whether that feature word is present in the word sequence; if a feature word is present in the word sequence, setting the value at the position corresponding to that feature word in the second feature set to 1; if a feature word is not present in the word sequence, setting the value at the position corresponding to that feature word in the second feature set to 0; and generating, from the values assigned at the positions of the feature words, the second feature vector whose dimension is the first predetermined number.

B14. The method of any one of B11-B13, wherein the second feature set is generated by the following steps: obtaining the content of preset webpages and performing word segmentation on the obtained content to obtain a word sequence; for each word in the word sequence, computing a second feature value characterizing the importance of the word; and selecting the first predetermined number of words as feature words according to the second feature value to form the second feature set.

B15. The method of B14, wherein the second feature value is defined as the distance between the probability distribution of whether a webpage contains malicious content given that a certain word occurs and the probability distribution of whether a webpage contains malicious content.

B16. The method of B15, wherein the second feature value is the expected cross entropy CE(w) of the word w.

B17. The method of any one of B14-B16, wherein the step of selecting the first predetermined number of words according to the second feature value to form the second feature set comprises: selecting the first predetermined number of words as feature words in descending order of second feature value to constitute the second feature set.

B18. The method of any one of B11-B17, further comprising the step of training the second feature model: selecting, as sample data, the content of a large number of webpages that have been labeled as containing malicious content or as not containing malicious content; generating, according to the feature words in the second feature set, the second feature vectors of the webpage content serving as sample data, as training parameters; and training on the training parameters using a machine learning algorithm to obtain the second feature model.

B19. The method of B18, further comprising the step of: updating the sample data within a predetermined time and repeating the training step, so as to update the second feature model regularly.

B20. The method of any one of B11-B19, wherein the first predetermined number is between 450 and 550.

B21. The method of any one of B11-B20, wherein the step of outputting the second result to indicate whether the webpage to be identified contains malicious content comprises: outputting a second result of 1 to indicate that the webpage to be identified contains malicious content; and outputting a second result of 0 to indicate that the webpage to be identified does not contain malicious content.

B22. The method of any one of B18-B21, wherein the machine learning algorithm is the support vector machine method.

C24. The method of C23, wherein the step of extracting the first identity information of the webpage to be identified comprises: parsing the URL of the webpage to be identified to obtain the domain name of the webpage; and using the domain name as the first identity information of the webpage to be identified.

C25. The method of C23 or C24, wherein the step of determining the second identity information according to the external links comprises: counting the number of occurrences of each external link of the webpage to be identified; and selecting the domain name of the most frequent external link as the second identity information.

C26. The method of any one of C23-C25, wherein the step of comparing the first identity information with the second identity information and outputting the third result comprises: if the second identity information does not match the first identity information, outputting a third result of 1, indicating that the webpage to be identified contains malicious content; and if the second identity information matches the first identity information, outputting a third result of 0, indicating that the webpage to be identified does not contain malicious content.

D28. The method of D27, wherein the weight factors corresponding to the first result, the second result and the third result are 0.4, 0.4 and 0.2 respectively, and the threshold is 0.5.

E30. The system of E29, further comprising: a judgment filter unit, adapted to judge whether the URL of the webpage to be identified matches a URL in a pre-stored database; if the URL of the webpage to be identified is in the first pre-stored database, the webpage to be identified is judged to contain malicious content; and if the URL of the webpage to be identified is in the second pre-stored database, the webpage to be identified is judged not to contain malicious content.

E31. The system of E29 or E30, wherein the first feature set includes one or more of the following: URL length, number of times the http protocol is used, whether the top-level domain is legal, whether an IP address is included, number of designated characters in the URL, host string length, number of designated characters in the host string, length of the longest string in the host string, number of designated characters in the path, path name length, number of designated characters in the path name, length of the longest string in the path name, path depth, query parameter field length, and whether the URL contains a designated string.

E32. The system of any one of E29-E31, wherein the first feature extractor comprises: a numericalization subunit, adapted to convert each feature in the first feature set into a numeric feature value and to form the feature values into a feature vector; and a normalization subunit, adapted to normalize the feature value of each dimension of the numericalized feature vector to generate the first feature vector.

E33. The system of E32, wherein the normalization subunit is configured to normalize the feature value of each dimension of the feature vector to the range [-1, 1].

E34. The system of any one of E29-E33, wherein the URL extractor is further adapted to extract, as sample data, the URLs of a large number of webpages that have been labeled as not containing malicious content or as containing malicious content; the first feature extractor is further adapted to form a first feature set from these URLs and to generate the corresponding first feature vectors from the first feature set as training parameters; and the system further comprises a first training unit, adapted to train on the training parameters using a machine learning algorithm to obtain the first feature model.

E35. The system of E34, further comprising: a first updating unit, adapted to update the sample data within a predetermined time, generate the first feature vectors of the new sample data and input the updated first feature vectors into the first feature model for training, so as to update the first feature model regularly.

E36. The system of E35, wherein the first updating unit is further adapted to change the dimension of the first feature vector by adding or deleting features in the first feature set, so as to generate a new first feature vector.

E37. The system of any one of E29-E36, wherein an output first result of 1 indicates that the webpage to be identified contains malicious content, and an output first result of 0 indicates that the webpage to be identified does not contain malicious content.

E38. The system of any one of E34-E37, wherein the machine learning algorithm is the support vector machine method.

F40. The system of F39, wherein the page analyzer further comprises: a segmenter, adapted to perform word segmentation on the webpage content using a dictionary-based segmentation algorithm, wherein the segmentation algorithm comprises a dictionary, two matching algorithms and four disambiguation rules.

F41. The system of F39 or F40, wherein the second feature extractor comprises: a matching subunit, adapted, for each feature word in the second feature set, to search in order whether that feature word is present in the word sequence; if a feature word is matched in the word sequence, the value at the position corresponding to that feature word in the second feature set is set to 1; if a feature word is not matched in the word sequence, the value at the position corresponding to that feature word in the second feature set is set to 0; and the second feature extractor is further adapted to generate, from the values assigned at the positions of the feature words, the second feature vector whose dimension is the first predetermined number.

F42. The system of any one of F39-F41, wherein the page analyzer is further adapted to obtain the content of preset webpages and to perform word segmentation on the obtained content to obtain a word sequence; and the system further comprises: a feature set generation unit, adapted, for each word in the word sequence, to compute a second feature value characterizing the importance of the word and to select the first predetermined number of words as feature words according to the second feature value to form the second feature set.

F43. The system of F42, wherein the second feature value is defined as the distance between the probability distribution of whether a webpage contains malicious content given that a certain word occurs and the probability distribution of whether a webpage contains malicious content.

F44. The system of F43, wherein the second feature value is the expected cross entropy CE(w) of the word w.

F45. The system of any one of F42-44, wherein the feature set generation unit is configured to select the first predetermined number of words as feature words in descending order of second feature value, forming the second feature set.

F46. The system of any one of F39-45, wherein the page analyzer is further adapted to crawl, as sample data, the webpage content of a large number of webpages that have been labeled as not containing malicious content and of webpages labeled as containing malicious content; the second feature extractor is further adapted to generate, from the feature words in the second feature set, the second feature vectors of the webpage content serving as sample data, as training parameters; and the system further comprises a second training unit adapted to train on the training parameters with a machine learning algorithm to obtain the second feature model.

F47. The system of F46, further comprising: a second updating unit adapted to update the sample data within a predetermined time and repeat the training step, so as to update the second feature model periodically.

F48. The system of any one of F39-47, wherein the first predetermined number is between 450 and 550.

F49. The system of any one of F39-48, wherein an output second result of 1 indicates that the webpage to be identified contains malicious content, and an output second result of 0 indicates that the webpage to be identified does not contain malicious content.

F50. The system of any one of F46-49, wherein the machine learning algorithm is a support vector machine method.

G52. The system of G51, wherein the first information acquiring unit is further adapted to parse the URL of the webpage to be identified, obtain the domain name of the webpage to be identified, and use the domain name as the first identity information of the webpage to be identified.

G53. The system of G51 or G52, wherein the second information acquiring unit further comprises: a counting subunit adapted to count the number of occurrences of each of the extracted outbound links of the webpage to be identified; and the second information acquiring unit is further adapted to select the domain name of the most frequently occurring outbound link as the second identity information.

G54. The system of any one of G51-53, wherein the third recognition unit is adapted to: output a third result of 1, indicating that the webpage to be identified contains malicious content, if the second identity information is inconsistent with the first identity information; and output a third result of 0, indicating that the webpage to be identified does not contain malicious content, if the second identity information is consistent with the first identity information.

H56. The system of H55, wherein the weighting factors corresponding to the first result, the second result and the third result are 0.4, 0.4 and 0.2, respectively; and the threshold is 0.5.

In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

Furthermore, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the function. A processor with the necessary instructions for carrying out such a method or method element therefore forms a means for carrying out the method or method element. Likewise, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.

As used herein, unless otherwise specified, the use of the ordinals "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.

Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention as described herein. It should also be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the invention is illustrative and not restrictive with respect to its scope, which is defined by the appended claims.

Claims

1. A method for identifying whether a webpage contains malicious content, the method comprising the steps of:

executing a first recognition method to output a first result, wherein the first recognition method comprises:

parsing the URL of a webpage to be identified and extracting URL features from the URL to generate a first feature set;

generating a first feature vector from the first feature set, wherein each feature in the first feature set is converted into a numerical feature value, the feature values are assembled into a feature vector, and each dimension of the feature vector is normalized to produce the first feature vector; and

processing the first feature vector with a first feature model and outputting the first result, which indicates whether the webpage to be identified contains malicious content;

executing a second recognition method to output a second result, wherein the second recognition method comprises:

crawling the content of the webpage to be identified and segmenting the crawled webpage content into a word sequence;

constructing a second feature vector, whose dimension is a first predetermined number, according to whether the feature words of a second feature set occur in the word sequence, wherein the second feature set stores the first predetermined number of feature words in advance;

processing the second feature vector with a second feature model and outputting the second result, which indicates whether the webpage to be identified contains malicious content;

executing a third recognition method to output a third result, wherein the third recognition method comprises:

extracting first identity information of the webpage to be identified from the URL of the webpage to be identified;

extracting all outbound links of the webpage to be identified;

determining second identity information of the webpage to be identified from the outbound links;

comparing the first identity information with the second identity information and outputting the third result, which indicates whether the webpage to be identified contains malicious content;

applying a weighting algorithm to the first result, the second result and the third result to obtain a final result;

if the final result is greater than a threshold, determining that the webpage to be identified contains malicious content; and

if the final result is not greater than the threshold, determining that the webpage to be identified does not contain malicious content.

2. The method of claim 1, wherein the weighting factors corresponding to the first result, the second result and the third result are 0.4, 0.4 and 0.2, respectively; and

the threshold is 0.5.
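The weighting and threshold steps of claims 1 and 2 can be illustrated with a minimal Python sketch. It assumes each of the three results is the binary value 1 (malicious) or 0 (not malicious), as the dependent claims state; the name `fuse_results` is hypothetical and does not appear in the patent.

```python
# Illustrative sketch only; fuse_results is a hypothetical name.
WEIGHTS = (0.4, 0.4, 0.2)  # weighting factors stated in claim 2
THRESHOLD = 0.5            # threshold stated in claim 2


def fuse_results(first, second, third, weights=WEIGHTS, threshold=THRESHOLD):
    """Return True if the weighted sum of the three binary results exceeds the threshold."""
    final = weights[0] * first + weights[1] * second + weights[2] * third
    return final > threshold
```

With these particular weights, a single positive result (0.4 or 0.2) never exceeds 0.5, so a page is judged malicious only when at least two of the three recognition methods flag it.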

3. The method of claim 2, further comprising a preprocessing step of:

extracting the URL of the webpage to be identified and determining whether it matches a URL in a pre-stored database;

if the URL of the webpage to be identified is in a first pre-stored database, determining that the webpage to be identified contains malicious content; and

if the URL of the webpage to be identified is in a second pre-stored database, determining that the webpage to be identified does not contain malicious content.
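The preprocessing step of claim 3 amounts to a lookup in two URL lists before any model is run. The sketch below is one possible reading, assuming the two pre-stored databases can be represented as in-memory sets; `prefilter`, `malicious_urls` and `benign_urls` are hypothetical names.

```python
# Illustrative sketch only; all names here are hypothetical.
def prefilter(url, malicious_urls, benign_urls):
    """Return 1 (malicious), 0 (not malicious), or None if the URL is in neither database."""
    if url in malicious_urls:   # "first pre-stored database"
        return 1
    if url in benign_urls:      # "second pre-stored database"
        return 0
    return None                 # unknown URL: fall through to the recognition methods
```

Returning `None` for an unknown URL is a design choice of this sketch; it signals that the three recognition methods still need to run.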

4. The method of claim 3, wherein the first feature set comprises one or more of the following: the URL length, the number of times the http protocol is used, whether the top-level domain is valid, whether an IP address is included, the number of designated characters in the URL, the host string length, the number of designated characters in the host string, the length of the longest string in the host string, the number of designated characters in the path, the path name length, the number of designated characters in the path name, the length of the longest string in the path name, the path depth, the query parameter field length, and whether the URL contains a designated string.
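A few of the URL features listed in claim 4 can be computed with the Python standard library, as in the following sketch. The claim does not say which characters are "designated", so `SPECIAL_CHARS` is an assumed placeholder, and the function and key names are hypothetical.

```python
# Illustrative sketch only; covers a subset of the features listed in claim 4.
import ipaddress
from urllib.parse import urlsplit

SPECIAL_CHARS = "-_@"  # assumption: the claim does not list the designated characters


def host_is_ip(host):
    """Return True if the host string is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url):
    """Extract a few numerical URL features into a dict."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path
    return {
        "url_length": len(url),
        "host_is_ip": int(host_is_ip(host)),
        "special_char_count": sum(url.count(c) for c in SPECIAL_CHARS),
        "host_length": len(host),
        "path_length": len(path),
        "path_depth": len([seg for seg in path.split("/") if seg]),
        "query_length": len(parts.query),
    }
```

For example, `url_features("http://192.168.0.1/a/b/login.php?id=1")` reports an IP-address host, a path depth of 3 and a query length of 4.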

5. The method of claim 4, wherein the normalization step comprises:

normalizing each dimension of the feature vector to the interval [-1, 1]:

where F_i is the value of the i-th feature dimension, F̄_i is the mean of the i-th dimension, F_i,max is the maximum of the i-th dimension, and F_i,min is the minimum of the i-th dimension.
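The normalization formula itself is not reproduced in this text. Given the symbols defined in claim 5 (value, mean, maximum and minimum of each dimension) and the target interval [-1, 1], one consistent reading is mean normalization, F_i' = (F_i - F̄_i) / (F_i,max - F_i,min). The sketch below implements that assumed form; `mean_normalize` is a hypothetical name, and mapping a constant dimension to 0.0 is an added safeguard not taken from the patent.

```python
# Illustrative sketch only; assumes mean normalization, which the source does not confirm.
def mean_normalize(vectors):
    """Mean-normalize each dimension (column) of a list of equal-length numeric vectors."""
    columns = list(zip(*vectors))
    normalized_columns = []
    for col in columns:
        mean = sum(col) / len(col)
        span = max(col) - min(col)
        # A constant dimension has zero span; map it to 0.0 to avoid dividing by zero.
        normalized_columns.append([(v - mean) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*normalized_columns)]
```

Because the mean lies between the minimum and the maximum, every normalized value falls within [-1, 1].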

6. The method of claim 5, further comprising the steps of training the first feature model:

selecting, as sample data, the URLs of a large number of webpages that have been labeled as not containing malicious content and of webpages labeled as containing malicious content, and forming the first feature set from these URLs;

generating the corresponding first feature vectors from the first feature set of the sample data, as training parameters; and

training on the training parameters with a machine learning algorithm to obtain the first feature model.

7. The method of claim 6, further comprising the steps of:

updating the sample data within a predetermined time and generating the first feature vectors of the new sample data; and

inputting the updated first feature vectors into the first feature model for training, so as to update the first feature model periodically.

8. The method of claim 7, wherein the step of generating the first feature vectors of the new sample data further comprises:

changing the dimension of the first feature vector by adding features to or deleting features from the first feature set.

9. The method of claim 8, wherein the step of outputting the first result to indicate whether the webpage to be identified contains malicious content comprises:

if the output first result is 1, indicating that the webpage to be identified contains malicious content; and

if the output first result is 0, indicating that the webpage to be identified does not contain malicious content.

10. The method of claim 9, wherein the machine learning algorithm is a support vector machine method.

11. The method of claim 10, wherein the step of segmenting the webpage content comprises:

performing word segmentation using a dictionary-based word segmentation algorithm, wherein the segmentation algorithm comprises one dictionary, two matching algorithms and four disambiguation rules.
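Claim 11 does not name the segmentation algorithm or spell out its matching algorithms and disambiguation rules. As a deliberately simplified stand-in, the sketch below shows plain forward maximum matching against a dictionary, which is one common dictionary-based matching strategy; it implements none of the four disambiguation rules. The function name, the toy dictionary and the `max_word_len` parameter are all assumptions.

```python
# Illustrative sketch only: forward maximum matching, not the patent's full algorithm.
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedily split text into the longest dictionary words, scanning left to right."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word is found;
        # a single character is always accepted so the scan cannot stall.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

With the toy dictionary `{"登录", "银行", "账户"}`, the input `"登录银行账户"` is split into the three dictionary words in order; characters not covered by the dictionary are emitted one at a time.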

12. The method of claim 11, wherein the step of constructing the second feature vector according to whether the feature words occur in the word sequence comprises:

for each feature word in the second feature set, searching the word sequence, in order, for that feature word;

if the feature word occurs in the word sequence, assigning 1 to the position corresponding to that feature word in the second feature set;

if the feature word does not occur in the word sequence, assigning 0 to the position corresponding to that feature word in the second feature set; and

generating, from the values assigned to the positions of the feature words, the second feature vector whose dimension is the first predetermined number.
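The construction in claim 12 yields a binary presence vector with one position per feature word. A minimal sketch, assuming the second feature set is an ordered list of words; the function name is hypothetical.

```python
# Illustrative sketch only; build_second_feature_vector is a hypothetical name.
def build_second_feature_vector(word_sequence, feature_words):
    """Return a 0/1 vector with one entry per feature word, in feature-set order."""
    present = set(word_sequence)  # set lookup avoids rescanning the sequence per word
    return [1 if word in present else 0 for word in feature_words]
```

For the feature words `["login", "password", "bank"]` and the word sequence `["please", "login", "to", "your", "bank"]`, the resulting vector is `[1, 0, 1]`; its length always equals the number of feature words.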

13. The method of claim 12, wherein the second feature set is generated by the following steps:

obtaining the webpage content of preset webpages and segmenting the obtained webpage content into a word sequence;

for each word in the word sequence, computing a second feature value characterizing the importance of that word; and

selecting the first predetermined number of words as feature words according to the second feature value, forming the second feature set.

14. The method of claim 13, wherein the second feature value is defined as the distance between the probability distribution of whether a webpage contains malicious content under the condition that a given word occurs and the probability distribution of whether a webpage contains malicious content.

15. The method of claim 14, wherein the second feature value is the expected cross entropy CE(w) of word w:

where P(phish | w) is the probability that the webpage to be identified is a phishing webpage given that word w occurs, P(phish) is the probability of a phishing webpage, P(nophish | w) is the probability that the webpage to be identified is not a phishing webpage given that word w occurs, and P(nophish) is the probability of a non-phishing webpage.

16. The method of claim 15, wherein the step of selecting the first predetermined number of words according to the second feature value to form the second feature set comprises:

selecting the first predetermined number of words as feature words in descending order of second feature value, forming the second feature set.
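The formula for CE(w) is not reproduced in this text. From the four probabilities defined in claim 15 and the "distance between distributions" wording of claim 14, a plausible form is CE(w) = P(phish | w) log(P(phish | w) / P(phish)) + P(nophish | w) log(P(nophish | w) / P(nophish)). Some textbook definitions of expected cross entropy also multiply by P(w), which the claim's symbol list does not mention. The sketch below implements the assumed form together with the descending-order selection of claim 16; all names are hypothetical.

```python
# Illustrative sketch only; the formula is an assumption, not quoted from the patent.
import math


def expected_cross_entropy(p_phish_given_w, p_phish):
    """Score a word from P(phish | w) and the prior P(phish), with 0 < p_phish < 1."""
    total = 0.0
    pairs = ((p_phish_given_w, p_phish), (1.0 - p_phish_given_w, 1.0 - p_phish))
    for cond, prior in pairs:
        if cond > 0.0:  # a zero-probability term contributes nothing (0 * log 0 taken as 0)
            total += cond * math.log(cond / prior)
    return total


def select_feature_words(scores, k):
    """Return the k words with the highest scores, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A word whose presence leaves the phishing probability unchanged scores 0, while a word that occurs only on phishing pages scores highest, so ranking by this value favors words that discriminate between the two classes.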

17. The method of claim 16, further comprising the steps of training the second feature model:

selecting, as sample data, the webpage content of a large number of webpages that have been labeled as containing malicious content and of webpages labeled as not containing malicious content;

generating, from the feature words in the second feature set, the second feature vectors of the webpage content serving as sample data, as training parameters; and

training on the training parameters with a machine learning algorithm to obtain the second feature model.

18. The method of claim 17, further comprising the step of:

updating the sample data within a predetermined time and repeating the training steps, so as to update the second feature model periodically.

19. The method of claim 18, wherein the first predetermined number is between 450 and 550.

20. The method of claim 19, wherein the step of outputting the second result to indicate whether the webpage to be identified contains malicious content comprises:

if the output second result is 1, indicating that the webpage to be identified contains malicious content; and

if the output second result is 0, indicating that the webpage to be identified does not contain malicious content.

21. The method of claim 20, wherein the machine learning algorithm is a support vector machine method.

22. The method of claim 21, wherein the step of extracting the first identity information of the webpage to be identified comprises:

parsing the URL of the webpage to be identified to obtain the domain name of the webpage to be identified; and

using the domain name as the first identity information of the webpage to be identified.

23. The method of claim 22, wherein the step of determining the second identity information from the outbound links comprises:

counting the number of occurrences of each of the outbound links of the webpage to be identified; and

selecting the domain name of the most frequently occurring outbound link as the second identity information.

24. The method of claim 23, wherein the step of comparing the first identity information with the second identity information and outputting the third result comprises:

if the second identity information is inconsistent with the first identity information, outputting a third result of 1, indicating that the webpage to be identified contains malicious content; and

if the second identity information is consistent with the first identity information, outputting a third result of 0, indicating that the webpage to be identified does not contain malicious content.
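Claims 22 to 24 compare the page's own domain with the domain that the page's links point to most often. The sketch below is one possible reading, assuming the links have already been extracted as absolute URLs and treating the full host name as the "domain name"; reducing hosts to a registered domain would need extra data and is omitted. The function names, and the choice to return 0 for a page without links, are assumptions not covered by the claims.

```python
# Illustrative sketch only; names and the no-links behavior are assumptions.
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lower-cased host name of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def third_result(page_url, outbound_links):
    """Return 1 if the most frequently linked domain differs from the page's domain, else 0."""
    first_identity = domain_of(page_url)
    counts = Counter(d for d in map(domain_of, outbound_links) if d)
    if not counts:
        return 0  # assumption: the claims do not say how to treat a page without links
    second_identity = counts.most_common(1)[0][0]
    return 0 if second_identity == first_identity else 1
```

A page hosted on one domain whose links mostly point to a different site, for instance an imitated bank, therefore yields 1, while a page that mainly links to its own domain yields 0.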

25. A system for identifying whether a webpage contains malicious content, the system comprising:

a URL extractor adapted to parse the URL of a webpage to be identified;

a first feature extractor adapted to extract URL features from the URL to generate a first feature set, and further adapted to generate a first feature vector from the first feature set, the first feature extractor comprising:

a numericalization subunit adapted to convert each feature in the first feature set into a numerical feature value and to assemble the feature values into a feature vector;

a normalization subunit adapted to normalize each dimension of the numericalized feature vector to produce the first feature vector; and

a first recognition unit adapted to process the first feature vector with a first feature model and to output a first result indicating whether the webpage to be identified contains malicious content;

a page analyzer adapted to crawl the content of the webpage to be identified and to segment the crawled webpage content into a word sequence;

a second feature extractor adapted to construct a second feature vector, whose dimension is a first predetermined number, according to whether the feature words of a second feature set occur in the word sequence, wherein the second feature set stores the first predetermined number of feature words in advance;

a second recognition unit adapted to process the second feature vector with a second feature model and to output a second result;

a first information acquiring unit adapted to extract first identity information of the webpage to be identified from the URL of the webpage to be identified;

a second information acquiring unit adapted to extract all outbound links of the webpage to be identified and to determine second identity information of the webpage to be identified from the outbound links;

a third recognition unit adapted to compare the first identity information with the second identity information and to output a third result;

a weighting unit adapted to apply a weighting algorithm to the first result, the second result and the third result to obtain a final result; and

a fourth recognition unit adapted to identify the webpage to be identified as containing malicious content if the final result is greater than a threshold, and as not containing malicious content if the final result is not greater than the threshold.

26. The system of claim 25, wherein the weighting factors corresponding to the first result, the second result and the third result are 0.4, 0.4 and 0.2, respectively; and

the threshold is 0.5.

27. The system of claim 26, further comprising:

a judging and filtering unit adapted to determine whether the URL of the webpage to be identified matches a URL in a pre-stored database,

28. The system of claim 27, wherein the first feature set comprises one or more of the following: the URL length, the number of times the http protocol is used, whether the top-level domain is valid, whether an IP address is included, the number of designated characters in the URL, the host string length, the number of designated characters in the host string, the length of the longest string in the host string, the number of designated characters in the path, the path name length, the number of designated characters in the path name, the length of the longest string in the path name, the path depth, the query parameter field length, and whether the URL contains a designated string.

29. The system of claim 28, wherein the normalization subunit is configured to normalize each dimension of the feature vector to the interval [-1, 1]:

30. The system of claim 29, wherein

the URL extractor is further adapted to extract, as sample data, the URLs of a large number of webpages that have been labeled as not containing malicious content and of webpages labeled as containing malicious content;

the first feature extractor is further adapted to form the first feature set from these URLs and to generate the corresponding first feature vectors from the first feature set, as training parameters; and

the system further comprises a first training unit adapted to train on the training parameters with a machine learning algorithm to obtain the first feature model.

31. The system of claim 30, further comprising:

a first updating unit adapted to update the sample data within a predetermined time, generate the first feature vectors of the new sample data, and input the updated first feature vectors into the first feature model for training, so as to update the first feature model periodically.

32. The system of claim 31, wherein

the first updating unit is further adapted to change the dimension of the first feature vector by adding features to or deleting features from the first feature set, so as to generate a new first feature vector.

33. The system of claim 32, wherein

an output first result of 1 indicates that the webpage to be identified contains malicious content; and

an output first result of 0 indicates that the webpage to be identified does not contain malicious content.

34. The system of any one of claims 30-33, wherein the machine learning algorithm is a support vector machine method.

35. The system of claim 34, wherein the page analyzer further comprises:

a word segmenter adapted to segment the webpage content using a dictionary-based word segmentation algorithm, wherein the segmentation algorithm comprises one dictionary, two matching algorithms and four disambiguation rules.

36. The system of claim 35, wherein the second feature extractor comprises:

a matching subunit adapted to search the word sequence, in order, for each feature word in the second feature set,

to assign 1 to the position corresponding to a feature word in the second feature set if that feature word is matched in the word sequence, and

to assign 0 to the position corresponding to a feature word in the second feature set if that feature word is not matched in the word sequence; and

the second feature extractor is further adapted to generate, from the values assigned to the positions of the feature words, the second feature vector whose dimension is the first predetermined number.

37. The system of claim 36, wherein

the page analyzer is further adapted to obtain the webpage content of preset webpages and to segment the obtained webpage content into a word sequence; and

the system further comprises:

a feature set generation unit adapted to compute, for each word in the word sequence, a second feature value characterizing the importance of that word, and to select the first predetermined number of words as feature words according to the second feature value, forming the second feature set.

38. The system of claim 37, wherein the second feature value is defined as the distance between the probability distribution of whether a webpage contains malicious content under the condition that a given word occurs and the probability distribution of whether a webpage contains malicious content.

39. The system of claim 38, wherein the second feature value is the expected cross entropy CE(w) of word w:

40. The system of claim 39, wherein

the feature set generation unit is configured to select the first predetermined number of words as feature words in descending order of second feature value, forming the second feature set.

41. The system of claim 40, wherein

the page analyzer is further adapted to crawl, as sample data, the webpage content of a large number of webpages that have been labeled as not containing malicious content and of webpages labeled as containing malicious content;

the second feature extractor is further adapted to generate, from the feature words in the second feature set, the second feature vectors of the webpage content serving as sample data, as training parameters; and

the system further comprises a second training unit adapted to train on the training parameters with a machine learning algorithm to obtain the second feature model.

42. The system of claim 41, further comprising:

a second updating unit adapted to update the sample data within a predetermined time and repeat the training step, so as to update the second feature model periodically.

43. The system of claim 42, wherein the first predetermined number is between 450 and 550.

44. The system of claim 43, wherein

an output second result of 1 indicates that the webpage to be identified contains malicious content; and

an output second result of 0 indicates that the webpage to be identified does not contain malicious content.

45. The system of claim 44, wherein the machine learning algorithm is a support vector machine method.

46. The system of claim 45, wherein

the first information acquiring unit is further adapted to parse the URL of the webpage to be identified, obtain the domain name of the webpage to be identified, and use the domain name as the first identity information of the webpage to be identified.

47. The system of claim 46, wherein the second information acquiring unit further comprises:

a counting subunit adapted to count the number of occurrences of each of the extracted outbound links of the webpage to be identified; and

the second information acquiring unit is further adapted to select the domain name of the most frequently occurring outbound link as the second identity information.

48. The system of claim 47, wherein the third recognition unit is adapted to: