CN107957872A

CN107957872A - Method for acquiring the complete source code of a website, and method and system for detecting illegal websites

Info

Publication number: CN107957872A
Application number: CN201710940131.1A
Authority: CN
Inventors: 周发; 袁晓彤; 耿光刚; 延志伟; 李晓东
Original assignee: China Internet Network Information Center
Current assignee: China Internet Network Information Center
Priority date: 2017-10-11
Filing date: 2017-10-11
Publication date: 2018-04-24

Abstract

The invention discloses a method for acquiring the complete source code of a website, together with a method and a system for detecting illegal websites. The system comprises a complete source code extraction module, a feature extraction module, and an illegal website identification model. The complete source code extraction module extracts the complete source code of a website. The feature extraction module extracts the text features of each complete source code to obtain the text feature set of that source code, extracts the non-text statistical features of the complete source code, and merges the features of all text feature sets, computing the average feature value of each feature to obtain a text feature file. The illegal website identification model judges whether a website to be identified is an illegal website according to the features of its complete source code, their feature values, and its non-text statistical features. The model is generated from the text feature sets and non-text statistical features of the complete source code of each website in a sample website set, using a machine learning algorithm. The invention improves the accuracy of website identification.

Description

Method for acquiring the complete source code of a website, and method and system for detecting illegal websites

Technical field

The present invention relates to a method for acquiring the complete source code of a website and to a method and system for detecting illegal websites, and belongs to the field of network technology.

Background art

With the development of the Internet, it has entered every field. At the same time, however, the Internet is also used by some traditional illegal industries, for example to sell guns and drugs or to run gambling and pornography operations. These harmful websites may also embed Trojans, viruses, and the like. Such abuses seriously threaten the healthy development of the Internet and harm the physical and mental health and the property of Internet users. To detect harmful websites, researchers have proposed many detection methods.

Blacklists and whitelists are one means of identifying illegal websites. The major browser vendors update their blacklists regularly in order to identify illegal websites and warn users. Although blacklists are effective, their shortcoming is obvious: each browser vendor must update its blacklist frequently and promptly, and an illegal website that has not yet been added to the blacklist cannot be identified.

Heuristic algorithms based on text content are another means of identifying illegal websites. Such algorithms rely on preset harmful keywords and sentences: if a website contains these keywords or sentences, it is treated as an illegal website. These algorithms are too simple and easily misclassify. A normal website, such as a news website, is judged illegal if it contains some of the keywords or sentences. Conversely, as with a blacklist, if the coverage of keywords or sentences is insufficient, an illegal website is not recognized and is treated as a normal website.

As machine learning has come into wide use, it has also been applied to identifying illegal websites. In the paper "A text mining approach to Internet abuse detection", Chen-Huei Chou et al. showed experimentally that algorithms such as naive Bayes, neural networks, support vector machines, and decision trees work well for the two-class identification of illegal websites. However, that paper derives its features only from the text information in the source code, so the problem of inaccurate identification remains.

The builders of illegal websites, for their part, use many anti-detection techniques to evade detection, which further increases the difficulty. In addition, the complete source code of current websites is hard to obtain with conventional methods; if the HTML code that is really and completely displayed in the browser cannot be obtained, accurate detection of a website is difficult.

Summary of the invention

In view of the technical problems in the prior art, the object of the present invention is to provide a method for acquiring the complete source code of a website and a method and system for detecting illegal websites.

Among the illegal websites it collected, the present invention found that many use JavaScript code in their own pages to load and display illegal content dynamically, or do not place the JavaScript code in their own site code at all but fetch it from other addresses by asynchronous loading; the JavaScript code executes only when the page is parsed by a browser. The invention also found that some websites do not put illegal content in their own page source, but nest the content of an illegal page in an <iframe> tag; the <iframe> content is likewise loaded into the displayed page only when the browser parses the original page source. By these means, illegal websites prevent an examiner using a tool such as wget from obtaining the HTML code that is really and completely displayed in the browser, and without that code accurate detection is difficult. The present invention also considers some non-text statistical features of the HTML, such as the number of <iframe> tags in the HTML structure, and in practical use it found that the random forest algorithm performs very well.

The technical solution of the present invention is as follows:

A method for acquiring the complete source code of a website, comprising the steps of:

1) for each target website, using PhantomJS to dynamically load the JavaScript code of the target website and obtain the HTML code produced after the JavaScript has executed;

2) obtaining, from the HTML code, the URLs in the tags that initiate requests, obtaining the HTML code of each URL, and adding it at the corresponding position in the complete source code of the target website;

3) recursively applying the processing of step 2) to obtain the final complete source code of the target website.

Further, the tag that initiates a request is the <iframe> tag.

Further, in step 2), a timeout mechanism is set: if no response from the current URL is received within a set time, the access request to that URL is stopped.
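
The recursion of steps 2) and 3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: `expand_iframes`, `fetch`, and `max_depth` are hypothetical names; `fetch` stands in for the PhantomJS rendering step together with its timeout and is assumed to return `None` when a URL does not respond in time; and replacing each <iframe> element with the fetched HTML is only one possible reading of "adding it at the corresponding position".

```python
import re
from typing import Callable, FrozenSet, Optional

# Matches an <iframe ... src="URL" ...>...</iframe> element and captures the URL.
IFRAME_RE = re.compile(
    r'<iframe\b[^>]*\bsrc=["\']([^"\']+)["\'][^>]*>.*?</iframe>',
    re.IGNORECASE | re.DOTALL,
)


def expand_iframes(
    html: str,
    fetch: Callable[[str], Optional[str]],
    max_depth: int = 5,
    _seen: FrozenSet[str] = frozenset(),
) -> str:
    """Recursively inline the HTML behind every <iframe> in `html`.

    `fetch(url)` is a hypothetical callable: it should return the rendered
    HTML of `url`, or None if the request failed or timed out.
    Relative URLs are not resolved in this sketch.
    """
    if max_depth <= 0:
        return html

    def replace(match: "re.Match[str]") -> str:
        url = match.group(1)
        if url in _seen:          # page already on the current nesting path
            return match.group(0)
        inner = fetch(url)
        if inner is None:         # timeout or error: keep the original tag
            return match.group(0)
        return expand_iframes(inner, fetch, max_depth - 1, _seen | {url})

    return IFRAME_RE.sub(replace, html)


# Usage with an in-memory stand-in for the network.
pages = {
    "a": '<p>A</p><iframe src="b"></iframe>',
    "b": '<p>B</p><iframe src="c"></iframe>',
    "c": "<p>C</p>",
}
full_source = expand_iframes('<iframe src="a"></iframe>', pages.get)
```

The depth limit and the set of URLs already on the current path are additions of this sketch; they keep a page that frames itself from recursing without end.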

A method for detecting illegal websites, comprising the steps of:

obtaining the complete source code of a website to be identified; according to the text feature file in the illegal website identification model, extracting the corresponding features from the complete source code of the website to be identified as the features of that source code, and setting the feature value of each such feature to the feature value of the corresponding feature in the text feature file; and extracting the non-text statistical features of the complete source code of the website to be identified;

inputting the features of the complete source code of the website to be identified, their feature values, and the non-text statistical features into the illegal website identification model, and judging whether the website to be identified is an illegal website;

wherein the illegal website identification model is generated as follows: obtaining the complete source code of each website in a sample website set to obtain a complete source code set; extracting the text features of each complete source code in the set to obtain the text feature set of that source code; extracting the non-text statistical features of each complete source code; and merging the features in all text feature sets and computing the average feature value of each feature to obtain the text feature file;

and generating the illegal website identification model based on the text feature set and non-text statistical features of the complete source code of each website in the sample website set, and on a machine learning algorithm.

Further, the text feature file is obtained as follows: segmenting the Chinese text in the complete source code into words and computing the TF-IDF value of each word; then, based on the information gain of the words, selecting a number of words as the features of the complete source code, and taking the selected features and their TF-IDF values as the text feature set of that source code; and merging the features of all text feature sets, computing the average TF-IDF value of each feature from its TF-IDF values in the different text feature sets, and generating the text feature file from the merged features and their average TF-IDF values.
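
The construction of the text feature file can be illustrated with a small pure-Python sketch. It assumes the pages have already been segmented into word lists, uses one common TF-IDF variant (term frequency times the natural logarithm of N over document frequency), and averages each feature over the documents in which it occurs. The source does not specify the exact TF-IDF formula or the averaging denominator, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set


def tfidf_per_document(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: TF-IDF} mapping per segmented document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for words in docs for word in set(words))
    results = []
    for words in docs:
        counts = Counter(words)
        total = len(words)
        results.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return results


def build_feature_file(feature_sets: List[Dict[str, float]],
                       selected: Set[str]) -> Dict[str, float]:
    """Merge the per-document feature sets and average each selected feature."""
    sums: Dict[str, float] = defaultdict(float)
    hits: Counter = Counter()
    for features in feature_sets:
        for word, value in features.items():
            if word in selected:
                sums[word] += value
                hits[word] += 1
    return {word: sums[word] / hits[word] for word in sums}


# Two toy "pages", already segmented into words.
docs = [["赌博", "平台", "赌博"], ["新闻", "平台"]]
feature_file = build_feature_file(tfidf_per_document(docs), {"赌博", "平台"})
```

A word that occurs in every document, such as 平台 here, gets a TF-IDF value of zero under this variant, which is one reason a separate selection step is useful.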

Further, the statistical features include, for the complete source code, the number of <iframe> tags, the average length of <title> tags, the number of URLs, the number of <div> tags, the number of <ul> tags, and the number of & symbols.
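
One way to compute these six statistics with the Python standard library is sketched below. The source does not define how a URL is recognized or how the length of a <title> tag is measured, so the regular expression, the use of the stripped title text length, and all names in the block are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List


class TagStats(HTMLParser):
    """Counts selected tags and records the text length of each <title>."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = {"iframe": 0, "div": 0, "ul": 0}
        self.title_lengths: List[int] = []
        self._in_title = False
        self._title_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1
        if tag == "title":
            self._in_title = True
            self._title_parts = []

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            text = "".join(self._title_parts).strip()
            self.title_lengths.append(len(text))
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


# Assumed URL pattern: absolute http(s) URLs anywhere in the source text.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')


def statistical_features(html: str) -> Dict[str, float]:
    """Return the six non-text statistical features of an HTML string."""
    parser = TagStats()
    parser.feed(html)
    parser.close()
    titles = parser.title_lengths
    return {
        "iframe_count": parser.counts["iframe"],
        "avg_title_length": sum(titles) / len(titles) if titles else 0.0,
        "url_count": len(URL_RE.findall(html)),
        "div_count": parser.counts["div"],
        "ul_count": parser.counts["ul"],
        "ampersand_count": html.count("&"),
    }


sample = (
    '<html><head><title>abcd</title></head><body><div><ul><li>'
    '<a href="http://example.com/a?x=1&y=2">a</a></li></ul></div>'
    '<iframe src="https://example.org/f"></iframe></body></html>'
)
stats = statistical_features(sample)
```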

Further, the machine learning algorithm is the random forest algorithm.

A system for detecting illegal websites, characterized by comprising a complete source code extraction module, a feature extraction module, and an illegal website identification model, wherein:

the complete source code extraction module is used to extract the complete source code of a website; the websites include the website to be identified and each website in the sample website set; the sample website set includes multiple illegal websites and multiple legitimate websites;

the feature extraction module is used to extract the text features of a complete source code to obtain the text feature set of that source code, to extract the non-text statistical features of the complete source code, and to merge the features of all text feature sets and compute the average feature value of each feature to obtain the text feature file;

the illegal website identification model is used to judge whether the website to be identified is an illegal website according to the features of its complete source code, their feature values, and its non-text statistical features;

wherein the illegal website identification model is generated based on the text feature set and non-text statistical features of the complete source code of each website in the sample website set, and on a machine learning algorithm.

To counter these anti-detection methods of illegal websites, the present invention imitates the parsing process of a browser in order to obtain the HTML code that is really displayed in the browser. First, PhantomJS is used to load the JavaScript code dynamically and obtain the HTML code produced after the JavaScript has executed. Then the URLs in the <iframe> tags are extracted from that HTML. To handle multiple levels of nesting, the invention repeats the above process recursively on these URLs and finally obtains the true, complete HTML code. A timeout mechanism is also added to the code: requests to a website that does not respond are abandoned once the time limit is exceeded, which prevents the program from blocking.

Once true, complete HTML code can be obtained, the present invention uses this program to obtain the HTML code of illegal websites and of normal web pages, completing the construction of the data set. The invention first uses TF-IDF to extract the text features of each website's source code; because the resulting dimensionality is too high, it then performs feature selection. It also computes the average TF-IDF value of each selected feature and saves these values with their features as a feature file. In the feature extraction stage, in addition to the features obtained with TF-IDF, the invention extracts non-text statistical features from the HTML code, such as the number of images, the number of <iframe> tags, the number of links, and the number of <div> tags. Using these features effectively improves the accuracy and recall of the algorithm.
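
The feature selection step relies on information gain. A minimal sketch of that criterion for a word's presence or absence in a page is given below; the binary presence model, the tie-breaking by word, and the function names are assumptions of this illustration, since the source states only that features are ranked by information gain.

```python
import math
from typing import List, Sequence, Set


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = sum(1 for y in labels if y == cls) / total
        result -= p * math.log2(p)
    return result


def information_gain(docs: List[Set[str]], labels: Sequence[int], word: str) -> float:
    """Entropy reduction from splitting the pages on whether `word` occurs."""
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_top_features(docs: List[Set[str]], labels: Sequence[int], k: int) -> List[str]:
    """Return the k words with the highest information gain (ties broken by word)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary,
                    key=lambda w: (-information_gain(docs, labels, w), w))
    return ranked[:k]


# Toy data: label 1 = illegal page, 0 = normal page.
docs = [{"赌博", "平台"}, {"赌博", "注册"}, {"新闻", "平台"}, {"新闻", "注册"}]
labels = [1, 1, 0, 0]
top_two = select_top_features(docs, labels, 2)
```

In this toy data the words 赌博 and 新闻 each separate the two classes perfectly and are selected, while 平台 and 注册 carry no information about the class.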

In the usage stage, the present invention is developed as a browser plug-in. When a user visits a web page, the plug-in obtains the page's HTML code and extracts features using the feature file saved earlier: if the page contains a word that is present in the feature file, the value of that feature is set to the word's TF-IDF value in the feature file. The model is then used to obtain the result for the page.
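
The lookup described above amounts to building a fixed-length vector from the saved feature file. The sketch below assumes that a feature whose word is absent from the page takes the value 0.0 and that the statistical features are appended after the text features; the source states neither detail, and `build_feature_vector` and its parameters are hypothetical names.

```python
from typing import Dict, Iterable, List


def build_feature_vector(
    page_words: Iterable[str],
    feature_file: Dict[str, float],
    feature_order: List[str],
    stat_features: Dict[str, float],
    stat_order: List[str],
) -> List[float]:
    """Build the model input for one page.

    Each text feature takes the averaged TF-IDF value stored in the
    feature file if its word occurs in the page, else 0.0 (assumed).
    """
    present = set(page_words)
    text_part = [feature_file[w] if w in present else 0.0 for w in feature_order]
    stat_part = [float(stat_features[name]) for name in stat_order]
    return text_part + stat_part


feature_file = {"赌博": 0.4, "注册": 0.1}
vector = build_feature_vector(
    page_words=["赌博", "新闻", "赌博"],
    feature_file=feature_file,
    feature_order=["注册", "赌博"],
    stat_features={"iframe_count": 2, "div_count": 10},
    stat_order=["iframe_count", "div_count"],
)
```

Because the values come from the saved file, no TF-IDF computation over a corpus is needed when a single page is classified.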

Compared with the prior art, the positive effects of the present invention are:

(1) By imitating the browser's parsing process, the true, complete HTML code of a website is obtained, so that the anti-detection measures of harmful websites fail. This provides an accurate and complete feature set for the subsequent machine learning, which uses both the text features and the structural features of the web page, and improves the accuracy of the algorithm.

(2) In the feature extraction stage, in addition to the text features of the web page, the non-text statistical features in the HTML code are also considered. A comparison of the final results shows that adding these features improves the recognition of illegal websites by 4%.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Detailed description of the embodiments

To make the above features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying drawing.

The present invention captures the true, complete HTML code, extracts features from it, and then trains a random forest with the extracted features as the training set to obtain an illegal website identification model. The model is used to identify whether a website is an illegal website. All processes are implemented in Python. The method flow of the present invention is shown in Fig. 1 and includes the following steps:

1) The present invention uses PhantomJS, a scriptable headless browser for automated interaction with web pages. PhantomJS has no UI; it provides a JavaScript API that supports automatic navigation, screen capture, and similar operations, making it a browser environment similar to Safari and Chrome. The invention uses PhantomJS to load JavaScript scripts dynamically and thereby obtains the HTML code produced after the JavaScript code has executed.

2) After obtaining the HTML code with the JavaScript code loaded, the present invention looks for the tags on which a browser parsing the HTML would issue a new request. This mainly means the <iframe> tag, which many illegal websites use to hide the HTML code that is actually displayed in the browser. To handle websites with multiple levels of nesting, the invention recursively obtains the HTML code of the URL in each <iframe> tag and adds it at the corresponding position. Through this process the complete, real HTML code is obtained.

3) The present invention constructed a data set of 1800 URLs. The illegal websites in it cover guns, pornography, and gambling, and were all found in actual work. The normal samples cover websites on a variety of topics. Some of the normal URLs were obtained from a search engine with the search expression "site:.cn", on the view that URLs obtained in this way can be taken to be normal websites; the rest were likewise collected in actual work. The HTML code of each domain name in the data set was then obtained with the HTML acquisition method described above.

4) From the acquired HTML code, the present invention first extracts all Chinese text and segments the Chinese text of each page's HTML code with jieba. The segmentation result is used as the input to TF-IDF, which yields the TF-IDF value of every segmented word. Because there are many pages and each page contains much text, the TF-IDF feature space has a very high dimensionality, so feature selection is necessary. Based on information gain, the invention selects the top 600 features in descending order of information gain and constructs the text feature sets based on the HTML text. It also computes the average TF-IDF value of each feature over the training set and saves the features and these values as the feature file. While obtaining the HTML text feature sets, the invention also extracts the non-text statistical features: the number of <iframe> tags, the average length of <title> tags, the number of URLs, the number of <div> tags, the number of <ul> tags, and the number of & symbols.

5) After the feature set is obtained, the present invention divides the data set from step 4) into two parts: 1500 samples form the training set and 300 form the validation set. The training set is used to train the random forest algorithm, a supervised learning method, to obtain an illegal website identification model. The validation set is then used to evaluate the model; the average accuracy, recall, and F1 value of the model all reach 96%.

6) After training is complete, the present invention saves the illegal website identification model and develops a browser plug-in in Python. When a user browses a web page, the plug-in obtains the HTML code and extracts the page's features using the saved feature file: when the page contains a word from the feature file, the feature value (TF-IDF value) of the matched word is set to the corresponding feature value (TF-IDF value) in the feature file. The non-text features of the page are extracted as well. With the features in hand, the plug-in calls the saved model with the page's feature vector as input and judges whether the page is illegal.
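
The split and the evaluation measures of step 5) can be illustrated as follows. The sketch assumes a random split and treats "illegal" as the positive class; the source does not say how the 1500 and 300 samples were chosen or how the reported averages were computed. The toy labels are made up to show the arithmetic and are not results from the invention.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_dataset(samples: Sequence, n_train: int, seed: int = 0) -> Tuple[List, List]:
    """Shuffle with a fixed seed and split into training and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                   positive: int = 1) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# 1800 sample indices split 1500 / 300, as in step 5).
train_ids, validation_ids = split_dataset(range(1800), 1500)

# Made-up labels: two true positives, one miss, one false alarm, two true negatives.
metrics = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```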

The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the invention; the scope of protection of the present invention is defined by the claims.

Claims

1. A method for acquiring the complete source code of a website, comprising the steps of:

1) for each target website, using PhantomJS to dynamically load the JavaScript code of the target website and obtain the HTML code produced after the JavaScript has executed;

2) obtaining, from the HTML code, the URLs in the tags that initiate requests, obtaining the HTML code of each URL, and adding it at the corresponding position in the complete source code of the target website;

2. The method of claim 1, characterized in that the tag that initiates a request is the <iframe> tag.

3. The method of claim 1 or 2, characterized in that, in step 2), a timeout mechanism is set: if no response from the current URL is received within a set time, the access request to that URL is stopped.

4. A method for detecting illegal websites, comprising the steps of:

obtaining the complete source code of a website to be identified; according to the text feature file in the illegal website identification model, extracting the corresponding features from the complete source code of the website to be identified as the features of that source code, and setting the feature value of each such feature to the feature value of the corresponding feature in the text feature file; and extracting the non-text statistical features of the complete source code of the website to be identified;

inputting the features of the complete source code of the website to be identified, their feature values, and the non-text statistical features into the illegal website identification model, and judging whether the website to be identified is an illegal website;

wherein the illegal website identification model is generated as follows: obtaining the complete source code of each website in a sample website set to obtain a complete source code set; extracting the text features of each complete source code in the set to obtain the text feature set of that source code; extracting the non-text statistical features of each complete source code; and merging the features in all text feature sets and computing the average feature value of each feature to obtain the text feature file;

and generating the illegal website identification model based on the text feature set and non-text statistical features of the complete source code of each website in the sample website set, and on a machine learning algorithm.

5. The method of claim 4, characterized in that the text feature file is obtained as follows: segmenting the Chinese text in the complete source code into words and computing the TF-IDF value of each word; then, based on the information gain of the words, selecting a number of words as the features of the complete source code, and taking the selected features and their TF-IDF values as the text feature set of that source code; and merging the features of all text feature sets, computing the average TF-IDF value of each feature from its TF-IDF values in the different text feature sets, and generating the text feature file from the merged features and their average TF-IDF values.

6. The method of claim 4, characterized in that the statistical features include, for the complete source code, the number of <iframe> tags, the average length of <title> tags, the number of URLs, the number of <div> tags, the number of <ul> tags, and the number of & symbols.

7. The method of claim 4, characterized in that the machine learning algorithm is the random forest algorithm.

8. A system for detecting illegal websites, characterized by comprising a complete source code extraction module, a feature extraction module, and an illegal website identification model, wherein:

the complete source code extraction module is used to extract the complete source code of a website; the websites include the website to be identified and each website in the sample website set; the sample website set includes multiple illegal websites and multiple legitimate websites;

the feature extraction module is used to extract the text features of a complete source code to obtain the text feature set of that source code, to extract the non-text statistical features of the complete source code, and to merge the features of all text feature sets and compute the average feature value of each feature to obtain the text feature file;

the illegal website identification model is used to judge whether the website to be identified is an illegal website according to the features of its complete source code, their feature values, and its non-text statistical features;

wherein the illegal website identification model is generated based on the text feature set and non-text statistical features of the complete source code of each website in the sample website set, and on a machine learning algorithm.

9. The system of claim 8, characterized in that the feature extraction module segments the Chinese text in the complete source code into words and computes the TF-IDF value of each word; then, based on the information gain of the words, selects a number of words as the features of the complete source code and takes the selected features and their TF-IDF values as the text feature set of that source code; and merges the features of all text feature sets, computes the average TF-IDF value of each feature from its TF-IDF values in the different text feature sets, and generates the text feature file from the merged features and their average TF-IDF values.

10. The system of claim 8, characterized in that the statistical features include, for the complete source code, the number of <iframe> tags, the average length of <title> tags, the number of URLs, the number of <div> tags, the number of <ul> tags, and the number of & symbols.