CN109284465B

CN109284465B - URL-based web page classifier construction method and classification method thereof

Info

Publication number: CN109284465B
Application number: CN201811025751.3A
Authority: CN
Inventors: 孙玉霞; 赵晶晶; 仇之
Original assignee: Jinan University
Current assignee: Jinan University
Priority date: 2018-09-04
Filing date: 2018-09-04
Publication date: 2021-03-19
Anticipated expiration: 2038-09-04
Also published as: CN109284465A

Abstract

The invention discloses a URL-based web page classifier construction method and a classification method using the classifier. First, the URLs of a plurality of web pages are obtained, each URL is labeled with its web page attribute, and the labeled URLs are used as training samples to form a training sample set. Each training sample in the training sample set is segmented into words at selected characters and then converted into word vectors. A convolutional neural network is then trained with the word vectors of the labeled training samples as input to obtain the web page classifier. For a web page to be classified, its URL is first obtained as a test sample; the test sample is segmented into words at the selected characters and converted into word vectors; the word vectors of the test sample are input into the constructed web page classifier, which outputs the classification result. The method greatly improves the classification accuracy for malicious web pages.

Description

URL-based web page classifier construction method and classification method thereof

Technical Field

The invention relates to the technical field of information security, in particular to a webpage classifier construction method based on a Uniform Resource Locator (URL) and a classification method thereof.

Background

The openness and virtual nature of the Internet pose serious challenges to privacy, data and transaction security, and in recent years the use of malicious web pages to commit cybercrime has become rampant. According to statistics, nearly one third of web pages are potentially malicious. Malicious web pages attack users through spam, phishing and similar means, so that users with little security awareness suffer losses such as loss of funds and theft of private information, which seriously threatens their property and information security. How to identify malicious web pages promptly and effectively has therefore become an important and urgent problem.

In the prior art, a malicious web page is generally identified by inspecting the content or the behavior of the page. Content-based identification requires inspecting the page's text and images, malicious code fragments, behavior records in server or proxy logs, and so on; it therefore cannot avoid the difficulty that page content is changeable and can be encrypted or replaced with equivalent content. Behavior-based identification must face the problem that the dynamic behavior of a web page is difficult to trigger and track.

Disclosure of Invention

The first purpose of the present invention is to overcome the disadvantages and shortcomings of the prior art, and to provide a method for constructing a web page classifier based on a Uniform Resource Locator (URL), where the web page classifier constructed by the method greatly improves the classification accuracy of malicious web pages.

The second purpose of the present invention is to provide a URL-based web page classification method implemented by the classifier constructed as above.

The first purpose of the invention is realized by the following technical scheme: a web page classifier construction method based on URL includes the following steps:

step S1, obtaining the URLs of a plurality of web pages, labeling each URL with its web page attribute, and using the labeled URLs as training samples to form a training sample set;

step S2, for each training sample in the training sample set, segmenting the training sample into words at selected characters and then converting it into word vectors;

and step S3, training a convolutional neural network with the word vectors of the labeled training samples in the training sample set as input, to obtain the web page classifier.

Preferably, in step S1, the URLs of the web pages are obtained from benign and malicious URL repositories, and the training sample set includes a certain number of URLs whose web page attribute is malicious and a certain number of URLs whose web page attribute is benign.

Preferably, in step S2, the selected characters include "?", "&", "-" and "#".

Preferably, in step S2, Word2vec is used to convert the word segmentation result of each training sample into word vectors.

Further, in step S2, when the word vectors are obtained using Word2vec, the following parameters are set: word embedding dimension embedding-size, context window size window, and minimum word frequency min_count.

Preferably, the convolutional neural network is constructed to include, from input to output, a first part, a second part, a third part, a fourth part, and a fifth part in this order; wherein:

the first part is an input layer and is used for inputting word vectors of all training samples;

the second part comprises a first convolution layer, a first pooling layer, a second convolution layer and a second pooling layer in sequence from input to output and is used for extracting context semantics of various degrees; the first convolution layer and the second convolution layer respectively comprise convolution kernels with three sizes, and the first convolution layer and the second convolution layer are the same in size;

the third part is a vector merging layer and is used for merging the convolution results of the convolution kernels of the second part into a feature vector;

the fourth part is a fully connected layer comprising a first fully connected layer and a second fully connected layer; the first fully connected layer performs Dropout processing on the feature vector, and the second fully connected layer obtains, through a classifier, the category with the highest score for the feature vector;

and the fifth part is an output layer and is used for outputting a classification result.
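The data flow of the second and third parts can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the network of the invention: each token is reduced to a single scalar instead of a full word vector, the kernel widths (2, 3, 4), the uniform kernel weights and the pooling size are hypothetical, and the second pooling is taken over the whole remaining sequence. The sketch shows three kernel sizes processed in parallel, each through convolution, pooling, convolution and pooling, with the results merged into one feature vector.

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution (cross-correlation) of a scalar sequence, followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(seq[i + j] * kernel[j] for j in range(k)))
            for i in range(len(seq) - k + 1)]

def max_pool(seq, size):
    """Non-overlapping max pooling; a trailing partial window is kept."""
    return [max(seq[i:i + size]) for i in range(0, len(seq), size)]

def branch(seq, kernel):
    """conv -> pool -> conv -> pool for one kernel size, reduced to one feature.

    Assumes the sequence is long enough to survive both convolutions.
    """
    x = max_pool(conv1d(seq, kernel), 2)   # first convolution and pooling
    x = conv1d(x, kernel)                  # second convolution, same kernel size
    return max(x)                          # second pooling over what remains

def merged_features(seq, kernels):
    """Vector merging: one feature per kernel size, concatenated."""
    return [branch(seq, kernel) for kernel in kernels]

# Hypothetical input: 12 tokens, one scalar each; hypothetical kernel widths 2, 3, 4.
seq = [1.0] * 12
kernels = [[1.0 / k] * k for k in (2, 3, 4)]
features = merged_features(seq, kernels)
print(features)
```

In a real implementation each token would carry a full word vector and each kernel size would have many learned filters; the merged vector would then be passed to the fully connected layers.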

Preferably, after the training sample set is obtained in step S1, deduplication processing is performed on it, specifically as follows: an initial value is selected for N, and the first N characters of each training sample in the training sample set are obtained; among the URLs in the training sample set whose first N characters are identical, only one is retained after deduplication; it is then judged whether the total number of training samples in the training sample set is less than or equal to a threshold; if not, the value of N is reduced and the same processing is performed again, until the total number of training samples in the training sample set is reduced to less than or equal to the threshold. Each training sample in the final training sample set obtained after deduplication is then segmented into words at the selected characters and converted into word vectors.

Furthermore, N is an integer of 20-30.

The second purpose of the invention is realized by the following technical scheme: a webpage classification method based on URL includes the following steps:

step X1, aiming at the webpage needing to be classified, firstly, acquiring the URL of the webpage as a test sample; then, carrying out word segmentation processing on the test sample through the selected characters, and finally converting the test sample into word vectors;

and step X2, inputting the word vectors of the test sample into the web page classifier constructed by the method of the first purpose of the invention, and outputting the classification result through the web page classifier.

Preferably, in step X1, each test sample is segmented into words at the selected characters "?", "&", "-" and "#";

in step X1, Word2vec is used to convert the word segmentation result of each test sample into word vectors.

Compared with the prior art, the invention has the following advantages and effects:

(1) In the URL-based web page classifier construction method of the invention, the URLs of a plurality of web pages are first obtained, each URL is labeled with its web page attribute, and the labeled URLs form a training sample set; each training sample is segmented into words at selected characters and then converted into word vectors; a constructed convolutional neural network model is obtained, and the convolutional neural network is trained with the word vectors of the labeled training samples in the training sample set as input to obtain the web page classifier. The method of the invention thus builds the web page classifier by training a convolutional neural network on the lexical features of web page URLs. Because the URL of a web page is static and does not change, the classification result of the resulting classifier is not affected by the content or the dynamic behavior of the web page, and the classification accuracy for malicious web pages can be greatly improved. Compared with prior-art web page detection methods, the method of the invention has the advantages of simple operation, a high recall rate, a low false alarm rate and a low missed detection rate.

(2) In the URL-based web page classifier construction method of the invention, each training sample is segmented into words at selected characters. A URL is the unique address of each piece of information on the network and consists of three parts: the resource type, the domain name of the host where the resource is located, and the file name of the resource. Its basic format is: protocol://username:password@subdomain. … parameter … value … #flag. The three parts are separated by "/", the host name and the domain name are separated by ".", and the common separators for passing parameters are "?", "&" and "-". In general, phishing web pages manipulate the domain name and host name and carry out malicious behaviors such as domain-name obfuscation, XSS cross-site scripting attacks and SQL injection. The method of the invention selects six separators, including "?", "&", "-" and "#", to split the URL, so that the important information in the URL can be extracted, further improving the classification accuracy of the constructed web page classifier.

(3) In the method for constructing the webpage classifier based on the URL, a convolutional neural network is constructed from input to output and sequentially comprises a first part, a second part, a third part, a fourth part and a fifth part; the structures of all parts are specially set based on the vocabulary characteristics, so the web page classifier obtained by constructing the convolutional neural network training is more targeted for web page classification.

(4) The URL-based web page classifier construction method of the invention performs deduplication on the training sample set: a value N is selected and the first N characters of each training sample in the training sample set are obtained; among the URLs in the training sample set whose first N characters are identical, only one is retained after deduplication; it is then judged whether the number of training samples in the training sample set is less than or equal to a threshold; if not, the value of N is reduced and the same processing is performed until the number of training samples in the training sample set is less than or equal to the threshold. In the invention, deduplication greatly reduces the number of repeated training samples in the training sample set and improves the precision of the training sample set; the training sample set obtained in this way therefore reduces computational complexity and speeds up construction of the web page classifier while preserving the classification accuracy of the constructed classifier.

(5) The invention relates to a webpage classification method based on URL, aiming at the webpage needing classification, firstly acquiring the URL of the webpage as a test sample; performing word segmentation processing on each test sample through the selected characters, and finally converting the word into a word vector; and inputting the word vectors of the test samples into the web page classifier constructed by the method, and outputting a classification result through the web page classifier. Compared with other classification methods in the prior art, the webpage classification method has the advantage of higher malicious webpage detection rate.

Drawings

FIG. 1 is a flowchart of a method for constructing a URL-based web page classifier according to the present invention.

FIG. 2 is a diagram of a convolutional neural network model constructed in accordance with the present invention.

FIG. 3 is a flowchart of the deduplication processing of the training sample set in embodiment 2 of the present invention.

FIG. 4 shows the storage format of the training sample set in embodiment 2 of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.

Example 1

The invention discloses a method for constructing a webpage classifier based on URL (Uniform resource locator), which comprises the following steps as shown in figure 1:

Step S1, obtaining the URLs of a plurality of web pages, labeling each URL with its web page attribute, and using the labeled URLs as training samples to form a training sample set; in this embodiment, the URLs of the web pages are obtained from benign and malicious URL repositories, and the training sample set includes URLs whose web page attribute is malicious and URLs whose web page attribute is benign.

Step S2, for each training sample in the training sample set, performing word segmentation processing on each training sample by using the selected character, and then converting the training sample into a word vector.

In this embodiment, the characters selected in this step include "?", "&", "-" and "#", i.e., each training sample is segmented into words at these characters. For example, suppose a training sample corresponds to the URL:

tudu-free.blogspot.com/2008/02/jogos-java-aplicativos.html#footer-wrap2

Then, after word segmentation at the selected characters, the result is: 'tudu', 'free', 'blogspot', 'com', '2008', '02', 'jogos', 'java', 'aplicativos', 'html', 'footer', 'wrap2'.
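The segmentation step can be sketched in Python as follows. This is a minimal illustration, not the implementation of the invention: the text names "?", "&", "-" and "#" as selected characters, while "." and "/" are added here as an assumption because the worked example above is also split at those characters. The names DELIMITERS and tokenize_url are hypothetical.

```python
import re

# Assumed delimiter set: "?", "&", "-" and "#" are named in the text;
# "." and "/" are inferred from the worked example and are an assumption.
DELIMITERS = "?&-#./"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]+")

def tokenize_url(url):
    """Split a URL at the selected characters and drop empty pieces."""
    return [tok for tok in _SPLIT_RE.split(url) if tok]

tokens = tokenize_url(
    "tudu-free.blogspot.com/2008/02/jogos-java-aplicativos.html#footer-wrap2"
)
print(tokens)
```

Consecutive delimiters are treated as a single split point, so no empty tokens reach the word-vector step.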

In this embodiment, Word2vec is used to convert the word segmentation result of each training sample into word vectors. When Word2vec is used to obtain the word vectors, the following parameters are set: word embedding dimension embedding-size, context window size window, and minimum word frequency min_count. In this embodiment, embedding-size, window and min_count are set to 128, 5 and 4, respectively.
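To make concrete what two of these parameters control, the following plain-Python sketch builds a vocabulary filtered by the minimum word frequency and lists the (center, context) token pairs that fall within the context window. It is not Word2vec itself and trains no vectors; the embedding dimension only fixes the length of each resulting vector and is recorded here as a constant. The sketch assumes, as is common in Word2vec implementations, that tokens below the minimum frequency are discarded before windows are formed. The function names and the toy corpus are hypothetical.

```python
from collections import Counter

# Values taken from this embodiment.
PARAMS = {"embedding_size": 128, "window": 5, "min_count": 4}

def build_vocab(tokenized_urls, min_count=PARAMS["min_count"]):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(tok for url in tokenized_urls for tok in url)
    return {tok for tok, n in counts.items() if n >= min_count}

def context_pairs(tokens, vocab, window=PARAMS["window"]):
    """Yield (center, context) pairs of in-vocabulary tokens within the window."""
    kept = [t for t in tokens if t in vocab]
    for i, center in enumerate(kept):
        lo, hi = max(0, i - window), min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, kept[j]

# Hypothetical toy corpus: "rare" occurs once and is filtered out.
corpus = [["blogspot", "com", "html"]] * 4 + [["rare", "com"]]
vocab = build_vocab(corpus)
pairs = list(context_pairs(["rare", "blogspot", "com"], vocab, window=1))
print(sorted(vocab), pairs)
```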

In this embodiment, the convolutional neural network is constructed as shown in Fig. 2. Specifically, from input to output it comprises, in order, a first part, a second part, a third part, a fourth part and a fifth part; wherein:

the first part is an input layer and is used for inputting the word vectors of the training samples;

the second part comprises a first convolution layer (convolution layer 1), a first pooling layer (pooling layer 1), a second convolution layer (convolution layer 2) and a second pooling layer (pooling layer 2) in sequence from input to output and is used for extracting context semantics of various degrees; the first convolution layer and the second convolution layer both comprise convolution kernels with three sizes, and the first convolution layer and the second convolution layer are the same in size.

And the third part is a vector merging layer and is used for merging the convolution results of the convolution kernels of the second part into a feature vector.

The fourth part is a fully connected layer, comprising a first fully connected layer (fully connected layer 1) and a second fully connected layer (fully connected layer 2). The first fully connected layer performs Dropout processing on the feature vector; in this embodiment, Dropout is set to 0.3. The second fully connected layer obtains the category with the highest score for the feature vector through a softmax classifier;

and the fifth part is an output layer and is used for outputting the classification result.
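The Dropout and softmax operations of the fourth part can be illustrated in plain Python as follows. This is a sketch under assumptions: the text gives only the value 0.3, so treating it as the probability of dropping a feature, and rescaling the kept features (inverted dropout), are assumptions; the class order ("benign", "malicious") and the example scores are hypothetical.

```python
import math
import random

def dropout(features, rate=0.3, rng=random):
    """Inverted dropout: zero each feature with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [f / keep if rng.random() < keep else 0.0 for f in features]

def softmax(scores):
    """Numerically stable softmax over class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores, labels=("benign", "malicious")):
    """Return the label with the highest softmax probability and that probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical class scores produced by the second fully connected layer.
label, prob = predict([0.2, 1.5])
print(label, round(prob, 3))
```

Dropout is applied only during training; at classification time the feature vector is passed through unchanged.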

The network layer parameters of the convolutional neural network model constructed in this embodiment are shown in table 1 below:

TABLE 1

The method of this embodiment therefore builds the web page classifier by training a convolutional neural network on the lexical features of web page URLs. Because the URL of a web page is static and does not change, the classification result of the resulting classifier is not easily affected by the content or the dynamic behavior of the web page, and the classification accuracy for malicious web pages can be greatly improved. In addition, compared with prior-art web page detection methods, the method of the invention has the advantages of simple operation, a high recall rate, a low false alarm rate and a low missed detection rate.
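The quality measures named above, such as the recall rate and the false report rates, are conventionally computed from the counts of a confusion matrix. The following sketch shows the standard definitions; treating "malicious" as the positive class is an assumption, and the counts in the usage line are hypothetical and are not results from this document.

```python
def detection_rates(tp, fp, tn, fn):
    """Standard rates from confusion counts, with 'malicious' as the positive class.

    tp: malicious URLs flagged as malicious   fn: malicious URLs missed
    fp: benign URLs flagged as malicious      tn: benign URLs passed as benign
    """
    return {
        "recall": tp / (tp + fn),               # share of malicious URLs detected
        "false_alarm_rate": fp / (fp + tn),     # share of benign URLs wrongly flagged
        "missed_detection_rate": fn / (fn + tp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for illustration only.
rates = detection_rates(tp=90, fp=5, tn=95, fn=10)
print(rates)
```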

The embodiment also discloses a webpage classification method based on the URL, which comprises the following steps:

Step X1, for a web page to be classified, first obtaining the URL of the web page as a test sample; then segmenting the test sample into words at the selected characters, and finally converting it into word vectors.

In this embodiment, in step X1, each test sample is segmented into words at the selected characters "?", "&", "-" and "#", and Word2vec is used to convert the word segmentation result of the test sample into word vectors. When Word2vec is used to obtain the word vectors, the following parameters are set: word embedding dimension embedding-size, context window size window, and minimum word frequency min_count. In this embodiment, embedding-size, window and min_count are set to 128, 5 and 4, respectively.

Step X2, inputting the word vectors of the test samples into the web page classifier constructed by the method of this embodiment, and outputting the classification result through the web page classifier.

Example 2

The embodiment discloses a method for constructing a web page classifier based on a URL, which is different from the method for constructing a web page classifier based on a URL in embodiment 1 only as follows:

In this embodiment, after the training sample set is obtained in step S1, the method further includes a step of performing deduplication processing on the training sample set, as shown in Fig. 3, specifically as follows: an initial value is selected for N, and the first N characters of each training sample in the training sample set are obtained; among the URLs in the training sample set whose first N characters are identical, only one is retained after deduplication; it is then judged whether the total number of training samples in the training sample set is less than or equal to a threshold; if not, the value of N is reduced and the same processing is performed again, until the total number of training samples in the training sample set is reduced to less than or equal to the threshold. Each training sample in the final training sample set obtained after deduplication is then segmented into words at the selected characters and converted into word vectors. In this embodiment, N is an integer from 20 to 30.

In this embodiment, N is first set to an initial value, for example 30, and the URLs in the training sample set whose first N characters are identical are found; for each group of URLs sharing the same first N characters, only one is retained after deduplication. If, after this processing, the total number of samples in the training sample set is still greater than the threshold, the value of N is reduced by a fixed amount each time, for example 5, and the same processing is performed until the number of samples in the training sample set is less than or equal to the threshold. In this embodiment the threshold may be set to 100,000.

For example, let the total number of training samples in the training sample set be C and the initial value of N be 30. N is first set to 30. If the training sample set contains x1 URLs whose first 30 characters are a1, a2, a3, … a30 and y1 URLs whose first 30 characters are b1, b2, b3, … b30, deduplication proceeds as follows: x1-1 of the URLs whose first 30 characters are a1, a2, a3, … a30 are deleted, leaving only one such URL; y1-1 of the URLs whose first 30 characters are b1, b2, b3, … b30 are deleted, leaving only one such URL. After this deduplication, it is judged whether the total number C of samples in the training sample set is less than or equal to 100,000; if not, N is reduced to 25. Next, if the training sample set contains x2 URLs whose first 25 characters are c1, c2, c3, … c25 and y2 URLs whose first 25 characters are d1, d2, d3, … d25, deduplication proceeds as follows: x2-1 of the URLs whose first 25 characters are c1, c2, c3, … c25 are deleted, leaving only one such URL; y2-1 of the URLs whose first 25 characters are d1, d2, d3, … d25 are deleted, leaving only one such URL. After this deduplication, it is again judged whether the total number C of samples in the training sample set is less than or equal to 100,000; if not, N is reduced to 20. These steps are repeated until the total number C of samples in the training sample set is less than or equal to 100,000.
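The deduplication loop described above can be sketched in Python as follows. This is a minimal illustration under assumptions: the text does not say which URL of a duplicate group is retained, so keeping the first one encountered is an assumption, as is stopping at a lower bound for N when the threshold cannot be reached. The function names and parameter defaults other than the values 30, 5 and 100,000 are hypothetical.

```python
def dedup_by_prefix(urls, n):
    """Keep one URL (here: the first seen) for each distinct n-character prefix."""
    seen = set()
    kept = []
    for url in urls:
        prefix = url[:n]
        if prefix not in seen:
            seen.add(prefix)
            kept.append(url)
    return kept

def shrink_training_set(urls, threshold=100_000, n_start=30, n_min=20, step=5):
    """Deduplicate with decreasing N until the set size is at most the threshold.

    Returns the remaining URLs and the last N used. Stopping at n_min even if
    the threshold has not been reached is an assumption of this sketch.
    """
    samples = list(urls)
    n = n_start
    while True:
        samples = dedup_by_prefix(samples, n)
        if len(samples) <= threshold or n - step < n_min:
            return samples, n
        n -= step

# Hypothetical toy data with small N values so the effect is visible.
urls = ["example.com/aaaa/1", "example.com/aaaa/2",
        "example.com/bbbb/1", "other.org/x"]
kept, final_n = shrink_training_set(urls, threshold=2, n_start=18, n_min=8, step=5)
print(kept, final_n)
```

Because a shorter prefix can only merge groups, each pass leaves the set the same size or smaller, so the loop makes steady progress toward the threshold.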

In this step of this embodiment, the URLs of 200000 web pages are obtained from benign and malicious URL repositories. The first N characters of each URL are extracted with the LEFT function in Excel, deduplication is performed according to these first N characters, and the value of N is adjusted continually, i.e., gradually reduced. Manual screening is then performed to delete invalid data, finally yielding about 50000 URLs of benign and malicious web pages, and the processed data are labeled as good (benign) or bad (malicious). In this embodiment, these are combined with the valid data in existing public URL data sets, so that 122331 URLs are finally obtained as training samples. Fig. 4 shows the 49060th to 49076th training samples; the training sample set is stored in CSV format.
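Storing labeled samples in a delimited text file of this kind can be sketched with Python's standard csv module. The column names "url" and "label", the label strings and the two sample rows are hypothetical, since the exact layout of Fig. 4 is not reproduced in the text.

```python
import csv
import io

def write_samples(rows, fileobj):
    """Write (url, label) rows with a header line."""
    writer = csv.writer(fileobj)
    writer.writerow(["url", "label"])   # hypothetical column names
    writer.writerows(rows)

def read_samples(fileobj):
    """Read back the rows as (url, label) tuples."""
    reader = csv.DictReader(fileobj)
    return [(row["url"], row["label"]) for row in reader]

# Hypothetical samples; an in-memory buffer stands in for a file on disk.
samples = [("example.com/index.html", "good"),
           ("example.test/login?id=1&next=home", "bad")]
buffer = io.StringIO()
write_samples(samples, buffer)
buffer.seek(0)
loaded = read_samples(buffer)
print(loaded)
```

Using the csv module rather than manual string joining keeps URLs that contain commas or quotes intact, because the writer quotes such fields automatically.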

This embodiment also discloses a URL-based web page classification method, which differs from the web page classification method of embodiment 1 only in that step X2 uses the web page classifier obtained by the web page classifier construction method of this embodiment.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A webpage classifier construction method based on URL is characterized by comprising the following steps:

step S1, obtaining URLs of a plurality of web pages, labeling each URL with its web page attribute, and using the labeled URLs as training samples to form a training sample set;

step S2, for each training sample in the training sample set, segmenting the training sample into words at selected characters and then converting it into word vectors; the selected characters include "?", "&", "-" and "#";

step S3, training a convolutional neural network with the word vectors of the labeled training samples in the training sample set as input, to obtain a web page classifier;

in step S1, acquiring URLs of multiple web pages from the benign and malicious URL repository, where the training sample set includes a certain number of URLs whose web page attributes are malicious and a certain number of URLs whose web page attributes are benign;

after the training sample set is obtained in step S1, deduplication processing is performed on the training sample set, specifically as follows: an initial value is selected for N, and the first N characters of each training sample in the training sample set are obtained; among the URLs in the training sample set whose first N characters are identical, only one is retained after deduplication; it is then judged whether the total number of training samples in the training sample set is less than or equal to a threshold; if not, the value of N is reduced and the same processing is performed again, until the total number of training samples in the training sample set is reduced to less than or equal to the threshold; each training sample in the final training sample set obtained after deduplication is then segmented into words at the selected characters and converted into word vectors.

2. The method for constructing a URL-based web page classifier according to claim 1, wherein in step S2, Word2vec is used to convert the word segmentation result of each training sample into word vectors.

3. The method for constructing a URL-based web page classifier according to claim 1, wherein in step S2, when the word vectors are obtained using Word2vec, the following parameters are set: word embedding dimension embedding-size, context window size window, and minimum word frequency min_count.

4. The URL-based web page classifier construction method according to claim 1, wherein the convolutional neural network is constructed to include, in order from input to output, a first part, a second part, a third part, a fourth part, and a fifth part; wherein:

5. The URL-based web page classifier building method according to claim 1, wherein N is an integer of 20-30.

6. A webpage classification method based on URL is characterized by comprising the following steps:

step X1, for a web page to be classified, first obtaining the URL of the web page as a test sample; then segmenting the test sample into words at the selected characters, and finally converting it into word vectors;

step X2, inputting the word vectors of the test samples into the web page classifier constructed by the method of any one of claims 1 to 5, and outputting the classification result through the web page classifier.

7. The URL-based web page classification method according to claim 6,

in said step X1, each test sample is segmented into words at the selected characters "?", "&", "-" and "#";