CN115941327A

CN115941327A - Multilayer malicious URL identification method based on learning type bloom filter

Info

Publication number: CN115941327A
Application number: CN202211570363.XA
Authority: CN
Inventors: 孟虎
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2022-12-08
Filing date: 2022-12-08
Publication date: 2023-04-07

Abstract

The application relates to a multi-layer malicious URL identification method and a malicious URL identification model construction method based on a learning-type bloom filter. The model construction method comprises the following steps: initializing a pre-standard bloom filter, and hash-mapping the pre-filter construction data set into the initialized pre-standard bloom filter to construct a pre-filter; training a classifier multiple times on a training data set to obtain a trained classifier; and initializing a post-standard bloom filter, and hash-mapping the data that the trained classifier misjudges as malicious URLs into the initialized post-standard bloom filter to construct a post-filter. The application introduces a pre-filter to exclude most malicious URLs; the classifier then reduces the false negative rate, and the post-filter finally ensures that the false negative rate is 0. Using the pre-filter reduces the false positive probability of the classifier, enhances the robustness of the model, and further reduces the space overhead of the post-filter.

Description

Multilayer malicious URL identification method based on learning type bloom filter

Technical Field

The application relates to the technical field of network information security, in particular to a multi-layer malicious URL identification method based on a learning type bloom filter.

Background

While the internet brings rich resources and convenient services to users, its openness and anonymity have also made it a platform for network attacks. Network attacks by criminals on individuals pose a huge threat to the security of private information and property, and network attacks aimed at national financial and government data platforms can cause irreparable loss. Among the various network security problems, attacks based on malicious web pages are numerous. Malicious web pages spread widely, and a single malicious web page can be accessed by thousands of users within a few minutes. In the big data era, accurate and fast identification of malicious web pages has become an urgent and challenging task.

At present, malicious URL filtering technology falls into three main categories: blacklist and whitelist filtering, malicious URL detection based on machine learning, and malicious URL identification based on deep learning. However, blacklist and whitelist technology is essentially "after-the-fact" detection: although it lets the browser block the loading of a malicious web page, it can only block known types and lags seriously behind new threats, which easily leads to missed detections. Malicious URL detection based on machine learning requires manual extraction of data features and has difficulty handling complex data sets. Detection based on deep learning can extract features from the data automatically, but because URLs keep growing in number and are updated rapidly, the data must be adjusted continually and the model retrained to keep classification accurate; the training has to run on hardware such as graphics processors, and the cost is high.

Disclosure of Invention

In order to overcome at least one of the deficiencies in the prior art, embodiments of the present application provide a multi-layer malicious URL identification method based on a learning-type bloom filter.

In a first aspect, a method for constructing a malicious URL identification model is provided, including:

dividing the URL data set into a pre-filter construction data set and a training data set;

initializing a pre-standard bloom filter, and mapping the pre-filter construction data set into the initialized pre-standard bloom filter in a hash mode to construct a pre-filter;

training the classifier multiple times based on the training data set to obtain a trained classifier; the classifier comprises a convolutional neural network CNN and a recurrent neural network GRU;

initializing a post-standard bloom filter, mapping data which are misjudged as malicious URLs by the trained classifier into the initialized post-standard bloom filter in a hash manner, and constructing the post-filter;

and the pre-filter, the trained classifier and the post-filter form a malicious URL recognition model.

In one embodiment, partitioning the URL data set into a pre-filter build data set and a training data set includes:

the URL data set comprises benign URLs and malicious URLs, the benign URLs are divided into two parts according to a first set proportion, namely s1 and s2, the malicious URLs are divided into two parts according to a second set proportion, namely s3 and s4, the s1 forms a pre-filter construction data set, the s2 and the s3 form training data in a training data set, and the s4 forms test data in the training data set.

In one embodiment, initializing a pre-standard bloom filter includes:

determining the bit number m of the pre-standard bloom filter according to the data volume n of the pre-filter construction data set and the expected misjudgment rate F_P;

determining the number k of hash rounds at which the false positive rate reaches its minimum according to the bit number m and the data volume n of the pre-filter construction data set;

and calling a constructor of a pre-standard bloom filter library function by taking the bit number m and the hash round number k as parameters, and initializing the pre-standard bloom filter.

In one embodiment, hash-mapping the pre-filter construction data set into the initialized pre-standard bloom filter to construct the pre-filter comprises:

respectively performing k rounds of hash calculation on each URL in the pre-filter construction dataset to obtain index values of k storage positions, and setting the values of the k storage positions to be 1;

wherein the hash calculation for the URL in the i-th round comprises:

when i = 1, calculating a hash value of the URL, exchanging the upper 15 bits and the lower 15 bits of the hash value, and selecting the first k upper bits to perform modulo calculation on the bit number m to obtain the index value of the storage position corresponding to the i-th round;

and when i is greater than or equal to 2, calculating the hash value of the URL, accumulating the hash value of the URL obtained in the i-th round and the hash value of the URL obtained in the (i-1)-th round to obtain an accumulated hash value, exchanging the upper 15 bits and the lower 15 bits of the accumulated hash value, and selecting the first k upper bits to perform modulo calculation on the bit number m to obtain the index value of the storage position corresponding to the i-th round.

In one embodiment, training the classifier multiple times based on the training data set to obtain a trained classifier, the classifier comprising a convolutional neural network CNN and a recurrent neural network GRU, comprises:

the training data set comprises training data and test data;

inputting training data into a convolutional neural network CNN, and outputting a pooling sequence matrix;

inputting the pooling sequence matrix into a recurrent neural network GRU, and outputting a hidden state sequence;

converting the hidden state sequence into classification probability by adopting a Softmax function, judging the URL to be a benign URL when the classification probability is greater than a threshold value of a classifier, and judging the URL to be a malicious URL when the classification probability is smaller than the threshold value of the classifier;

and inputting the test data into a classifier, when the loss function is minimum, taking the corresponding classification probability as a classifier threshold, and finishing training to obtain the trained classifier.

In one embodiment, inputting training data to a convolutional neural network CNN, outputting a pooled sequence matrix, comprising:

obtaining a vector representation of the training data according to the GloVe vocabulary file to form an input matrix;

and performing dimension reduction processing on the input matrix by adopting a word embedding method to obtain a pooling sequence matrix.

In one embodiment, initializing a post-standard bloom filter comprises:

determining the bit number m' of the post-standard bloom filter according to the data quantity n' of the data misjudged as malicious URLs by the trained classifier and the expected misjudgment rate F_P';

determining the number k' of hash rounds at which the false positive rate reaches its minimum according to the bit number m' and the data quantity n';

and calling a constructor of a post-standard bloom filter library function by taking the bit number m' and the hash round number k' as parameters, and initializing the post-standard bloom filter.

In one embodiment, hash-mapping the data misjudged as malicious URLs by the trained classifier into the initialized post-standard bloom filter to construct the post-filter comprises:

respectively performing k' rounds of hash calculation on each URL in the data misjudged as malicious URLs by the trained classifier to obtain index values of k' storage positions, and setting the values of the k' storage positions to 1;

when i = 1, calculating a hash value of the URL, exchanging the upper 15 bits and the lower 15 bits of the hash value, and selecting the first k' upper bits to perform modulo calculation on the bit number m' to obtain the index value of the storage position corresponding to the i-th round;

and when i is greater than or equal to 2, calculating the hash value of the URL, accumulating the hash value of the URL obtained in the i-th round and the hash value of the URL obtained in the (i-1)-th round to obtain an accumulated hash value, exchanging the upper 15 bits and the lower 15 bits of the accumulated hash value, and selecting the first k' upper bits to perform modulo calculation on the bit number m' to obtain the index value of the storage position corresponding to the i-th round.

In a second aspect, a multi-layer malicious URL identification method based on a learning-type bloom filter is provided, which includes:

the method adopts a malicious URL identification model for identification, wherein the malicious URL identification model comprises a prefilter, a classifier and a post-filter, and is obtained by applying the malicious URL model construction method of any one of claims 1 to 8;

inputting the URL to be identified into a pre-filter, and judging the URL to be identified as a benign URL or a malicious URL;

if the pre-filter judges the URL to be identified to be a benign URL, inputting the URL to be identified into the classifier, and judging the URL to be identified to be a benign URL or a malicious URL;

and if the classifier judges the URL to be identified to be a malicious URL, inputting the URL to be identified into the post-filter, and judging the URL to be identified to be a benign URL or a malicious URL.

In a third aspect, an apparatus for building a malicious URL identification model is provided, including:

the data set dividing module is used for dividing the URL data set into a pre-filter construction data set and a training data set;

the pre-filter construction module is used for initializing a pre-standard bloom filter, and mapping the pre-filter construction data set into the initialized pre-standard bloom filter in a hash mode to construct a pre-filter;

the classifier training module is used for training the classifier multiple times based on the training data set to obtain the trained classifier; the classifier comprises a convolutional neural network CNN and a recurrent neural network GRU;

the post-filter construction module is used for initializing a post-standard bloom filter, and hash-mapping the data which is misjudged as the malicious URL by the trained classifier into the initialized post-standard bloom filter to construct the post-filter;

and the malicious URL recognition model forming module is used for forming a malicious URL recognition model by the pre-filter, the trained classifier and the post-filter.

Compared with the prior art, the method has the following beneficial effects:

1. The application introduces a pre-filter before the classifier so as to exclude most malicious URLs; the classifier then reduces the false negative rate, and the post-filter finally ensures that the false negative rate is 0. Using the pre-filter reduces the false positive probability of the classifier, enhances the robustness of the model, and further reduces the space overhead of the post-filter.

2. The method introduces a deep learning technology, can fully utilize the distribution information of the indexed URL data, automatically extracts features, constructs a malicious URL detection model, and reduces the space overhead by 15% compared with a malicious URL detection method based on a black-and-white list technology under the condition that the given misjudgment rate is 1%.

Drawings

The present application may be better understood by reference to the following description taken in conjunction with the accompanying drawings, which are incorporated in and form a part of this specification, along with the detailed description below. In the drawings:

FIG. 1 is a flowchart illustrating a method for constructing a malicious URL identification model according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram illustrating vector representation of URLs in training data according to an embodiment of the present application;

FIG. 3 is a flowchart illustrating a multi-layer malicious URL identification method based on a learning-type bloom filter according to an embodiment of the present disclosure;

FIG. 4 shows a block diagram of a malicious URL model building apparatus according to an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual embodiment are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions may be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.

Here, it should be further noted that, in order to avoid obscuring the present application with unnecessary details, only the device structure closely related to the solution according to the present application is shown in the drawings, and other details not so related to the present application are omitted.

It is to be understood that the application is not limited to the described embodiments, since the description proceeds with reference to the drawings. In this context, embodiments may be combined with each other, features may be replaced or borrowed between different embodiments, one or more features may be omitted in one embodiment, where feasible.

Aiming at the problems of high false judgment rate and high space overhead in the existing malicious URL identification method, the application provides a multi-layer malicious URL identification method based on a learning type bloom filter.

Fig. 1 shows a flow chart of a method for constructing a malicious URL identification model according to an embodiment of the present application, where the method includes:

Step S11, dividing the URL data set into a pre-filter construction data set and a training data set;

specifically, in this step, the URL data set includes benign URLs and malicious URLs; the benign URLs are divided at a ratio of 3:7 into two parts, s1 and s2, and the malicious URLs are divided at a ratio of 9:1 into two parts, s3 and s4, with s1 constituting the pre-filter construction data set, s2 and s3 constituting the training data in the training data set, and s4 constituting the test data in the training data set.
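The split in step S11 can be sketched as follows. This is a minimal illustration only: the function name `split_url_dataset`, the dictionary keys, and the seeded shuffle before splitting are assumptions made for the example; the text specifies only the 3:7 and 9:1 ratios and the roles of s1 to s4.

```python
import random


def split_url_dataset(benign, malicious, benign_parts=(3, 7),
                      malicious_parts=(9, 1), seed=0):
    """Split URLs into s1..s4 using the 3:7 and 9:1 ratios of step S11.

    The seeded shuffle is an assumption; the text does not say how the
    URLs are ordered before they are divided.
    """
    rng = random.Random(seed)
    benign = list(benign)
    malicious = list(malicious)
    rng.shuffle(benign)
    rng.shuffle(malicious)

    # Integer arithmetic avoids floating-point rounding at the cut points.
    cut_b = len(benign) * benign_parts[0] // sum(benign_parts)
    cut_m = len(malicious) * malicious_parts[0] // sum(malicious_parts)

    s1, s2 = benign[:cut_b], benign[cut_b:]
    s3, s4 = malicious[:cut_m], malicious[cut_m:]
    return {
        "prefilter_build": s1,   # s1: builds the pre-filter
        "train": s2 + s3,        # s2 + s3: training data
        "test": s4,              # s4: test data
    }
```

With ten benign and ten malicious URLs, this yields 3 URLs for the pre-filter, 16 for training, and 1 for testing.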

Step S12, initializing a pre-standard bloom filter, and mapping the pre-filter construction data set into the initialized pre-standard bloom filter in a hash mode to construct a pre-filter;

specifically, initializing a pre-standard bloom filter includes:

determining the bit number m of the pre-standard bloom filter according to the data volume n of the pre-filter construction data set and the expected misjudgment rate F_P; here, the bit number m of the pre-standard bloom filter may be determined using the following formula:

determining the number k of hash rounds at which the false positive rate reaches its minimum according to the bit number m and the data volume n of the pre-filter construction data set; here, the hash round number k may be determined using the following equation:
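The formulas themselves are not reproduced in this text. A minimal sketch, assuming the application uses the standard Bloom filter sizing relations m = -n ln(F_P) / (ln 2)^2 and k = (m / n) ln 2, is given below; the function name `bloom_parameters` and the rounding choices (ceiling for m, nearest integer for k) are assumptions for illustration.

```python
import math


def bloom_parameters(n, fp_rate):
    """Return (m, k) for n items and a target false positive rate.

    Assumes the standard sizing relations:
        m = -n * ln(F_P) / (ln 2)^2    bits
        k = (m / n) * ln 2             hash rounds
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < fp_rate < 1.0:
        raise ValueError("fp_rate must lie strictly between 0 and 1")
    m = math.ceil(-n * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k
```

For example, 1000 URLs at an expected misjudgment rate of 1% give `bloom_parameters(1000, 0.01) == (9586, 7)`, that is, roughly 9.6 bits per URL.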

Specifically, hash mapping the pre-filter construction data set into an initialized pre-standard bloom filter, and constructing the pre-filter, includes:

respectively performing k rounds of hash calculation on each URL in the pre-filter construction data set to obtain index values of k storage positions, and setting the values of the k storage positions to 1. The index values determine the positions of the storage positions. All storage positions in the initialized pre-standard bloom filter have the value 0; once the index values of the k storage positions have been calculated, the values of those k storage positions are set to 1. If a storage position was already set to 1 in one of the previous i-1 rounds and the i-th round yields the index value of that storage position again, its value is not modified. The hash calculation in the i-th round is as follows:

when i = 1, calculating the hash value of the URL by adopting the MurmurHash2 hash algorithm, exchanging the upper 15 bits and the lower 15 bits of the hash value, and selecting the first k upper bits to perform modulo calculation on the bit number m to obtain the index value of the storage position corresponding to the i-th round;

and when i is greater than or equal to 2, calculating the hash value of the URL by adopting the MurmurHash2 hash algorithm, accumulating the hash value of the URL obtained in the i-th round and the hash value of the URL obtained in the (i-1)-th round to obtain an accumulated hash value, exchanging the upper 15 bits and the lower 15 bits of the accumulated hash value, and selecting the first k upper bits to perform modulo calculation on the bit number m to obtain the index value of the storage position corresponding to the i-th round.
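One possible reading of the per-round index calculation is sketched below. Several points are assumptions, because the text leaves them open: the hash is taken to be the 32-bit MurmurHash2 variant, the round number i is used as the hash seed so that rounds differ, bits 15 and 16 of the 32-bit word stay in place when the upper and lower 15 bits are exchanged, and the whole exchanged word (rather than a selection of its upper bits) is reduced modulo m. The names `murmurhash2_32`, `swap_15`, `bloom_indices` and `add_url` are illustrative.

```python
MASK32 = 0xFFFFFFFF


def murmurhash2_32(data, seed=0):
    """Pure-Python 32-bit MurmurHash2 over a bytes object."""
    m, r = 0x5BD1E995, 24
    n = len(data)
    h = (seed ^ n) & MASK32
    body = n - n % 4
    for i in range(0, body, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & MASK32
        k ^= k >> r
        k = (k * m) & MASK32
        h = ((h * m) & MASK32) ^ k
    tail = n - body
    if tail == 3:
        h ^= data[body + 2] << 16
    if tail >= 2:
        h ^= data[body + 1] << 8
    if tail >= 1:
        h ^= data[body]
        h = (h * m) & MASK32
    h ^= h >> 13
    h = (h * m) & MASK32
    h ^= h >> 15
    return h


def swap_15(h):
    """Exchange the upper 15 and lower 15 bits of a 32-bit value."""
    low = h & 0x7FFF
    high = (h >> 17) & 0x7FFF
    middle = h & 0x18000          # bits 15-16 are left where they are
    return (low << 17) | middle | high


def bloom_indices(url, k, m):
    """Index values of the k storage positions for one URL."""
    indices, previous = [], 0
    for i in range(1, k + 1):
        h = murmurhash2_32(url.encode("utf-8"), seed=i)  # assumed seeding
        acc = h if i == 1 else (h + previous) & MASK32   # rounds i >= 2 accumulate
        indices.append(swap_15(acc) % m)
        previous = h
    return indices


def add_url(bits, url, k, m):
    """Set the k storage positions to 1; positions already at 1 stay at 1."""
    for index in bloom_indices(url, k, m):
        bits[index] = 1
```

Here `bits` is a `bytearray(m)` holding one byte per storage position, which keeps the sketch short; a packed bit array would be used where the space overhead matters.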

Step S13, training the classifier multiple times based on the training data set to obtain the trained classifier; the classifier comprises a convolutional neural network CNN and a recurrent neural network GRU;

Step S14, initializing a post-standard bloom filter, and hash-mapping the data which is misjudged as malicious URLs by the trained classifier into the initialized post-standard bloom filter to construct the post-filter;

specifically, initializing a post-standard bloom filter includes:

determining the bit number m' of the post-standard bloom filter according to the data quantity n' of the data misjudged as malicious URLs by the trained classifier and the expected misjudgment rate F_P', where n' = nF_n and F_n is the probability that the classifier misjudges data as a malicious URL; here, the bit number m' may be determined using the following formula:

determining the number k' of hash rounds at which the false positive rate reaches its minimum according to the bit number m' and the data quantity n'; here, the hash round number k' may be determined using the following equation:
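To illustrate why a post-filter that stores only the misjudged data stays small, the arithmetic below works through one case. It assumes the standard Bloom filter sizing relations (the formulas are not reproduced in this text) and purely hypothetical values n = 10^6, F_n = 0.05 and F_P' = 0.01; none of these numbers come from the application.

```latex
\begin{aligned}
n' &= n F_n = 10^{6} \times 0.05 = 5 \times 10^{4} \\
m' &= -\frac{n' \ln F_P'}{(\ln 2)^{2}}
    \approx \frac{5 \times 10^{4} \times 4.6052}{0.4805}
    \approx 4.79 \times 10^{5}\ \text{bits} \\
k' &= \frac{m'}{n'} \ln 2 \approx 9.585 \times 0.6931 \approx 6.64
    \;\Rightarrow\; k' = 7
\end{aligned}
```

Under these assumptions the post-filter needs about 60 KB, whereas a filter over all 10^6 entries at the same rate would need about twenty times as much.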

Specifically, hash-mapping the data misjudged as malicious URLs by the trained classifier into the initialized post-standard bloom filter to construct the post-filter includes:

when i = 1, calculating a hash value of the URL by adopting the MurmurHash2 hash algorithm, exchanging the upper 15 bits and the lower 15 bits of the hash value, and selecting the first k' upper bits to perform modulo calculation on the bit number m' to obtain the index value of the storage position corresponding to the i-th round;

and when i is greater than or equal to 2, calculating the hash value of the URL by adopting the MurmurHash2 hash algorithm, accumulating the hash value of the URL obtained in the i-th round and the hash value of the URL obtained in the (i-1)-th round to obtain an accumulated hash value, exchanging the upper 15 bits and the lower 15 bits of the accumulated hash value, and selecting the first k' upper bits to perform modulo calculation on the bit number m' to obtain the index value of the storage position corresponding to the i-th round.

Step S15, forming a malicious URL recognition model from the pre-filter, the trained classifier and the post-filter.

In the above embodiment of the present application, a pre-filter is introduced before the classifier, so as to exclude most of the malicious URLs; then, the false negative rate is reduced by the classifier, and finally, the false negative rate is ensured to be 0 by using a post-filter; the false positive probability of the classifier can be reduced by using the pre-filter, the robustness of the model is enhanced, and the space overhead of the post-filter is further reduced; by introducing the deep learning technology, the distribution information of the indexed URL data can be fully utilized, the characteristics can be automatically extracted, the malicious URL detection model is constructed, and under the condition that the given misjudgment rate is 1%, the space overhead is reduced by 15% compared with that of a malicious URL detection method based on the black-and-white list technology.

In one embodiment, in step S13, training the classifier multiple times based on the training data set to obtain a trained classifier, the classifier comprising a convolutional neural network CNN and a recurrent neural network GRU, comprises:

the training data set comprises training data and test data;

Step S131, inputting training data into a convolutional neural network (CNN) and outputting a pooling sequence matrix;

in this step, the following method can be adopted:

obtaining a vector representation of the training data according to the GloVe vocabulary file to form an input matrix; here, each URL in the training data is deconstructed into a combined sample of letters, numbers, and special symbols; character string interception and character filling are performed on the combined sample to obtain a coded value; the coded value is input into an embedding layer for an embedding operation to obtain the vector representation of the URL; the vector representations of all the URLs in the training data form the input matrix.

Performing character string interception and character filling on the combined sample specifically includes: defining the selected URL character length as L and comparing the length of the character string in the combined sample with L; when the length of the character string is greater than or equal to L, only the first L characters of the character string are kept, and when the length of the character string is less than L, the character string is padded from its end up to L characters. Here, L may be 50. Fig. 2 shows a schematic diagram of the vector representation of a URL in the training data according to an embodiment of the present application. Taking the URL "www.queucosm.bid" as an example, the URL is first deconstructed into a combination of letters, numbers and special symbols, and the length of the character string is then calculated to be 17, which is less than 50, so a padding operation is performed: 0 is filled in from the end of the character string up to 50 characters. Finally, the feature vector corresponding to each character is obtained in sequence according to the GloVe vocabulary file, thereby obtaining the vectorized representation of the URL.
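The interception and filling step can be sketched as below. The character vocabulary `VOCAB`, the id reserved for unknown characters, and the lower-casing of the URL are assumptions for illustration; in the described method the per-character vectors come from the GloVe vocabulary file, which this sketch does not load.

```python
import string

MAX_LEN = 50  # L in the text

# Hypothetical character vocabulary; id 0 is reserved for padding.
VOCAB = {ch: i + 1 for i, ch in enumerate(
    string.ascii_lowercase + string.digits + string.punctuation)}
UNKNOWN_ID = len(VOCAB) + 1


def encode_url(url, max_len=MAX_LEN):
    """Truncate a URL to max_len characters, or pad it with 0 at the end."""
    chars = url.lower()[:max_len]                 # keep only the first L characters
    ids = [VOCAB.get(ch, UNKNOWN_ID) for ch in chars]
    ids.extend([0] * (max_len - len(ids)))        # fill 0 from the end up to L
    return ids
```

For the hypothetical URL `example.com/login` (17 characters), the result has 17 non-zero ids followed by 33 zeros; each id would then be looked up in the embedding layer to obtain the 50-dimensional character vector.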

Then, dimension reduction processing is performed on the input matrix by a word embedding method to obtain the pooling sequence matrix.

The input matrix is input into a convolutional layer, and a convolution operation is performed on the input matrix by the convolution kernels in the convolutional layer to extract features; the extracted features are respectively input into each pooling layer connected to the convolutional layer, and the maximum feature value generated by the convolution kernel is retained in each pooling layer by the max pooling method; the maximum feature values output by all the pooling layers are spliced to obtain the pooling sequence matrix.

Step S132, inputting the pooling sequence matrix into a recurrent neural network GRU, and outputting a hidden state sequence;

Step S133, converting the hidden state sequence into a classification probability by adopting a Softmax function, judging the URL to be a benign URL when the classification probability is greater than the classifier threshold, and judging the URL to be a malicious URL when the classification probability is less than the classifier threshold;

Step S134, inputting the test data into the classifier, taking the classification probability corresponding to the minimum of the loss function as the classifier threshold, and finishing training to obtain the trained classifier.
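The decision in step S133 can be sketched in a few lines. The two-element score order (benign first, malicious second) and the treatment of a probability exactly equal to the threshold as malicious are assumptions; the text only covers the greater-than and less-than cases. The function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def judge(scores, threshold):
    """Judge a URL from its (benign, malicious) scores; order is assumed."""
    p_benign = softmax(scores)[0]
    # Equality is treated as malicious here, which the text does not specify.
    return "benign" if p_benign > threshold else "malicious"
```

For scores `[2.0, 0.0]` the benign probability is about 0.88, so a threshold of 0.5 gives `"benign"`, while `[0.0, 2.0]` gives `"malicious"`.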

In this embodiment, the Adam method is used to train and optimize the classifier: the initial learning rate is set to 0.005, the learning rate decay rate is set to 0.001, the word vector dimension is set to 50, the batch size (batch_size) is 8192, and the number of epochs is 3; the remaining parameters, such as the weights and biases, change continuously as the classifier is optimized.

The present application further provides a multi-layer malicious URL identification method based on a learning-type bloom filter. Fig. 3 shows a schematic flowchart of the method according to an embodiment of the present application; the method includes:

a malicious URL identification model is used for identification, wherein the malicious URL identification model comprises a pre-filter, a classifier and a post-filter, and is obtained by applying the malicious URL model construction method of the above embodiment;

inputting the URL to be identified into a pre-filter, and judging the URL to be identified as a benign URL or a malicious URL; here, the pre-filter judges the URL to be identified as a benign URL and includes data which misjudges a malicious URL as a benign URL;

if the URL to be identified is a benign URL, inputting the URL to be identified into a classifier, and judging that the URL to be identified is the benign URL or a malicious URL; here, the classifier determines the URL to be identified as a benign URL or a malicious URL, both of which include misjudged data.
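One reading of the three-stage identification flow (pre-filter, then classifier, then post-filter) is sketched below. The sketch makes several assumptions: `SimpleBloom` uses SHA-256 from the standard library as a stand-in for the MurmurHash2-based scheme of the application, the classifier is any callable that returns True for benign, and the filter sizes are arbitrary. All names are illustrative.

```python
import hashlib


class SimpleBloom:
    """Small Bloom filter; SHA-256 is a stand-in hash, not the patent's scheme."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per storage position

    def _indices(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for index in self._indices(item):
            self.bits[index] = 1

    def __contains__(self, item):
        return all(self.bits[index] for index in self._indices(item))


def build_filters(benign_urls, classifier, m=4096, k=4):
    """Pre-filter holds the benign URLs; post-filter holds those the
    classifier misjudges as malicious."""
    pre, post = SimpleBloom(m, k), SimpleBloom(m, k)
    for url in benign_urls:
        pre.add(url)
        if not classifier(url):
            post.add(url)
    return pre, post


def identify(url, pre, classifier, post):
    if url not in pre:                    # pre-filter: certainly not a stored benign URL
        return "malicious"
    if classifier(url):                   # classifier judges benign
        return "benign"
    return "benign" if url in post else "malicious"   # post-filter check
```

With a stub classifier such as `lambda url: url.endswith(".org")`, every URL used to build the filters is identified as benign, including those the stub misjudges, which is the sense in which the post-filter brings the false negative rate to 0.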

Based on the same inventive concept as the malicious URL model construction method, the embodiment of the present application further provides a malicious URL model construction apparatus, and fig. 4 shows a block diagram of a structure of the malicious URL model construction apparatus according to the embodiment of the present application, the apparatus includes:

a data set dividing module 41, configured to divide the URL data set into a pre-filter construction data set and a training data set;

a pre-filter construction module 42, configured to initialize a pre-standard bloom filter, hash-map a pre-filter construction data set to the initialized pre-standard bloom filter, and construct a pre-filter;

a classifier training module 43, configured to train a classifier multiple times based on a training data set to obtain a trained classifier; the classifier comprises a convolutional neural network CNN and a recurrent neural network GRU;

a post-filter construction module 44, configured to initialize a post-standard bloom filter, hash-map the data misjudged as malicious URLs by the trained classifier into the initialized post-standard bloom filter, and construct the post-filter;

and a malicious URL identification model constructing module 45, configured to construct a malicious URL identification model from the pre-filter, the trained classifier, and the post-filter.

In this embodiment, a pre-filter is introduced before the classifier to exclude most malicious URLs; the classifier then reduces the false negative rate, and the post-filter finally ensures that the false negative rate is 0. Using the pre-filter reduces the false positive probability of the classifier, enhances the robustness of the model, and further reduces the space overhead of the post-filter. By introducing deep learning technology, the distribution information of the indexed URL data can be fully utilized, features can be extracted automatically, and the malicious URL detection model can be constructed; given a misjudgment rate of 1%, the space overhead is reduced by 15% compared with that of the malicious URL detection method based on the black-and-white list technology.

The malicious URL model building device and the malicious URL model building method have the same inventive concept, the implementation functions of all modules are consistent with those of the malicious URL model building method, and detailed description is omitted here.

The above description is only for various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions are included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for constructing a malicious URL recognition model is characterized by comprising the following steps:

training the classifier for multiple times based on the training data set to obtain a trained classifier; the classifier comprises a convolutional neural network CNN and a recurrent neural network GRU;

initializing a post-standard bloom filter, mapping the data which is misjudged as the malicious URL by the trained classifier into the initialized post-standard bloom filter in a hash manner, and constructing the post-filter;

2. The method of claim 1, wherein partitioning the URL dataset into a pre-filter build dataset and a training dataset comprises:

the URL data set comprises benign URLs and malicious URLs, the benign URLs are divided into two parts according to a first set proportion, namely s1 and s2, the malicious URLs are divided into two parts according to a second set proportion, namely s3 and s4, the s1 forms the pre-filter construction data set, the s2 and the s3 form training data in the training data set, and the s4 forms test data in the training data set.

3. The method of claim 1, wherein initializing a pre-standard bloom filter comprises:

determining the bit number m of the pre-standard bloom filter according to the data size n of the pre-filter construction data set and the expected false positive rate F_P;

determining, according to the bit number m and the data size n of the pre-filter construction data set, the number k of hash rounds at which the false positive rate reaches its minimum;

and calling the constructor of the bloom filter library function with the bit number m and the hash round number k as parameters to initialize the pre-standard bloom filter.
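The parameter selection in claim 3 can be sketched with the textbook sizing formulas for a standard bloom filter, m = -n·ln(F_P)/(ln 2)² and k = (m/n)·ln 2. It is an assumption of this sketch that the method uses exactly these formulas; the function name `bloom_parameters` is hypothetical.

```python
import math


def bloom_parameters(n, false_positive_rate):
    """Return (m, k) for a standard bloom filter holding n items.

    Uses the textbook formulas (assumed here, not quoted from the claims):
        m = -n * ln(F_P) / (ln 2)^2
        k = (m / n) * ln 2
    """
    if n <= 0 or not 0.0 < false_positive_rate < 1.0:
        raise ValueError("n must be positive and 0 < false_positive_rate < 1")
    m = math.ceil(-n * math.log(false_positive_rate) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k
```

For example, at the 1% misjudgment rate mentioned in the description, these formulas give roughly 9.6 bits per stored URL and 7 hash rounds, independent of how many URLs are stored.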

4. The method of claim 3, wherein hash mapping the pre-filter construction data set into the initialized pre-standard bloom filter to construct a pre-filter comprises:

performing k rounds of hash calculation on each URL in the pre-filter construction data set to obtain index values of k storage positions, and setting the values of the k storage positions to be 1;

when i = 1, calculating the hash value of the URL, exchanging the upper 15 bits and the lower 15 bits of the hash value, and selecting the first k high bits to perform modulo calculation on the bit number m, to obtain the index value of the storage position corresponding to the ith round;

and when i is larger than or equal to 2, calculating the hash value of the URL, accumulating the hash value of the URL obtained in the ith round and the hash value of the URL obtained in the (i-1) th round to obtain an accumulated hash value, exchanging the upper 15 bits and the lower 15 bits of the accumulated hash value, selecting the first k high bits to perform modulo calculation on the bit number m, and calculating to obtain the index value of the storage position corresponding to the ith round.
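One possible reading of the index calculation in claim 4 is sketched below. Several points are assumptions, because the claim does not fix them: the hash value is treated as 32 bits wide (so the two middle bits stay in place when the upper and lower 15 bits are exchanged), the per-round hash is a CRC32 seeded with the round number (`round_hash` is hypothetical), and the "first k high bits" selection is not modeled, so the whole swapped value is reduced modulo m.

```python
import zlib

MASK32 = 0xFFFFFFFF


def swap_15_bits(x):
    """Exchange the upper 15 and lower 15 bits of a 32-bit value.

    Bits 15 and 16 (the middle two bits) are left where they are.
    """
    x &= MASK32
    upper = x >> 17        # bits 17..31 moved down to bits 0..14
    middle = x & 0x18000   # bits 15..16 unchanged
    lower = x & 0x7FFF     # bits 0..14
    return (lower << 17) | middle | upper


def round_hash(url, i):
    """Hypothetical per-round hash: CRC32 of the URL seeded with round i."""
    return zlib.crc32(url.encode("utf-8"), i) & MASK32


def bloom_indexes(url, m, k):
    """Return k storage-position indexes in [0, m) for one URL."""
    indexes = []
    prev = None
    for i in range(1, k + 1):
        h = round_hash(url, i)
        # Round 1 uses the hash directly; later rounds accumulate the
        # hash of round i with the hash of round i-1 (wrapped to 32 bits).
        value = h if prev is None else (h + prev) & MASK32
        indexes.append(swap_15_bits(value) % m)
        prev = h
    return indexes
```

Setting the bit at each returned index to 1 inserts the URL; checking that all k bits are 1 tests membership.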

5. The method of claim 1, wherein training the classifier multiple times based on the training data set to obtain a trained classifier, the classifier comprising a convolutional neural network CNN and a recurrent neural network GRU, comprises:

the training data set comprises training data and test data;

inputting the training data into the convolutional neural network CNN, and outputting a pooling sequence matrix;

inputting the pooling sequence matrix into the recurrent neural network GRU, and outputting a hidden state sequence;

and inputting the test data into the classifier, taking the classification probability at which the loss function is minimal as the classifier threshold, and finishing training to obtain the trained classifier.

6. The method of claim 5, wherein inputting the training data to the convolutional neural network CNN, outputting a pooled sequence matrix, comprises:

obtaining a vector representation of the training data according to the GloVe vocabulary file to form an input matrix;

and performing dimensionality reduction on the input matrix by adopting a word embedding method to obtain the pooling sequence matrix.

7. The method of claim 1, wherein initializing a post-standard bloom filter comprises:

determining the bit number m′ of the post-standard bloom filter according to the data size n′ of the data misjudged as malicious URLs by the trained classifier and the expected misjudgment rate F_P′;

determining, according to the bit number m′ and the data size n′, the number k′ of hash rounds at which the false positive rate reaches its minimum;

and calling the constructor of the bloom filter library function with the bit number m′ and the hash round number k′ as parameters to initialize the post-standard bloom filter.

8. The method of claim 7, wherein hashing the data misjudged by the trained classifier as a malicious URL into the initialized post-standard bloom filter, constructing a post-filter, comprises:

performing k′ rounds of hash calculation on each URL in the data misjudged as malicious URLs by the trained classifier to obtain index values of k′ storage positions, and setting the values of the k′ storage positions to 1;

when i = 1, calculating the hash value of the URL, exchanging the upper 15 bits and the lower 15 bits of the hash value, and selecting the first k′ high bits to perform modulo calculation on the bit number m′, to obtain the index value of the storage position corresponding to the ith round;

and when i is larger than or equal to 2, calculating the hash value of the URL, accumulating the hash value of the URL obtained in the ith round and the hash value of the URL obtained in the (i-1)th round to obtain an accumulated hash value, exchanging the upper 15 bits and the lower 15 bits of the accumulated hash value, and selecting the first k′ high bits to perform modulo calculation on the bit number m′, to obtain the index value of the storage position corresponding to the ith round.

9. A multi-layer malicious URL identification method based on a learning type bloom filter is characterized by comprising the following steps:

the method adopts a malicious URL identification model for identification, wherein the malicious URL identification model comprises a pre-filter, a classifier and a post-filter, and is obtained by applying the malicious URL model construction method of any one of claims 1 to 8;

inputting a URL to be identified into the pre-filter, and judging whether the URL to be identified is a benign URL or a malicious URL;

if the pre-filter judges the URL to be identified to be a benign URL, inputting the URL to be identified into the classifier, and judging whether the URL to be identified is a benign URL or a malicious URL;

and if the classifier judges the URL to be identified to be a malicious URL, inputting the URL to be identified into the post-filter, and judging whether the URL to be identified is a benign URL or a malicious URL.
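The three-stage decision flow of claim 9 can be sketched as follows. The sketch assumes that a hit in the pre-filter (built from benign URLs per claim 2) means "possibly benign" and a miss means "malicious", and that a hit in the post-filter overrides a malicious verdict from the classifier. `TinyBloom` is a hypothetical stand-in filter based on SHA-256 double hashing, not the hash scheme of claims 4 and 8, and `classify` stands for any callable that returns "benign" or "malicious".

```python
import hashlib


class TinyBloom:
    """Minimal bloom filter used only to make this sketch runnable."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for clarity

    def _indexes(self, url):
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for j in self._indexes(url):
            self.bits[j] = 1

    def __contains__(self, url):
        return all(self.bits[j] for j in self._indexes(url))


def identify(url, pre_filter, classify, post_filter):
    """Return "benign" or "malicious" following the flow of claim 9."""
    # Stage 1: a miss in the pre-filter is treated as malicious.
    if url not in pre_filter:
        return "malicious"
    # Stage 2: the classifier examines URLs the pre-filter let through.
    if classify(url) == "benign":
        return "benign"
    # Stage 3: the post-filter holds URLs the classifier misjudged as
    # malicious, so a hit restores the benign verdict.
    return "benign" if url in post_filter else "malicious"
```

Because every URL that the trained classifier misjudges as malicious is stored in the post-filter, such a URL always reaches a benign verdict at the third stage, which is how the flow avoids rejecting those URLs.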

10. A malicious URL recognition model building device, characterized by comprising:

the classifier training module is used for training a classifier multiple times based on the training data set to obtain the trained classifier; the classifier comprises a convolutional neural network CNN and a recurrent neural network GRU;

the post-filter construction module is used for initializing a post-standard bloom filter, mapping the data which is misjudged as the malicious URL by the trained classifier into the initialized post-standard bloom filter in a hash way, and constructing the post-filter;

and the malicious URL identification model forming module is used for forming a malicious URL identification model by the pre-filter, the trained classifier and the post-filter.