CN115941327A - Multilayer malicious URL identification method based on learning type bloom filter - Google Patents

Multilayer malicious URL identification method based on learning type bloom filter

Info

Publication number
CN115941327A
Authority
CN
China
Prior art keywords
url
filter
classifier
post
malicious
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211570363.XA
Other languages
Chinese (zh)
Inventor
孟虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University

Landscapes

  • Computer And Data Communications (AREA)

Abstract

The application relates to a multi-layer malicious URL identification method based on a learned Bloom filter and to a method for constructing a malicious URL identification model. The model construction method comprises: initializing a pre-stage standard Bloom filter and hash-mapping the pre-filter construction data set into it to construct a pre-filter; training a classifier multiple times on a training data set to obtain a trained classifier; and initializing a post-stage standard Bloom filter and hash-mapping into it the data that the trained classifier misjudges as malicious URLs to construct a post-filter. The pre-filter excludes most malicious URLs; the classifier then reduces the false negative rate; and the post-filter finally guarantees a false negative rate of 0. Using the pre-filter also lowers the false positive probability of the classifier, makes the model more robust, and further reduces the space overhead of the post-filter.

Description

Multilayer malicious URL identification method based on learning type bloom filter
Technical Field
The application relates to the technical field of network information security, and in particular to a multi-layer malicious URL identification method based on a learned Bloom filter.
Background
While the internet brings users rich resources and convenient services, its openness and anonymity also make it a platform for network attacks. Attacks on individuals pose serious threats to the security of private information and property, and attacks on national financial and government data platforms can cause irreparable losses. Among the many network security problems, attacks based on malicious web pages are especially numerous. Malicious web pages spread widely: a single page can be accessed by thousands of users within a few minutes. In the big data era, identifying malicious web pages accurately and quickly has become an urgent and challenging task.
At present, malicious URL filtering techniques fall into three main categories: blacklist/whitelist filtering, machine-learning-based malicious URL detection, and deep-learning-based malicious URL identification. Blacklist/whitelist filtering is essentially after-the-fact detection: a browser can use it to block a malicious web page from loading, but only for known entries, so it lags seriously behind new threats and easily misses them. Machine-learning-based detection requires manually extracted features and copes poorly with complex data sets. Deep-learning-based detection can extract features from the data itself, but because URLs keep growing in number and change rapidly, the model must be retrained on continually adjusted data to maintain classification accuracy, and training requires hardware such as graphics processors, which is costly.
Disclosure of Invention
In order to overcome at least one of the deficiencies in the prior art, embodiments of the present application provide a multi-layer malicious URL identification method based on a learned Bloom filter.
In a first aspect, a method for constructing a malicious URL identification model is provided, comprising:
dividing a URL data set into a pre-filter construction data set and a training data set;
initializing a pre-stage standard Bloom filter, and hash-mapping the pre-filter construction data set into the initialized pre-stage standard Bloom filter to construct a pre-filter;
training a classifier multiple times on the training data set to obtain a trained classifier, the classifier comprising a convolutional neural network (CNN) and a gated recurrent unit (GRU) recurrent neural network;
initializing a post-stage standard Bloom filter, and hash-mapping the data that the trained classifier misjudges as malicious URLs into the initialized post-stage standard Bloom filter to construct a post-filter;
wherein the pre-filter, the trained classifier and the post-filter together form the malicious URL identification model.
In one embodiment, dividing the URL data set into a pre-filter construction data set and a training data set comprises:
the URL data set comprises benign URLs and malicious URLs; the benign URLs are divided according to a first set ratio into two parts, s1 and s2, and the malicious URLs are divided according to a second set ratio into two parts, s3 and s4; s1 forms the pre-filter construction data set, s2 and s3 form the training data in the training data set, and s4 forms the test data in the training data set.
In one embodiment, initializing the pre-stage standard Bloom filter comprises:
determining the number of bits m of the pre-stage standard Bloom filter according to the data volume n of the pre-filter construction data set and the expected misjudgment rate F_P;
determining, according to the number of bits m and the data volume n of the pre-filter construction data set, the number of hash rounds k at which the false positive rate is minimal;
calling the constructor of a standard Bloom filter library with the number of bits m and the number of hash rounds k as parameters to initialize the pre-stage standard Bloom filter.
In one embodiment, hash-mapping the pre-filter construction data set into the initialized pre-stage standard Bloom filter to construct the pre-filter comprises:
performing k rounds of hash calculation on each URL in the pre-filter construction data set to obtain the index values of k storage positions, and setting the values of the k storage positions to 1;
wherein the hash calculation for the URL in the i-th round comprises:
when i = 1, calculating a hash value of the URL, exchanging the upper 15 bits and the lower 15 bits of the hash value, and selecting the first k upper bits and taking them modulo the number of bits m to obtain the index value of the storage position corresponding to the i-th round;
when i ≥ 2, calculating a hash value of the URL, adding the hash value obtained in the i-th round to the hash value obtained in the (i-1)-th round to obtain an accumulated hash value, exchanging the upper 15 bits and the lower 15 bits of the accumulated hash value, and selecting the first k upper bits and taking them modulo the number of bits m to obtain the index value of the storage position corresponding to the i-th round.
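As an illustration only, the bit-setting behaviour of a standard Bloom filter described above can be sketched in Python. The class name BloomFilter and the salted BLAKE2b hashing are assumptions made for this sketch; they are a stand-in and do not reproduce the hash rounds of the application.

```python
import hashlib


class BloomFilter:
    """Minimal standard Bloom filter with m bits and k hash rounds (illustrative)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # every storage position starts at 0

    def _indexes(self, url: str):
        # Assumed stand-in hashing: one salted BLAKE2b digest per round.
        for i in range(self.k):
            digest = hashlib.blake2b(
                url.encode("utf-8"), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def add(self, url: str) -> None:
        for idx in self._indexes(url):
            # Setting a position that is already 1 leaves it unchanged.
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, url: str) -> bool:
        # A URL is reported present only if all k positions are 1.
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(url))
```

Because bits are only ever set and never cleared, every URL that was added is always reported as present; only URLs that were never added can be misjudged.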
In one embodiment, training the classifier multiple times on the training data set to obtain the trained classifier, the classifier comprising a convolutional neural network (CNN) and a gated recurrent unit (GRU) recurrent neural network, comprises:
the training data set comprises training data and test data;
inputting the training data into the CNN, and outputting a pooled sequence matrix;
inputting the pooled sequence matrix into the GRU, and outputting a hidden state sequence;
converting the hidden state sequence into a classification probability using a Softmax function, judging the URL to be a benign URL when the classification probability is greater than a classifier threshold, and judging the URL to be a malicious URL when the classification probability is less than the classifier threshold;
inputting the test data into the classifier, taking the classification probability at which the loss function is minimal as the classifier threshold, and ending training to obtain the trained classifier.
In one embodiment, inputting the training data into the CNN and outputting a pooled sequence matrix comprises:
obtaining vector representations of the training data according to a GloVe vocabulary file to form an input matrix;
performing dimension reduction on the input matrix using a word embedding method to obtain the pooled sequence matrix.
In one embodiment, initializing the post-stage standard Bloom filter comprises:
determining the number of bits m' of the post-stage standard Bloom filter according to the data volume n' of the data that the trained classifier misjudges as malicious URLs and the expected misjudgment rate F_P';
determining, according to the number of bits m' and the data volume n', the number of hash rounds k' at which the false positive rate is minimal;
calling the constructor of a standard Bloom filter library with the number of bits m' and the number of hash rounds k' as parameters to initialize the post-stage standard Bloom filter.
In one embodiment, hash-mapping the data that the trained classifier misjudges as malicious URLs into the initialized post-stage standard Bloom filter to construct the post-filter comprises:
performing k' rounds of hash calculation on each URL in the data that the trained classifier misjudges as malicious URLs to obtain the index values of k' storage positions, and setting the values of the k' storage positions to 1;
wherein the hash calculation for the URL in the i-th round comprises:
when i = 1, calculating a hash value of the URL, exchanging the upper 15 bits and the lower 15 bits of the hash value, and selecting the first k' upper bits and taking them modulo the number of bits m' to obtain the index value of the storage position corresponding to the i-th round;
when i ≥ 2, calculating a hash value of the URL, adding the hash value obtained in the i-th round to the hash value obtained in the (i-1)-th round to obtain an accumulated hash value, exchanging the upper 15 bits and the lower 15 bits of the accumulated hash value, and selecting the first k' upper bits and taking them modulo the number of bits m' to obtain the index value of the storage position corresponding to the i-th round.
In a second aspect, a multi-layer malicious URL identification method based on a learned Bloom filter is provided, comprising:
performing identification with a malicious URL identification model that comprises a pre-filter, a classifier and a post-filter and is obtained by applying the malicious URL model construction method of any one of claims 1 to 8;
inputting the URL to be identified into the pre-filter, which judges it to be a benign URL or a malicious URL;
if the pre-filter judges the URL to be identified to be a benign URL, inputting it into the classifier, which judges it to be a benign URL or a malicious URL;
if the classifier judges the URL to be identified to be a malicious URL, inputting it into the post-filter, which judges it to be a benign URL or a malicious URL.
In a third aspect, an apparatus for constructing a malicious URL identification model is provided, comprising:
a data set dividing module, configured to divide a URL data set into a pre-filter construction data set and a training data set;
a pre-filter construction module, configured to initialize a pre-stage standard Bloom filter and hash-map the pre-filter construction data set into the initialized pre-stage standard Bloom filter to construct a pre-filter;
a classifier training module, configured to train a classifier multiple times on the training data set to obtain a trained classifier, the classifier comprising a convolutional neural network (CNN) and a gated recurrent unit (GRU) recurrent neural network;
a post-filter construction module, configured to initialize a post-stage standard Bloom filter and hash-map the data that the trained classifier misjudges as malicious URLs into the initialized post-stage standard Bloom filter to construct a post-filter;
a malicious URL identification model forming module, configured to form the malicious URL identification model from the pre-filter, the trained classifier and the post-filter.
Compared with the prior art, the application has the following beneficial effects:
1. The application introduces a pre-filter before the classifier to exclude most malicious URLs; the classifier then reduces the false negative rate, and the post-filter finally guarantees a false negative rate of 0. Using the pre-filter lowers the false positive probability of the classifier, makes the model more robust, and further reduces the space overhead of the post-filter.
2. The application introduces deep learning, which makes full use of the distribution information of the indexed URL data and extracts features automatically to construct the malicious URL detection model; at a given misjudgment rate of 1%, the space overhead is 15% lower than that of a malicious URL detection method based on blacklist/whitelist filtering.
Drawings
The present application may be better understood by reference to the following description taken in conjunction with the accompanying drawings, which are incorporated in and form a part of this specification, along with the detailed description below. In the drawings:
FIG. 1 is a flow chart of a method for constructing a malicious URL identification model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the vector representation of a URL in the training data according to an embodiment of the present application;
FIG. 3 is a flow chart of a multi-layer malicious URL identification method based on a learned Bloom filter according to an embodiment of the present application;
FIG. 4 is a block diagram of a malicious URL model construction apparatus according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual embodiment are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions may be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Here, it should be further noted that, in order to avoid obscuring the present application with unnecessary details, only the device structure closely related to the solution according to the present application is shown in the drawings, and other details not so related to the present application are omitted.
It is to be understood that the application is not limited to the described embodiments, since the description proceeds with reference to the drawings. In this context, embodiments may be combined with each other, features may be replaced or borrowed between different embodiments, one or more features may be omitted in one embodiment, where feasible.
To address the high misjudgment rate and high space overhead of existing malicious URL identification methods, the application provides a multi-layer malicious URL identification method based on a learned Bloom filter.
FIG. 1 shows a flow chart of a method for constructing a malicious URL identification model according to an embodiment of the present application, where the method includes:
Step S11, dividing the URL data set into a pre-filter construction data set and a training data set;
specifically, in this step, the URL data set includes benign URLs and malicious URLs; the benign URLs are divided at a ratio of 3:7 into two parts, s1 and s2, and the malicious URLs are divided at a ratio of 9:1 into two parts, s3 and s4; s1 forms the pre-filter construction data set, s2 and s3 form the training data in the training data set, and s4 forms the test data in the training data set.
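As an illustration only, the division in step S11 can be sketched in Python. The function name split_url_dataset, the random shuffle and its seed are assumptions of this sketch; only the 3:7 and 9:1 ratios and the roles of s1 to s4 come from the description.

```python
import random


def split_url_dataset(benign, malicious, benign_ratio=0.3, malicious_ratio=0.9, seed=0):
    """Split benign URLs 3:7 into s1/s2 and malicious URLs 9:1 into s3/s4."""
    rng = random.Random(seed)          # assumed: shuffle before splitting
    benign = list(benign)
    malicious = list(malicious)
    rng.shuffle(benign)
    rng.shuffle(malicious)

    cut_b = round(len(benign) * benign_ratio)
    cut_m = round(len(malicious) * malicious_ratio)
    s1, s2 = benign[:cut_b], benign[cut_b:]
    s3, s4 = malicious[:cut_m], malicious[cut_m:]

    return {
        "prefilter": s1,       # pre-filter construction data set
        "train": s2 + s3,      # training data in the training data set
        "test": s4,            # test data in the training data set
    }
```

With 10 benign and 10 malicious URLs, this yields 3 URLs for the pre-filter, 16 training URLs and 1 test URL.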
Step S12, initializing a pre-stage standard Bloom filter, and hash-mapping the pre-filter construction data set into the initialized pre-stage standard Bloom filter to construct a pre-filter;
specifically, initializing the pre-stage standard Bloom filter includes:
determining the number of bits m of the pre-stage standard Bloom filter according to the data volume n of the pre-filter construction data set and the expected misjudgment rate F_P; here, the number of bits m of the pre-stage standard Bloom filter may be determined using the following formula:
$$m = -\frac{n \ln F_P}{(\ln 2)^2}$$
determining, according to the number of bits m and the data volume n of the pre-filter construction data set, the number of hash rounds k at which the false positive rate is minimal; here, the number of hash rounds k may be determined using the following formula:
$$k = \frac{m}{n} \ln 2$$
calling the constructor of a standard Bloom filter library with the number of bits m and the number of hash rounds k as parameters to initialize the pre-stage standard Bloom filter.
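As an illustration only, the sizing of the pre-stage filter can be sketched in Python, assuming the standard Bloom filter sizing relations m = -n ln F_P / (ln 2)^2 and k = (m / n) ln 2. The function name bloom_parameters and the rounding choices (ceiling for m, nearest integer for k) are assumptions of this sketch.

```python
import math


def bloom_parameters(n: int, fp: float) -> tuple[int, int]:
    """Return (m, k) for n items and an expected misjudgment rate fp."""
    if n <= 0 or not 0.0 < fp < 1.0:
        raise ValueError("n must be positive and fp must lie in (0, 1)")
    m = math.ceil(-n * math.log(fp) / (math.log(2) ** 2))  # number of bits
    k = max(1, round(m / n * math.log(2)))                 # number of hash rounds
    return m, k


if __name__ == "__main__":
    m, k = bloom_parameters(1_000_000, 0.01)
    print(f"bits={m}, hash rounds={k}, bits per URL={m / 1_000_000:.2f}")
```

For one million URLs at a 1% misjudgment rate this gives roughly 9.6 bits per URL and 7 hash rounds.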
Specifically, hash-mapping the pre-filter construction data set into the initialized pre-stage standard Bloom filter to construct the pre-filter includes:
performing k rounds of hash calculation on each URL in the pre-filter construction data set to obtain the index values of k storage positions, and setting the values of those k storage positions to 1. The index values locate the storage positions. All storage positions in the initialized pre-stage standard Bloom filter hold 0; once the index values of the k storage positions have been calculated, the values at those positions are set to 1. If the i-th round yields the index value of a storage position that was already set to 1 in one of the previous i-1 rounds, the value of that storage position is left unchanged.
Wherein the hash calculation for the URL in the i-th round includes:
when i = 1, calculating the hash value of the URL using the MurmurHash2 hash algorithm, exchanging the upper 15 bits and the lower 15 bits of the hash value, and selecting the first k upper bits and taking them modulo the number of bits m to obtain the index value of the storage position corresponding to the i-th round;
when i ≥ 2, calculating the hash value of the URL using the MurmurHash2 hash algorithm, adding the hash value obtained in the i-th round to the hash value obtained in the (i-1)-th round to obtain an accumulated hash value, exchanging the upper 15 bits and the lower 15 bits of the accumulated hash value, and selecting the first k upper bits and taking them modulo the number of bits m to obtain the index value of the storage position corresponding to the i-th round.
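As an illustration only, the per-round index derivation can be sketched in Python under several stated assumptions. The hash is a salted BLAKE2b digest truncated to 32 bits, used as a stand-in because the Python standard library has no MurmurHash2; a 32-bit hash width and a different seed per round are assumed; the accumulation adds the raw hash of the previous round; and the whole swapped value is reduced modulo m, since the description does not make clear how the "first k upper bits" are selected. All function names are hypothetical.

```python
import hashlib

MASK32 = 0xFFFFFFFF


def hash32(url: str, seed: int) -> int:
    """Stand-in 32-bit hash (NOT MurmurHash2): salted BLAKE2b, 4-byte digest."""
    digest = hashlib.blake2b(
        url.encode("utf-8"), digest_size=4, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def swap_15_bits(h: int) -> int:
    """Exchange the upper 15 and lower 15 bits of a 32-bit value."""
    high = (h >> 17) & 0x7FFF    # bits 31..17
    middle = (h >> 15) & 0x3     # bits 16..15 stay in place
    low = h & 0x7FFF             # bits 14..0
    return (low << 17) | (middle << 15) | high


def round_indexes(url: str, m: int, k: int) -> list[int]:
    """Index of the storage position for each of the k rounds."""
    indexes = []
    previous = 0
    for i in range(1, k + 1):
        current = hash32(url, seed=i)
        # Round 1 uses the hash directly; later rounds add the previous round's hash.
        value = current if i == 1 else (current + previous) & MASK32
        indexes.append(swap_15_bits(value) % m)
        previous = current
    return indexes
```

The swap is its own inverse, so applying swap_15_bits twice returns the original value, which is a convenient sanity check for the bit manipulation.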
Step S13, training the classifier multiple times on the training data set to obtain the trained classifier, the classifier comprising a convolutional neural network (CNN) and a gated recurrent unit (GRU) recurrent neural network;
Step S14, initializing a post-stage standard Bloom filter, and hash-mapping the data that the trained classifier misjudges as malicious URLs into the initialized post-stage standard Bloom filter to construct the post-filter;
specifically, initializing the post-stage standard Bloom filter includes:
determining the number of bits m' of the post-stage standard Bloom filter according to the data volume n' of the data that the trained classifier misjudges as malicious URLs and the expected misjudgment rate F_P', where n' = n·F_n and F_n is the probability that the classifier misjudges data as a malicious URL; here, the number of bits m' may be determined using the following formula:
$$m' = -\frac{n' \ln F_P'}{(\ln 2)^2}$$
determining, according to the number of bits m' and the data volume n', the number of hash rounds k' at which the false positive rate is minimal; here, the number of hash rounds k' may be determined using the following formula:
$$k' = \frac{m'}{n'} \ln 2$$
calling the constructor of a standard Bloom filter library with the number of bits m' and the number of hash rounds k' as parameters to initialize the post-stage standard Bloom filter.
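As an illustration only, the effect of n' = n·F_n on the size of the post-filter can be sketched in Python, again assuming the standard Bloom filter sizing relations. The function name post_filter_parameters and the rounding choices are assumptions of this sketch.

```python
import math


def post_filter_parameters(n: int, f_n: float, fp_post: float) -> tuple[int, int, int]:
    """Return (n', m', k') for the post-filter.

    n       -- number of URLs presented to the classifier
    f_n     -- probability that the classifier misjudges data as malicious
    fp_post -- expected misjudgment rate F_P' of the post-filter
    """
    n_post = max(1, math.ceil(n * f_n))                                    # n' = n * F_n
    m_post = math.ceil(-n_post * math.log(fp_post) / (math.log(2) ** 2))   # bits m'
    k_post = max(1, round(m_post / n_post * math.log(2)))                  # rounds k'
    return n_post, m_post, k_post
```

Since m' grows linearly with n', a lower classifier misjudgment probability F_n translates directly into a smaller post-filter.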
Specifically, hash-mapping the data that the trained classifier misjudges as malicious URLs into the initialized post-stage standard Bloom filter to construct the post-filter includes:
performing k' rounds of hash calculation on each URL in the data that the trained classifier misjudges as malicious URLs to obtain the index values of k' storage positions, and setting the values of the k' storage positions to 1;
wherein the hash calculation for the URL in the i-th round includes:
when i = 1, calculating the hash value of the URL using the MurmurHash2 hash algorithm, exchanging the upper 15 bits and the lower 15 bits of the hash value, and selecting the first k' upper bits and taking them modulo the number of bits m' to obtain the index value of the storage position corresponding to the i-th round;
when i ≥ 2, calculating the hash value of the URL using the MurmurHash2 hash algorithm, adding the hash value obtained in the i-th round to the hash value obtained in the (i-1)-th round to obtain an accumulated hash value, exchanging the upper 15 bits and the lower 15 bits of the accumulated hash value, and selecting the first k' upper bits and taking them modulo the number of bits m' to obtain the index value of the storage position corresponding to the i-th round.
Step S15, forming the malicious URL identification model from the pre-filter, the trained classifier and the post-filter.
In the above embodiment of the present application, a pre-filter is introduced before the classifier to exclude most malicious URLs; the classifier then reduces the false negative rate, and the post-filter finally guarantees a false negative rate of 0. Using the pre-filter lowers the false positive probability of the classifier, makes the model more robust, and further reduces the space overhead of the post-filter. Introducing deep learning makes full use of the distribution information of the indexed URL data and extracts features automatically to construct the malicious URL detection model; at a given misjudgment rate of 1%, the space overhead is 15% lower than that of a malicious URL detection method based on blacklist/whitelist filtering.
In one embodiment, in step S13, training the classifier multiple times on the training data set to obtain the trained classifier, the classifier comprising a convolutional neural network (CNN) and a gated recurrent unit (GRU) recurrent neural network, includes:
the training data set comprises training data and test data;
Step S131, inputting the training data into the CNN and outputting a pooled sequence matrix;
in this step, the following method can be adopted:
obtaining vector representations of the training data according to a GloVe vocabulary file to form an input matrix; here, each URL in the training data is deconstructed into a combined sample of letters, digits and special symbols; string truncation and character padding are applied to the combined sample to obtain an encoded value; the encoded value is input into an embedding layer for the embedding operation to obtain the vector representation of the URL; and the vector representations of all URLs in the training data form the input matrix.
String truncation and character padding of the combined sample specifically include: defining the selected URL character length as L and comparing the length of the string in the combined sample with L; when the string length is greater than or equal to L, only the first L characters of the string are kept, and when the string length is less than L, the string is padded from its end up to L characters. Here, L may be 50. FIG. 2 shows a schematic diagram of the vector representation of a URL in the training data according to an embodiment of the present application. Taking the URL "www.queucosm.bid" as an example, the URL is first deconstructed into a combination of letters, digits and special symbols; the length of the string is then calculated to be 17, which is less than 50, so a padding operation is performed: zeros are appended from the end of the string up to 50 characters. Finally, the feature vector corresponding to each character is obtained in turn from the GloVe vocabulary file, which yields the vectorized representation of the URL.
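As an illustration only, the truncation and padding step can be sketched in Python. The character vocabulary (ASCII letters, digits and punctuation), the reserved padding id 0 and the unknown-character id are assumptions of this sketch; the description instead looks characters up in a GloVe vocabulary file.

```python
import string

MAX_LEN = 50  # L in the description

# Hypothetical character vocabulary; id 0 is reserved for padding.
ALPHABET = string.ascii_letters + string.digits + string.punctuation
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNKNOWN_ID = len(ALPHABET) + 1


def encode_url(url: str, max_len: int = MAX_LEN) -> list[int]:
    """Keep the first max_len characters, then pad with 0 at the end."""
    ids = [CHAR_TO_ID.get(ch, UNKNOWN_ID) for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    encoded = encode_url("www.queucosm.bid")
    print(len(encoded), encoded[:20])
```

Every URL is thus mapped to exactly L integer ids, which an embedding layer can turn into an L-row input matrix.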
Then, dimension reduction is performed on the input matrix using a word embedding method to obtain the pooled sequence matrix.
The input matrix is fed into a convolutional layer, whose convolution kernels perform convolution operations on it to extract features; the extracted features are fed into each pooling layer connected to the convolutional layer, and max pooling retains, in each pooling layer, the maximum feature value produced by the convolution kernel; the maximum feature values output by all pooling layers are concatenated to obtain the pooled sequence matrix.
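As an illustration only, convolution followed by max pooling and concatenation can be sketched in pure Python. The kernel shapes, the ReLU activation and the function names are assumptions of this sketch; a practical implementation would use a deep learning framework.

```python
def conv1d_valid(sequence, kernel):
    """Slide a kernel (w rows of d weights) over a sequence (T rows of d values)."""
    width = len(kernel)
    feature_map = []
    for t in range(len(sequence) - width + 1):
        acc = 0.0
        for j in range(width):
            acc += sum(x * c for x, c in zip(sequence[t + j], kernel[j]))
        feature_map.append(max(0.0, acc))  # ReLU activation (assumed)
    return feature_map


def pooled_features(sequence, kernels):
    """Max-pool each kernel's feature map and concatenate the maxima."""
    return [max(conv1d_valid(sequence, kernel)) for kernel in kernels]


if __name__ == "__main__":
    embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 characters, dimension 2
    kernels = [
        [[1.0, 0.0], [0.0, 1.0]],                     # width-2 kernel
        [[1.0, -1.0]],                                # width-1 kernel
    ]
    print(pooled_features(embedded, kernels))         # prints [2.0, 1.0]
```

Each kernel contributes one pooled value, so the length of the concatenated output equals the number of kernels.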
Step S132, inputting the pooled sequence matrix into the GRU and outputting a hidden state sequence;
Step S133, converting the hidden state sequence into a classification probability using a Softmax function, judging the URL to be a benign URL when the classification probability is greater than the classifier threshold, and judging the URL to be a malicious URL when the classification probability is less than the classifier threshold;
Step S134, inputting the test data into the classifier, taking the classification probability at which the loss function is minimal as the classifier threshold, and ending training to obtain the trained classifier.
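As an illustration only, the Softmax conversion and threshold decision of step S133 can be sketched in Python. The two-logit layout with the benign class at index 0, the function names, and the handling of a probability exactly equal to the threshold (treated here as malicious) are assumptions of this sketch.

```python
import math


def softmax(logits):
    """Numerically stable Softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, threshold, benign_index=0):
    """Benign if the benign-class probability exceeds the threshold."""
    p_benign = softmax(logits)[benign_index]
    return "benign" if p_benign > threshold else "malicious"


if __name__ == "__main__":
    print(classify([2.0, 0.0], threshold=0.5))  # benign-class probability is about 0.88
    print(classify([0.0, 2.0], threshold=0.5))  # benign-class probability is about 0.12
```

Raising the threshold sends more URLs to the post-filter, while lowering it lets more URLs pass as benign without that check.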
In this embodiment, the Adam method is used to train and optimize the classifier; the initial learning rate is set to 0.005, the learning rate decay rate to 0.001, the word vector dimension to 50, the batch size (batch_size) to 8192 and the number of epochs to 3, while the remaining parameters, such as the weights and biases, change continuously as the classifier is optimized.
The present application further provides a multi-layer malicious URL identification method based on a learned Bloom filter. FIG. 3 shows a flow chart of the method according to an embodiment of the present application, where the method includes:
performing identification with a malicious URL identification model that comprises a pre-filter, a classifier and a post-filter and is obtained by applying the malicious URL model construction method of the above embodiment;
inputting the URL to be identified into the pre-filter, which judges it to be a benign URL or a malicious URL; here, the URLs that the pre-filter judges to be benign include malicious URLs misjudged as benign;
if the pre-filter judges the URL to be identified to be a benign URL, inputting it into the classifier, which judges it to be a benign URL or a malicious URL; here, both the URLs that the classifier judges to be benign and those it judges to be malicious include misjudged data;
if the classifier judges the URL to be identified to be a malicious URL, inputting it into the post-filter, which judges it to be a benign URL or a malicious URL.
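As an illustration only, the three-stage decision flow can be sketched in Python. The function name identify_url and the use of plain sets and a callable as stand-ins for the two filters and the classifier are assumptions of this sketch; it also assumes that a URL rejected by the pre-filter is reported as malicious without further checks, a branch the description does not spell out.

```python
from typing import Callable, Container


def identify_url(
    url: str,
    pre_filter: Container[str],
    classifier_is_benign: Callable[[str], bool],
    post_filter: Container[str],
) -> str:
    """Pre-filter -> classifier -> post-filter decision flow."""
    if url not in pre_filter:
        return "malicious"            # assumed: rejected by the pre-filter
    if classifier_is_benign(url):
        return "benign"               # accepted by the classifier
    # The classifier said malicious: the post-filter has the final word.
    return "benign" if url in post_filter else "malicious"


if __name__ == "__main__":
    pre = {"a.example", "b.example", "c.example"}     # stand-in for the pre-filter
    post = {"b.example"}                              # URLs the classifier misjudges
    is_benign = lambda u: u == "a.example"            # stand-in classifier
    for u in ["a.example", "b.example", "c.example", "d.example"]:
        print(u, identify_url(u, pre, is_benign, post))
```

In this example, b.example is rescued by the post-filter even though the stand-in classifier rejects it, which is how the post-filter removes the classifier's misjudgments of this kind.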
Based on the same inventive concept as the malicious URL model construction method, an embodiment of the present application further provides a malicious URL model construction apparatus. FIG. 4 shows a block diagram of the apparatus according to an embodiment of the present application, where the apparatus includes:
a data set dividing module 41, configured to divide the URL data set into a pre-filter construction data set and a training data set;
a pre-filter construction module 42, configured to initialize a pre-stage standard Bloom filter and hash-map the pre-filter construction data set into the initialized pre-stage standard Bloom filter to construct a pre-filter;
a classifier training module 43, configured to train a classifier multiple times on the training data set to obtain a trained classifier, the classifier comprising a convolutional neural network (CNN) and a gated recurrent unit (GRU) recurrent neural network;
a post-filter construction module 44, configured to initialize a post-stage standard Bloom filter and hash-map the data that the trained classifier misjudges as malicious URLs into the initialized post-stage standard Bloom filter to construct the post-filter;
a malicious URL identification model forming module 45, configured to form the malicious URL identification model from the pre-filter, the trained classifier and the post-filter.
In this embodiment, a pre-filter is introduced before the classifier to exclude most of the information about malicious URLs; then, the false negative rate is reduced by a classifier, and finally, the false negative rate is ensured to be 0 by using a post-filter; the false positive probability of the classifier can be reduced by using the pre-filter, the robustness of the model is enhanced, and the space overhead of the post-filter is further reduced; by introducing the deep learning technology, the distribution information of the indexed URL data can be fully utilized, the characteristics can be automatically extracted, the malicious URL detection model is constructed, and under the condition that the given misjudgment rate is 1%, the space overhead is reduced by 15% compared with that of the malicious URL detection method based on the black-and-white list technology.
The malicious URL model construction apparatus shares the inventive concept of the malicious URL model construction method, and the functions implemented by its modules are consistent with the corresponding steps of the method, so a detailed description is omitted here.
The above description is only for various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions are included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for constructing a malicious URL recognition model is characterized by comprising the following steps:
dividing the URL data set into a pre-filter construction data set and a training data set;
initializing a pre-standard bloom filter, and mapping the pre-filter construction data set into the initialized pre-standard bloom filter in a hash mode to construct a pre-filter;
training the classifier for multiple times based on the training data set to obtain a trained classifier; the classifier comprises a convolutional neural network CNN and a recurrent neural network GRU;
initializing a post-standard bloom filter, mapping the data which is misjudged as the malicious URL by the trained classifier into the initialized post-standard bloom filter in a hash manner, and constructing the post-filter;
and the pre-filter, the trained classifier and the post-filter form a malicious URL recognition model.
2. The method of claim 1, wherein partitioning the URL dataset into a pre-filter build dataset and a training dataset comprises:
the URL data set comprises benign URLs and malicious URLs, the benign URLs are divided into two parts according to a first set proportion, namely s1 and s2, the malicious URLs are divided into two parts according to a second set proportion, namely s3 and s4, the s1 forms the pre-filter construction data set, the s2 and the s3 form training data in the training data set, and the s4 forms test data in the training data set.
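The partition in claim 2 can be illustrated with a short Python sketch. The function names, the shuffling step, and the dictionary keys are assumptions made for illustration; the claim only fixes which parts (s1, s2, s3, s4) go into which data set, not how the split is drawn.

```python
import random

def split_by_ratio(items, ratio, seed=0):
    """Shuffle a copy of items and cut it into two parts at the given ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumption, not stated in the claim
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

def partition_url_dataset(benign, malicious, first_ratio, second_ratio, seed=0):
    """Return the three data sets named in claim 2."""
    s1, s2 = split_by_ratio(benign, first_ratio, seed)      # benign URLs -> s1, s2
    s3, s4 = split_by_ratio(malicious, second_ratio, seed)  # malicious URLs -> s3, s4
    return {
        "prefilter_build": s1,     # s1 forms the pre-filter construction data set
        "training_data": s2 + s3,  # s2 and s3 form the training data
        "test_data": s4,           # s4 forms the test data
    }
```

For example, `partition_url_dataset(benign, malicious, 0.5, 0.8)` would place half of the benign URLs in the pre-filter construction data set and 80% of the malicious URLs in the training data; both ratios here are hypothetical values.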
3. The method of claim 1, wherein initializing a pre-standard bloom filter comprises:
determining the bit number m of the pre-standard bloom filter according to the data size n of the pre-filter construction data set and the expected false positive rate F_P;
determining the number k of hash rounds when the false positive rate reaches the minimum according to the bit number m and the data size n in the data set constructed by the pre-filter;
and calling the constructor of the standard bloom filter library with the bit number m and the hash round number k as parameters to initialize the pre-standard bloom filter.
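Claim 3 does not state the formulas used for m and k. A minimal sketch, assuming the textbook sizing of a standard bloom filter, m = -n ln(F_P) / (ln 2)^2 and k = (m / n) ln 2 (the k that minimizes the false positive rate for given m and n), could look like the following. The function name and the rounding choices are illustrative assumptions.

```python
import math

def bloom_parameters(n, fp_rate):
    """Return (m, k) for n items and an expected false positive rate fp_rate.

    Assumes the standard bloom filter sizing formulas; the claim itself
    does not spell them out.
    """
    if n <= 0 or not (0.0 < fp_rate < 1.0):
        raise ValueError("n must be positive and fp_rate must lie in (0, 1)")
    # Bit number m = -n * ln(F_P) / (ln 2)^2, rounded up to a whole bit.
    m = math.ceil(-n * math.log(fp_rate) / (math.log(2) ** 2))
    # Hash round number k = (m / n) * ln 2, rounded to the nearest integer, at least 1.
    k = max(1, round(m / n * math.log(2)))
    return m, k
```

Under these assumptions, n = 1000 and F_P = 0.01 give m = 9586 bits and k = 7 hash rounds.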
4. The method of claim 3, wherein hash mapping the pre-filter build dataset into the initialized pre-standard bloom filter, building a pre-filter, comprises:
performing k rounds of hash calculation on each URL in the pre-filter construction data set to obtain index values of k storage positions, and setting the values of the k storage positions to be 1;
wherein the hash calculation for the URL in the i-th round comprises:
when i = 1, calculating the hash value of the URL, exchanging the upper 15 bits and the lower 15 bits of the hash value, selecting the first k high-order bits and taking them modulo the bit number m to obtain the index value of the storage position corresponding to the i-th round;
and when i ≥ 2, calculating the hash value of the URL, accumulating the hash value of the URL obtained in the i-th round and the hash value of the URL obtained in the (i-1)-th round to obtain an accumulated hash value, exchanging the upper 15 bits and the lower 15 bits of the accumulated hash value, selecting the first k high-order bits and taking them modulo the bit number m to obtain the index value of the storage position corresponding to the i-th round.
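The round structure of claim 4 can be sketched in Python as follows. Several details are assumptions, because the claim leaves them open: the hash is taken to be 32 bits wide (so the two bits between the upper 15 and lower 15 stay in place), the per-round hash is a hypothetical SHA-256-based stand-in, and the "first k high-order bits" selection is not reproduced because its exact meaning is unclear in the translation; the whole swapped value is reduced modulo m instead.

```python
import hashlib

MASK32 = 0xFFFFFFFF

def base_hash(url, round_index):
    """Hypothetical 32-bit hash: first 4 bytes of SHA-256 over round index and URL."""
    digest = hashlib.sha256(f"{round_index}:{url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def swap_high_low_15(value):
    """Exchange the upper 15 bits and lower 15 bits of a 32-bit value."""
    high = (value >> 17) & 0x7FFF    # bits 31..17
    low = value & 0x7FFF             # bits 14..0
    middle = value & (0x3 << 15)     # bits 16..15 stay where they are (assumption)
    return (low << 17) | middle | high

def index_positions(url, k, m):
    """Return the k storage-position indexes for url in a filter of m bits."""
    positions = []
    previous = 0
    for i in range(1, k + 1):
        current = base_hash(url, i)
        # Round 1 uses the hash directly; later rounds add the previous round's hash.
        value = current if i == 1 else (current + previous) & MASK32
        positions.append(swap_high_low_15(value) % m)
        previous = current
    return positions

def build_filter(urls, k, m):
    """Set the k positions of every URL to 1 (one byte per bit, for readability)."""
    bits = bytearray(m)
    for url in urls:
        for pos in index_positions(url, k, m):
            bits[pos] = 1
    return bits

def might_contain(bits, url, k):
    """True if all k positions of url are set; false positives are possible."""
    return all(bits[pos] for pos in index_positions(url, k, len(bits)))
```

A filter built with `build_filter(urls, k, m)` never reports an inserted URL as absent, which is the property the pre-filter and post-filter rely on.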
5. The method of claim 1, wherein training the classifier multiple times based on the training data set to obtain the trained classifier, the classifier comprising a convolutional neural network CNN and a recurrent neural network GRU, comprises:
the training data set comprises training data and test data;
inputting the training data into the convolutional neural network CNN, and outputting a pooling sequence matrix;
inputting the pooling sequence matrix into the recurrent neural network GRU, and outputting a hidden state sequence;
converting the hidden state sequence into classification probability by adopting a Softmax function, judging the URL to be a benign URL when the classification probability is greater than a threshold value of a classifier, and judging the URL to be a malicious URL when the classification probability is smaller than the threshold value of the classifier;
and inputting the test data into the classifier, and, when the loss function reaches its minimum, taking the corresponding classification probability as the classifier threshold and ending training to obtain the trained classifier.
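The final decision step of claim 5 (Softmax followed by a threshold comparison) can be shown with a small pure-Python sketch. It does not implement the CNN or GRU layers; the two-element score list, the position of the benign class, and the handling of a probability exactly equal to the threshold are assumptions, since the claim does not specify them.

```python
import math

def softmax(scores):
    """Numerically stable Softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def judge(scores, threshold, benign_index=0):
    """Judge a URL from the classifier's raw scores.

    A classification probability greater than the threshold means benign;
    otherwise the URL is judged malicious. Equality is treated as malicious
    here, which is an assumption.
    """
    p_benign = softmax(scores)[benign_index]
    return "benign" if p_benign > threshold else "malicious"
```

In this sketch the threshold would be the classification probability selected at the point where the loss on the test data is smallest, as the claim describes.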
6. The method of claim 5, wherein inputting the training data into the convolutional neural network CNN and outputting a pooling sequence matrix comprises:
obtaining vector representations of the training data according to the GloVe vocabulary file to form an input matrix;
and performing dimensionality reduction on the input matrix by adopting a word embedding method to obtain the pooling sequence matrix.
7. The method of claim 1, wherein initializing a post-standard bloom filter comprises:
determining the bit number m of the post-standard bloom filter according to the data size n of the data misjudged as malicious URLs by the trained classifier and the expected misjudgment rate F_P;
determining the number k of hash rounds at which the false positive rate reaches its minimum according to the bit number m and the data size n;
and calling the constructor of the standard bloom filter library with the bit number m and the hash round number k as parameters to initialize the post-standard bloom filter.
8. The method of claim 7, wherein hashing the data misjudged by the trained classifier as a malicious URL into the initialized post-standard bloom filter, constructing a post-filter, comprises:
performing k rounds of hash calculation on each URL in the data misjudged as malicious URLs by the trained classifier to obtain index values of k storage positions, and setting the values of the k storage positions to 1;
wherein the hash calculation for the URL in the i-th round comprises:
when i = 1, calculating the hash value of the URL, exchanging the upper 15 bits and the lower 15 bits of the hash value, selecting the first k high-order bits and taking them modulo the bit number m to obtain the index value of the storage position corresponding to the i-th round;
and when i ≥ 2, calculating the hash value of the URL, accumulating the hash value of the URL obtained in the i-th round and the hash value of the URL obtained in the (i-1)-th round to obtain an accumulated hash value, exchanging the upper 15 bits and the lower 15 bits of the accumulated hash value, selecting the first k high-order bits and taking them modulo the bit number m to obtain the index value of the storage position corresponding to the i-th round.
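How the post-filter of claims 7 and 8 is populated can be sketched as follows. The sketch assumes that "data misjudged as malicious URLs" means URLs labeled benign that the trained classifier judges malicious. The `TinyBloom` class, its salted SHA-256 hashing (a plain stand-in, not the bit-exchange scheme of the claim), and the `classify` callable are hypothetical names introduced for illustration.

```python
import hashlib

class TinyBloom:
    """Minimal bloom filter; uses salted SHA-256 instead of the claimed hash rounds."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def build_post_filter(labeled_urls, classify, m, k):
    """Insert every benign-labeled URL that classify() judges malicious."""
    misjudged = [url for url, label in labeled_urls
                 if label == "benign" and classify(url) == "malicious"]
    post_filter = TinyBloom(m, k)
    for url in misjudged:
        post_filter.add(url)
    return post_filter, misjudged
```

Because a bloom filter never reports an inserted item as absent, every URL the classifier misjudged in this way is guaranteed to be found again in the post-filter.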
9. A multi-layer malicious URL identification method based on a learning type bloom filter is characterized by comprising the following steps:
the method adopts a malicious URL identification model for identification, wherein the malicious URL identification model comprises a pre-filter, a classifier and a post-filter, and is obtained by applying the malicious URL model construction method of any one of claims 1 to 8;
inputting a URL to be identified into the pre-filter, and judging whether the URL to be identified is a benign URL or a malicious URL;
if the pre-filter judges the URL to be identified to be a benign URL, inputting the URL to be identified into the classifier, and judging whether the URL to be identified is a benign URL or a malicious URL;
and if the classifier judges the URL to be identified to be a malicious URL, inputting the URL to be identified into the post-filter, and judging whether the URL to be identified is a benign URL or a malicious URL.
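The three-stage flow of claim 9 can be summarized in a few lines of Python. Each stage is passed in as a callable returning "benign" or "malicious"; these callables and the function name are hypothetical. The claim does not say what the result is when a stage does not forward the URL, so the sketch assumes that stage's verdict is final.

```python
def identify(url, pre_filter, classifier, post_filter):
    """Run a URL through pre-filter, classifier and post-filter in order."""
    # Stage 1: only URLs the pre-filter judges benign go on to the classifier.
    if pre_filter(url) == "malicious":
        return "malicious"  # assumption: the pre-filter verdict is final here
    # Stage 2: only URLs the classifier judges malicious go on to the post-filter.
    if classifier(url) == "benign":
        return "benign"     # assumption: the classifier verdict is final here
    # Stage 3: the post-filter gives the final verdict.
    return post_filter(url)
```

In a full model, `pre_filter` and `post_filter` would wrap membership tests on the two bloom filters and `classifier` would wrap the thresholded CNN and GRU output.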
10. A malicious URL recognition model construction apparatus, characterized by comprising the following modules:
the data set dividing module is used for dividing the URL data set into a pre-filter construction data set and a training data set;
the pre-filter construction module is used for initializing a pre-standard bloom filter, and mapping the pre-filter construction data set into the initialized pre-standard bloom filter in a hash mode to construct a pre-filter;
the classifier training module is used for training a classifier for multiple times based on the training data set to obtain the trained classifier; the classifier comprises a convolutional neural network CNN and a recurrent neural network GRU;
the post-filter construction module is used for initializing a post-standard bloom filter, mapping the data which is misjudged as the malicious URL by the trained classifier into the initialized post-standard bloom filter in a hash way, and constructing the post-filter;
and the malicious URL identification model forming module is used for forming a malicious URL identification model by the pre-filter, the trained classifier and the post-filter.
CN202211570363.XA 2022-12-08 2022-12-08 Multilayer malicious URL identification method based on learning type bloom filter Pending CN115941327A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211570363.XA CN115941327A (en) 2022-12-08 2022-12-08 Multilayer malicious URL identification method based on learning type bloom filter

Publications (1)

Publication Number Publication Date
CN115941327A true CN115941327A (en) 2023-04-07

Family

ID=86651796

Country Status (1)

Country Link
CN (1) CN115941327A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150356196A1 (en) * 2014-06-04 2015-12-10 International Business Machines Corporation Classifying uniform resource locators
CN108427928A (en) * 2018-03-16 2018-08-21 华鼎世纪(北京)国际科技有限公司 The detection method and device of anomalous event in monitor video
US20180357434A1 (en) * 2017-06-08 2018-12-13 The Government Of The United States, As Represented By The Secretary Of The Army Secure Generalized Bloom Filter
CN110134693A (en) * 2019-05-17 2019-08-16 南京大学 Temporal index method for building up based on Hash and PCA
CA3057038A1 (en) * 2018-09-29 2020-03-29 10353744 Canada Ltd. Data filtering method, apparatus, electronic apparatus and storage medium
CN111611348A (en) * 2020-05-25 2020-09-01 河南科技大学 ICN network information name searching method based on learning bloom filter
CN112162975A (en) * 2020-09-25 2021-01-01 华南理工大学 Method for realizing repeated data deletion technology based on single-hash equal-distribution bloom filter
US20210141922A1 (en) * 2018-07-27 2021-05-13 Huawei Technologies Co., Ltd. Privacy data reporting method and apparatus, and storage medium
CN113051498A (en) * 2021-03-22 2021-06-29 全球能源互联网研究院有限公司 URL duplicate removal method and system based on multiple bloom filtering
WO2021143016A1 (en) * 2020-01-15 2021-07-22 平安科技(深圳)有限公司 Approximate data processing method and apparatus, medium and electronic device
CN114022279A (en) * 2021-11-05 2022-02-08 税友软件集团股份有限公司 Service data error correction method, device, equipment and readable storage medium
CN114328522A (en) * 2021-12-26 2022-04-12 浪潮云信息技术股份公司 Application method of book distinguishing mode based on bloom filter in digital library

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
K. NANDHINI et al.: "Malicious Website Detection Using Probabilistic Data Structure Bloom Filter", 《2019 3RD INTERNATIONAL CONFERENCE ON COMPUTING METHODOLOGIES AND COMMUNICATION (ICCMC)》, 29 August 2019 (2019-08-29) *
WEIPENG ZHOU et al.: "An Improved Bloom Filter in Distributed Crawler", 《2018 IEEE SMARTWORLD, UBIQUITOUS INTELLIGENCE & COMPUTING, ADVANCED & TRUSTED COMPUTING, SCALABLE COMPUTING & COMMUNICATIONS, CLOUD & BIG DATA COMPUTING, INTERNET OF PEOPLE AND SMART CITY INNOVATION》, 6 December 2018 (2018-12-06) *
刘健;赵刚;郑运鹏: "Design and Implementation of a Multi-layer Filtering Detection Model for Malicious URLs", 信息网络安全, no. 01, 10 January 2016 (2016-01-10) *
刘邦国;陈庆春;类先富: "An Efficient Multi-pattern Matching Algorithm for PDF Text Content Review", 计算机应用研究, no. 06, 30 May 2020 (2020-05-30) *
王伟晨: "Research on the Misjudgment Rate of Data Retrieval Based on the Bloom Filter Algorithm", 计算机产品与流通, no. 03, 15 March 2020 (2020-03-15) *
茅潇潇;段惠超;高明: "A Join Algorithm Based on Bloom Filters in OceanBase", 华东师范大学学报(自然科学版), no. 05, 30 September 2016 (2016-09-30) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination