CN110912917A - Malicious URL detection method and system

Info

Publication number: CN110912917A
Application number: CN201911207542.5A
Authority: CN
Inventors: 熊骁; 郭岗; 林飞; 古元; 沈智杰; 景晓军
Original assignee: Beijing Asia Century Technology Development Co Ltd; Shenzhen Science And Technology Development Co Ltd Surfilter
Current assignee: Beijing Asia Century Technology Development Co Ltd; Shenzhen Science And Technology Development Co Ltd Surfilter
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-03-24

Abstract

The invention provides a malicious URL detection method and system. The malicious URL detection method comprises the following steps: acquiring a URL data set to be analyzed and a malicious URL training sample set; training an SVM (support vector machine) with the malicious URL training sample set and using it to classify the URL data set to be analyzed, thereby obtaining a malicious URL data set and a URL data set to be labeled; clustering the URL data set to be labeled with a clustering algorithm to obtain a URL sample set to be labeled; labeling the URL sample set to be labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set; taking the union of the labeled malicious URL sample set and the malicious URL training sample set to obtain an updated malicious URL training sample set; and subtracting the labeled malicious URL sample set from the URL data set to be labeled to obtain an updated URL test data set. The malicious URL detection method and system are novel in design and highly practical.

Description

Malicious URL detection method and system

Technical Field

The invention relates to the technical field of network information security, in particular to a malicious URL detection method and system.

Background

With the rapid development of the internet, malicious URL attacks have become increasingly common and seriously threaten network security. Conventional URL attack detection systems rely mainly on blacklists or rule lists. These lists grow ever longer, and it is impractical to defend against all attacks in this way. More seriously, such methods have difficulty detecting potential threats, which makes it hard for network security engineers to discover new malicious URL attacks effectively.

To improve generalization ability, many researchers have adopted machine learning-based approaches to this task. These methods fall mainly into two categories. The first is unsupervised, such as anomaly detection, which does not require labeled data; however, such models place far higher demands on the input features than typical supervised models, and it is difficult to maintain performance on the top-scoring results once the number of features grows even moderately. The second is supervised: data are labeled manually on the basis of business experience, and a model is then obtained by supervised learning on those labels; however, labeling is costly, and labeling experts introduce subjective errors that reduce accuracy.

Supervised learning methods generally achieve stronger generalization when labeled data are available. In many cases, however, accurate labeled data are difficult to obtain. More often, only a small number of malicious URLs and a large number of unlabeled URL samples are available, with no sufficiently reliable negative examples, which means the machine learning algorithms above cannot be used directly. If the problem is simply treated as unsupervised, the label information of the known malicious URLs is hard to exploit and satisfactory performance may not be achieved.

Disclosure of Invention

To address the above technical problems, the invention provides a malicious URL detection method and system.

The technical scheme provided by the invention is as follows:

The invention provides a malicious URL detection method, which comprises the following steps:

Step S1, acquiring a URL data set to be analyzed and a malicious URL training sample set; training an SVM (support vector machine) with the malicious URL training sample set and using it to classify the URL data set to be analyzed, thereby obtaining a malicious URL data set and a URL data set to be labeled;

Step S2, clustering the URL data set to be labeled with a clustering algorithm to obtain a URL sample set to be labeled; labeling the URL sample set to be labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set; taking the union of the labeled malicious URL sample set and the malicious URL training sample set to obtain an updated malicious URL training sample set; subtracting the labeled malicious URL sample set from the URL data set to be labeled to obtain an updated URL test data set;

Step S3, training the SVM with the updated malicious URL training sample set, using it to classify the updated URL test data set, and outputting an unlabeled URL data set.

In the above malicious URL detection method, the clustering algorithm is a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm, an expectation-maximization clustering algorithm based on a Gaussian mixture model, an agglomerative hierarchical clustering algorithm, or a graph community detection method.

In the above malicious URL detection method, step S2 uses a Mini Batch K-means algorithm to cluster the URL data set to be labeled, so as to obtain the URL sample set to be labeled.

In the above malicious URL detection method, step S3 further includes subtracting the unlabeled URL data set from the URL data set to be analyzed, thereby obtaining the final malicious URL data set.
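The set operations named in steps S2 and S3 (union, subtraction, and the final subtraction from the data set to be analyzed) can be illustrated with a minimal Python sketch. The function names, the placeholder URLs, and the classifier outputs are hypothetical; the sketch also assumes that step S1 splits the data set to be analyzed into two disjoint parts, which the text implies but does not state.

```python
def update_sets(training_malicious, to_label, expert_malicious):
    """Step S2 bookkeeping: grow the training set, shrink the pool."""
    # Union of the expert-confirmed malicious URLs and the existing training set.
    updated_training = training_malicious | expert_malicious
    # Remove the confirmed malicious URLs from the data set to be labeled.
    updated_test = to_label - expert_malicious
    return updated_training, updated_test


def final_malicious(to_analyze, unlabeled_output):
    """Step S3 follow-up: subtract the unlabeled output from the analyzed set."""
    return to_analyze - unlabeled_output


# Hypothetical toy data; the URLs are placeholders, not real samples.
to_analyze = {"a.example/1", "b.example/2", "c.example/3", "d.example/4"}
training = {"seed.example/x"}
flagged_by_s1 = {"a.example/1"}          # malicious URL data set from step S1
to_label = to_analyze - flagged_by_s1    # assumed: the rest is "to be labeled"
expert_malicious = {"b.example/2"}       # assumed expert verdicts in step S2

training2, test2 = update_sets(training, to_label, expert_malicious)
unlabeled_output = {"c.example/3", "d.example/4"}  # assumed step S3 output
result = final_malicious(to_analyze, unlabeled_output)
```

In this toy run the final malicious set contains both the URL flagged in step S1 and the URL confirmed by the expert in step S2, because neither appears in the unlabeled output.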

The invention also provides a malicious URL detection system, which comprises:

an active learning module, used for acquiring a URL data set to be analyzed and a malicious URL training sample set, and for training an SVM (support vector machine) with the malicious URL training sample set and using it to classify the URL data set to be analyzed, thereby obtaining a malicious URL data set and a URL data set to be labeled;

a labeling module, used for clustering the URL data set to be labeled with a clustering algorithm to obtain a URL sample set to be labeled; labeling the URL sample set to be labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set; taking the union of the labeled malicious URL sample set and the malicious URL training sample set to obtain an updated malicious URL training sample set; and subtracting the labeled malicious URL sample set from the URL data set to be labeled to obtain an updated URL test data set;

and an output module, used for training the SVM with the updated malicious URL training sample set, using it to classify the updated URL test data set, and outputting an unlabeled URL data set.

In the above malicious URL detection system, the clustering algorithm is a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm, an expectation-maximization clustering algorithm based on a Gaussian mixture model, an agglomerative hierarchical clustering algorithm, or a graph community detection method.

In the above malicious URL detection system, the labeling module is further used for clustering the URL data set to be labeled with a Mini Batch K-means algorithm, so as to obtain the URL sample set to be labeled.

In the above malicious URL detection system, the output module is used for training the SVM with the updated malicious URL training sample set, using it to classify the updated URL test data set, and outputting the unlabeled URL data set.

The malicious URL detection method and system can effectively discover potential malicious URL attacks. They can be deployed as an auxiliary to an existing system, and can also help network security engineers discover potential attack patterns, so that newly found malicious URL attacks can be quickly incorporated into the existing system. The malicious URL detection method and system are novel in design and highly practical.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

Fig. 1 is a flowchart of step S1 of the malicious URL detection method according to a preferred embodiment of the present invention;

Fig. 2 is a schematic diagram showing how the processing result of step S1 of the malicious URL detection method shown in Fig. 1 changes;

Fig. 3 is a schematic diagram showing how the processing result of the malicious URL detection method according to the preferred embodiment of the present invention changes;

Fig. 4 is a functional block diagram of a malicious URL detection system according to a preferred embodiment of the present invention.

Detailed Description

The technical problem to be solved by the invention is as follows: in URL detection, usually only a small number of malicious URLs and a large number of unlabeled URL samples are available, and sufficiently reliable negative samples are lacking, so conventional machine learning algorithms cannot be used directly. If the problem is simply treated as unsupervised, the label information of the known malicious URLs is hard to exploit and satisfactory performance may not be achieved. The technical idea of the invention for solving this problem is to construct a malicious URL detection method and system that combine Active Learning (AL for short) with semi-supervised positive-unlabeled learning (PU for short). With a limited manual labeling workload, a malicious URL detection model is developed for a URL data set, and at the same accuracy the number of malicious URLs identified is greatly increased compared with an unsupervised model or a semi-supervised model.

To make the technical purpose, technical solutions and technical effects of the present invention clearer, and to help those skilled in the art understand and implement it, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

The preferred embodiment of the invention provides a malicious URL detection method, which comprises the following steps:

In step S1, an active learning method is adopted, and the URL data set to be analyzed is labeled using the malicious URL training sample set as the labels. Preferably, while the SVM trained with the malicious URL training sample set classifies the URL data set to be analyzed, a data labeling expert also supervises and iteratively refines the process to ensure label accuracy. For example, in Fig. 1, the URL data set to be analyzed is the original unlabeled data x1, x2, x3, ...; the SVM is the Active Learning classifier, which labels the original unlabeled data x1, x2, x3, ...; during labeling, a data labeling expert performs supervision and iterative optimization.

The scheme of step S1 does not limit the specific type of the Active Learning classifier in Fig. 1; supervised classification, in which the URL data set to be analyzed is directly subjected to binary classification according to the malicious URL training sample set, is the simplest and most direct approach. To improve classification efficiency and accuracy, the malicious URL detection method of the present invention introduces steps S2 and S3.

In step S2, the clustering algorithm is a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm, an expectation-maximization clustering algorithm based on a Gaussian mixture model, an agglomerative hierarchical clustering algorithm, or a graph community detection method.

Preferably, in this embodiment, a Mini Batch K-means algorithm is used in this step to cluster the URL data set to be labeled, so as to obtain the URL sample set to be labeled.

The Mini Batch K-means algorithm is a clustering model that preserves clustering accuracy as far as possible while greatly reducing computation time; it uses mini-batches to cut computation time while still attempting to optimize the objective function. A mini-batch is a subset of the data drawn at random for each training iteration; training on these randomly selected data greatly reduces the amount of computation and shortens the convergence time of the K-means algorithm.
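As an illustration of the mini-batch idea described above, the following is a minimal pure-Python sketch of one common formulation, in which each center keeps a per-center count and moves toward its assigned samples with a learning rate of 1/count. The patent does not specify the update rule, the batch size, or the feature vectors, so all of these are assumptions made for illustration.

```python
import random


def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))


def mini_batch_kmeans(points, k, batch_size=8, iterations=50, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # initial centers
    counts = [0] * k                                     # per-center update counts
    for _ in range(iterations):
        # Draw a random subset instead of scanning the full data set.
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then update the centers.
        assigned = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers


# Toy 2-D feature vectors forming two well-separated groups (illustrative only).
points = [(0.0 + 0.1 * i, 0.0) for i in range(10)] + \
         [(10.0 + 0.1 * i, 10.0) for i in range(10)]
centers = mini_batch_kmeans(points, k=2)
labels = [nearest(p, centers) for p in points]
```

Each iteration touches only `batch_size` samples, which is where the saving in computation time comes from; the trade-off is a slightly noisier estimate of the centers than full K-means would give.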

Specifically, sampling follows the Uncertainty & Diversity criterion, that is, it selects as far as possible a sample set about which the current model is most uncertain and which is rich in diversity. The specific process is as follows: 1) score the new data Dnew with the current model; 2) extract a number of the white samples about which the model is most uncertain to form Duncertain, where uncertainty is measured from the model score; 3) perform K-means clustering on Duncertain and take a number of the most uncertain samples from each cluster to form the URL sample set to be labeled.
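The three-stage selection above can be sketched as follows. This is a minimal illustration under several assumptions that are not in the source: model scores are taken to lie in [0, 1] and are supplied as input (stage 1 is assumed to have run already), uncertainty is measured as closeness of the score to 0.5, and a plain Lloyd's K-means with deterministic initial centers stands in for the clustering step. The sample names and feature vectors are hypothetical.

```python
def uncertainty(score):
    """Assumed measure: scores near 0.5 are the least certain."""
    return 1.0 - abs(score - 0.5) * 2.0


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means with deterministic initial centers."""
    step = max(1, len(points) // k)
    centers = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sum((a - b) ** 2
                      for a, b in zip(p, centers[j]))) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def select_for_labeling(samples, n_uncertain, k, per_cluster):
    """samples: list of (url, feature_vector, model_score) tuples."""
    # Stage 2: keep the n_uncertain samples the model is least sure about.
    ranked = sorted(samples, key=lambda s: uncertainty(s[2]), reverse=True)
    d_uncertain = ranked[:n_uncertain]
    # Stage 3: cluster them and take the most uncertain samples per cluster.
    labels = kmeans([s[1] for s in d_uncertain], k)
    chosen = []
    for j in range(k):
        cluster = [s for s, l in zip(d_uncertain, labels) if l == j]
        chosen.extend(s[0] for s in cluster[:per_cluster])  # already ranked
    return chosen


# Hypothetical scored samples: (name, features, model score).
d_new = [
    ("u1", (0.0, 0.0), 0.52),
    ("u2", (0.1, 0.0), 0.45),
    ("u3", (5.0, 5.0), 0.58),
    ("u4", (5.1, 5.0), 0.40),
    ("u5", (9.0, 9.0), 0.99),
    ("u6", (0.2, 0.1), 0.02),
]
picked = select_for_labeling(d_new, n_uncertain=4, k=2, per_cluster=1)
```

Here `picked` contains one sample from each of the two clusters, so the expert sees uncertain samples from different regions of the feature space rather than several near-duplicates.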

Furthermore, the URL sample set to be labeled is labeled according to the judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set. Samples in the URL sample set to be labeled that the expert cannot judge are not labeled and are placed in the unlabeled URL sample set.

As shown in Fig. 2, the expert may label in multiple rounds, gradually expanding the L set, so that performance improves continuously as the L set is learned over multiple rounds. In the malicious URL detection method of the present invention, as shown in Fig. 3, the expert labels in multiple rounds to expand the P set, and learning is updated at each iteration.

The malicious URL detection method can grow a new model on the basis of existing knowledge (namely the malicious URL training sample set), since existing knowledge yields black-sample labels (namely the malicious URL data set) with high accuracy but low recall. The method can be divided into two stages. The first stage is step S1: samples from the malicious URL training sample set are mixed into the URL data set to be analyzed as spies, and multiple rounds of EM iteration are performed. The second stage comprises steps S2 and S3: by examining the score distribution of the spy samples, all samples in the URL data set to be analyzed whose scores are below the 10% quantile of the spy samples' model scores are marked and collected into the updated URL test data set, and multiple rounds of EM iteration are performed on the updated URL test data set.
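The thresholding on the spy score distribution can be sketched as below. The scores, the URL names, and the simple no-interpolation quantile are assumptions made for illustration; the source states only that samples scoring below the 10% quantile of the spy scores are marked, and it does not say how the quantile is computed or how the model produces its scores.

```python
def quantile(values, q):
    """Simple quantile on a sorted copy, without interpolation."""
    ordered = sorted(values)
    index = int(q * (len(ordered) - 1))
    return ordered[index]


def split_by_spy_threshold(unlabeled_scores, spy_scores, q=0.10):
    """Mark the unlabeled samples that score below the spy quantile."""
    threshold = quantile(spy_scores, q)
    marked = [url for url, s in unlabeled_scores.items() if s < threshold]
    remaining = [url for url, s in unlabeled_scores.items() if s >= threshold]
    return threshold, marked, remaining


# Hypothetical model scores (higher is assumed to mean "more likely malicious").
spy_scores = [0.35, 0.60, 0.62, 0.70, 0.75, 0.80, 0.82, 0.88, 0.90, 0.95, 0.97]
unlabeled_scores = {"u1": 0.05, "u2": 0.30, "u3": 0.61, "u4": 0.93}
threshold, marked, remaining = split_by_spy_threshold(unlabeled_scores, spy_scores)
```

Because the spies are known malicious samples, a sample that scores lower than nearly all of them is unlikely to be malicious; in the spy technique as commonly used in PU learning, such samples are treated as likely negatives.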

EM can be understood as an improvement of MLE (maximum likelihood estimation) for the case where hidden variables are present: the E step fills in the missing values, and the M step iterates on the result of the last filling, so that the final model is produced after many rounds.
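In standard textbook notation, the two steps can be written as follows. The symbols are introduced here for illustration and do not appear in the source: X denotes the observed data, Z the hidden variables (here, the unknown labels), and θ the model parameters at iteration t.

```latex
% E step: expected complete-data log-likelihood under the current parameters
Q(\theta \mid \theta^{(t)}) =
  \mathbb{E}_{Z \mid X,\, \theta^{(t)}} \bigl[ \log p(X, Z \mid \theta) \bigr]

% M step: re-estimate the parameters by maximizing that expectation
\theta^{(t+1)} = \arg\max_{\theta} \; Q(\theta \mid \theta^{(t)})
```

When there are no hidden variables, the expectation disappears and the M step reduces to ordinary maximum likelihood estimation.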

Further, in this embodiment, the Active Learning classifier is a GBRT (gradient boosting regression tree) based classifier, so that a GBRT model is generated each time the malicious URL detection method is executed.

Further, step S3 also includes subtracting the unlabeled URL data set from the URL data set to be analyzed, thereby obtaining the final malicious URL data set.

Further, Fig. 4 is a functional block diagram of a malicious URL detection system according to a preferred embodiment of the present invention. Specifically, the malicious URL detection system includes:

The active learning module 100 is configured to acquire a URL data set to be analyzed and a malicious URL training sample set, and to train an SVM (support vector machine) with the malicious URL training sample set and use it to classify the URL data set to be analyzed, thereby obtaining a malicious URL data set and a URL data set to be labeled.

The active learning module 100 uses an active learning method to label the URL data set to be analyzed, with the malicious URL training sample set serving as the labels. Preferably, while the SVM trained with the malicious URL training sample set classifies the URL data set to be analyzed, a data labeling expert also supervises and iteratively refines the process to ensure label accuracy. For example, in Fig. 1, the URL data set to be analyzed is the original unlabeled data x1, x2, x3, ...; the SVM is the Active Learning classifier, which labels the original unlabeled data x1, x2, x3, ...; during labeling, a data labeling expert performs supervision and iterative optimization.

The active learning module 100 does not limit the specific type of the Active Learning classifier in Fig. 1; supervised classification, in which the URL data set to be analyzed is directly subjected to binary classification according to the malicious URL training sample set, is the simplest and most direct approach.

The labeling module 200 is configured to cluster the URL data set to be labeled with a clustering algorithm to obtain a URL sample set to be labeled; to label the URL sample set to be labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set; to take the union of the labeled malicious URL sample set and the malicious URL training sample set to obtain an updated malicious URL training sample set; and to subtract the labeled malicious URL sample set from the URL data set to be labeled to obtain an updated URL test data set.

The clustering algorithm is a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm, an expectation-maximization clustering algorithm based on a Gaussian mixture model, an agglomerative hierarchical clustering algorithm, or a graph community detection method.

Preferably, in this embodiment, the labeling module 200 is further configured to cluster the URL data set to be labeled with a Mini Batch K-means algorithm, so as to obtain the URL sample set to be labeled.

The output module 300 is configured to train the SVM with the updated malicious URL training sample set, use it to classify the updated URL test data set, and output an unlabeled URL data set.

It is understood that the output module 300 is further configured to subtract the unlabeled URL data set from the URL data set to be analyzed, so as to obtain the final malicious URL data set.

It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims

1. A malicious URL detection method is characterized by comprising the following steps:

2. The malicious URL detection method according to claim 1, wherein the clustering algorithm is a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm, an expectation-maximization clustering algorithm based on a Gaussian mixture model, an agglomerative hierarchical clustering algorithm, or a graph community detection method.

3. The malicious URL detection method according to claim 2, wherein in step S2, a Mini Batch K-means algorithm is used to cluster the URL data set to be labeled, so as to obtain the URL sample set to be labeled.

4. The malicious URL detection method according to claim 1, wherein step S3 further comprises: subtracting the unlabeled URL data set from the URL data set to be analyzed, thereby obtaining the final malicious URL data set.

5. A malicious URL detection system, comprising:

an active learning module (100), used for acquiring a URL data set to be analyzed and a malicious URL training sample set, and for training an SVM (support vector machine) with the malicious URL training sample set and using it to classify the URL data set to be analyzed, thereby obtaining a malicious URL data set and a URL data set to be labeled;

a labeling module (200), used for clustering the URL data set to be labeled with a clustering algorithm to obtain a URL sample set to be labeled; labeling the URL sample set to be labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set; taking the union of the labeled malicious URL sample set and the malicious URL training sample set to obtain an updated malicious URL training sample set; and subtracting the labeled malicious URL sample set from the URL data set to be labeled to obtain an updated URL test data set;

and an output module (300), used for training the SVM with the updated malicious URL training sample set, using it to classify the updated URL test data set, and outputting an unlabeled URL data set.

6. The malicious URL detection system according to claim 5, wherein the clustering algorithm is a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm, an expectation-maximization clustering algorithm based on a Gaussian mixture model, an agglomerative hierarchical clustering algorithm, or a graph community detection method.

7. The malicious URL detection system according to claim 6, wherein the labeling module (200) is further configured to cluster the URL data set to be labeled with a Mini Batch K-means algorithm, so as to obtain the URL sample set to be labeled.

8. The malicious URL detection system according to claim 5, wherein the output module (300) is configured to train the SVM with the updated malicious URL training sample set, use it to classify the updated URL test data set, and output the unlabeled URL data set.