CN110912917A - Malicious URL detection method and system - Google Patents

Info

Publication number
CN110912917A
Authority
CN
China
Prior art keywords
url
malicious
sample set
labeled
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911207542.5A
Other languages
Chinese (zh)
Inventor
熊骁
郭岗
林飞
古元
沈智杰
景晓军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Asia Century Technology Development Co Ltd
Shenzhen Science And Technology Development Co Ltd Surfilter
Original Assignee
Beijing Asia Century Technology Development Co Ltd
Shenzhen Science And Technology Development Co Ltd Surfilter
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Asia Century Technology Development Co Ltd, Shenzhen Science And Technology Development Co Ltd Surfilter
Priority to CN201911207542.5A
Publication of CN110912917A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a malicious URL detection method and system. The method comprises the following steps: acquiring a URL data set to be analyzed and a malicious URL training sample set; training an SVM (support vector machine) with the malicious URL training sample set and using it to classify the URL data set to be analyzed, obtaining a malicious URL data set and a URL data set to be labeled; clustering the URL data set to be labeled with a clustering algorithm to obtain a URL sample set to be labeled; labeling the URL sample set to be labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set; merging the labeled malicious URL sample set with the malicious URL training sample set by set union to obtain an updated malicious URL training sample set; and subtracting the labeled malicious URL sample set from the URL data set to be labeled to obtain an updated URL test data set. The malicious URL detection method and system are novel in design and highly practical.

Description

Malicious URL detection method and system
Technical Field
The invention relates to the technical field of network information security, in particular to a malicious URL detection method and system.
Background
With the rapid development of the internet, malicious URL attacks have become increasingly common and seriously threaten network security. Conventional URL attack detection systems rely mainly on blacklists or rule lists. These lists keep growing, and it is impractical to defend against all attacks this way. More seriously, such methods have difficulty detecting potential threats, and network security engineers find it hard to discover new malicious URL attacks with them.
To improve generalization, many researchers have adopted machine-learning-based approaches. These fall into two main categories. The first is unsupervised methods, such as anomaly detection, which do not require labeled data; however, they place far higher demands on the input features than typical supervised models, and it is difficult to maintain accuracy among the top-scored samples once the number of features grows even slightly. The second is supervised methods, in which data are labeled manually based on domain experience and a model is then trained on the labels; however, labeling is costly, and labeling experts introduce subjective errors that reduce accuracy.
Supervised learning methods generally generalize better when labeled data are available. In many cases, however, accurate labeled data are hard to obtain. More often, only a small number of malicious URLs and a large number of unlabeled URL samples are available, with no sufficiently reliable negative examples, which means that the machine learning algorithms mentioned above cannot be used directly. If the problem is instead handled in a purely unsupervised way, the label information of the known malicious URLs is hard to exploit and satisfactory performance may not be achieved.
Disclosure of Invention
To address the above technical problems, the invention provides a malicious URL detection method and system.
The technical scheme provided by the invention is as follows:
The invention provides a malicious URL detection method, which comprises the following steps:
Step S1: acquire a URL data set to be analyzed and a malicious URL training sample set; train an SVM (support vector machine) with the malicious URL training sample set and use it to classify the URL data set to be analyzed, obtaining a malicious URL data set and a URL data set to be labeled;
Step S2: cluster the URL data set to be labeled with a clustering algorithm to obtain a URL sample set to be labeled; label the URL sample set to be labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set; merge the labeled malicious URL sample set with the malicious URL training sample set by set union to obtain an updated malicious URL training sample set; subtract the labeled malicious URL sample set from the URL data set to be labeled to obtain an updated URL test data set;
Step S3: train the SVM with the updated malicious URL training sample set, use it to classify the updated URL test data set, and output an unlabeled URL data set.
In the above malicious URL detection method, the clustering algorithm is a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm, an expectation-maximization clustering algorithm based on a Gaussian mixture model, an agglomerative hierarchical clustering algorithm, or a graph community detection method.
In the above malicious URL detection method, step S2 uses the Mini Batch K-means algorithm to cluster the URL data set to be labeled, so as to obtain the URL sample set to be labeled.
In the above malicious URL detection method, step S3 further includes: subtracting the unlabeled URL data set from the URL data set to be analyzed, thereby obtaining the final malicious URL data set.
The invention also provides a malicious URL detection system, which comprises:
an active learning module, used to acquire a URL data set to be analyzed and a malicious URL training sample set, and to train an SVM (support vector machine) with the malicious URL training sample set and use it to classify the URL data set to be analyzed, obtaining a malicious URL data set and a URL data set to be labeled;
a labeling module, used to cluster the URL data set to be labeled with a clustering algorithm to obtain a URL sample set to be labeled; to label the URL sample set to be labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set; to merge the labeled malicious URL sample set with the malicious URL training sample set by set union to obtain an updated malicious URL training sample set; and to subtract the labeled malicious URL sample set from the URL data set to be labeled to obtain an updated URL test data set;
and an output module, used to train the SVM with the updated malicious URL training sample set, use it to classify the updated URL test data set, and output an unlabeled URL data set.
In the above malicious URL detection system, the clustering algorithm is a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm, an expectation-maximization clustering algorithm based on a Gaussian mixture model, an agglomerative hierarchical clustering algorithm, or a graph community detection method.
In the above malicious URL detection system, the labeling module is further used to cluster the URL data set to be labeled with the Mini Batch K-means algorithm, so as to obtain the URL sample set to be labeled.
In the above malicious URL detection system, the output module is used to train the SVM with the updated malicious URL training sample set, use it to classify the updated URL test data set, and output the unlabeled URL data set.
The malicious URL detection method and system can effectively find potential malicious URL attacks. They can be deployed as a supplement to an existing system, and they can help network security engineers discover potential attack patterns so that newly found malicious URL attacks can be quickly added to the existing system. The malicious URL detection method and system are novel in design and highly practical.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
Fig. 1 is a flowchart of step S1 of the malicious URL detection method according to a preferred embodiment of the present invention;
Fig. 2 is a schematic diagram of how the processing result of step S1 shown in fig. 1 changes;
Fig. 3 is a schematic diagram of how the processing result of the malicious URL detection method according to the preferred embodiment of the present invention changes;
Fig. 4 is a functional block diagram of a malicious URL detection system according to a preferred embodiment of the present invention.
Detailed Description
The technical problem to be solved by the invention is as follows: in URL detection, usually only a small number of malicious URLs and a large number of unlabeled URL samples are available, and sufficiently reliable negative samples are lacking, which means that conventional machine learning algorithms cannot be used directly. If the problem is handled in a purely unsupervised way, the label information of the known malicious URLs is hard to exploit and satisfactory performance may not be achieved. The technical idea of the invention for solving this problem is as follows: a malicious URL detection method and system are constructed that combine Active Learning (AL for short) with semi-supervised learning from positive and unlabeled samples (PU for short). With a limited manual labeling workload, a malicious URL detection model is developed for a URL data set; at the same accuracy, it identifies substantially more malicious URLs than an unsupervised model or a semi-supervised model.
In order to make the technical purpose, technical solutions and technical effects of the present invention more clear and facilitate those skilled in the art to understand and implement the present invention, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.
The preferred embodiment of the invention provides a malicious URL detection method, which comprises the following steps:
Step S1: acquire a URL data set to be analyzed and a malicious URL training sample set; train an SVM (support vector machine) with the malicious URL training sample set and use it to classify the URL data set to be analyzed, obtaining a malicious URL data set and a URL data set to be labeled.
This step uses an active learning method: the malicious URL training sample set serves as the labeled data with which the URL data set to be analyzed is labeled. Preferably, while the SVM trained on the malicious URL training sample set classifies the URL data set to be analyzed, a data labeling expert supervises the process and iteratively refines it to ensure label accuracy. For example, in fig. 1, the URL data set to be analyzed is the original unlabeled data x1, x2, x3, ...; the SVM is the Active Learning classifier; the Active Learning classifier labels the original unlabeled data x1, x2, x3, ..., and a data labeling expert supervises and iteratively refines the labeling.
The scheme of step S1 does not restrict the specific type of the Active Learning classifier in fig. 1. Supervised classification, in which the URL data set to be analyzed is directly subjected to binary classification based on the malicious URL training sample set, is the simplest and most direct approach. To improve classification efficiency and accuracy, the malicious URL detection method of the present invention introduces steps S2 and S3.
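As a rough illustration of the classification in step S1, the sketch below trains a linear SVM by sub-gradient descent on the hinge loss and splits a set of samples into a malicious set and a to-be-labeled set. The patent does not specify the features, the kernel, the hyperparameters, or where negative training examples come from; here URLs are assumed to be already converted to numeric feature vectors, the negative examples are assumed to be provisional ones, and all names and values are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def score(w: Vector, b: float, x: Vector) -> float:
    """Signed score of x under the linear model (w, b)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X: Sequence[Vector], y: Sequence[int],
                     lam: float = 0.01, lr: float = 0.1,
                     epochs: int = 50) -> Tuple[List[float], float]:
    """Sub-gradient descent on the L2-regularized hinge loss; labels are +1/-1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * score(w, b, xi)
            w = [wj * (1.0 - lr * lam) for wj in w]  # regularization shrink
            if margin < 1.0:                         # hinge loss is active
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def split_by_svm(w: Vector, b: float,
                 X: Sequence[Vector]) -> Tuple[List[int], List[int]]:
    """Indices scored as malicious (score >= 0) and indices left to be labeled."""
    malicious = [i for i, x in enumerate(X) if score(w, b, x) >= 0.0]
    to_label = [i for i, x in enumerate(X) if score(w, b, x) < 0.0]
    return malicious, to_label


# Toy feature vectors (assumed): +1 = known malicious, -1 = provisional negative.
train_X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
train_y = [1, 1, -1, -1]
w, b = train_linear_svm(train_X, train_y)

to_analyze = [[2.5, 2.5], [-2.5, -2.5], [4.0, 3.0]]
malicious_idx, to_label_idx = split_by_svm(w, b, to_analyze)
```

In practice a library SVM would normally replace the hand-written trainer; it is written out here only to keep the example self-contained.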
Step S2: cluster the URL data set to be labeled with a clustering algorithm to obtain a URL sample set to be labeled; label the URL sample set to be labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set; merge the labeled malicious URL sample set with the malicious URL training sample set by set union to obtain an updated malicious URL training sample set; subtract the labeled malicious URL sample set from the URL data set to be labeled to obtain an updated URL test data set.
In this step, the clustering algorithm is a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm, an expectation-maximization clustering algorithm based on a Gaussian mixture model, an agglomerative hierarchical clustering algorithm, or a graph community detection method.
Preferably, in this embodiment, this step uses the Mini Batch K-means algorithm to cluster the URL data set to be labeled, so as to obtain the URL sample set to be labeled.
The Mini Batch K-means algorithm is a clustering method that preserves clustering accuracy as far as possible while greatly reducing computation time. It uses mini-batches to cut computation while still attempting to optimize the objective function. A mini-batch is a subset of the data drawn at random at each training iteration; training on these randomly drawn subsets greatly reduces computation time and shortens the convergence time of the K-means algorithm.
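The mini-batch idea described above can be sketched as follows. This is a minimal pure-Python version, assuming numeric feature vectors and a per-center learning rate of 1/count (a common choice for mini-batch K-means); the batch size, iteration count, and the optional `init` parameter are illustrative assumptions, not values from the patent.

```python
import random
from typing import List, Optional, Sequence

Point = Sequence[float]


def nearest(centers: Sequence[Point], p: Point) -> int:
    """Index of the center closest to p (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((c - x) ** 2 for c, x in zip(centers[j], p)))


def mini_batch_kmeans(points: Sequence[Point], k: int, batch_size: int = 4,
                      n_iter: int = 50, seed: int = 0,
                      init: Optional[Sequence[Point]] = None) -> List[List[float]]:
    """Update centers from small random batches instead of the full data set."""
    rng = random.Random(seed)
    pool = list(points)
    start = init if init is not None else rng.sample(pool, k)
    centers = [list(c) for c in start]
    counts = [0] * k
    for _ in range(n_iter):
        batch = rng.sample(pool, min(batch_size, len(pool)))
        assigned = [nearest(centers, p) for p in batch]  # assign the whole batch first
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]                        # per-center learning rate
            centers[j] = [(1.0 - eta) * c + eta * x
                          for c, x in zip(centers[j], p)]
    return centers


# Two well-separated toy groups of feature vectors.
points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11]]
centers = mini_batch_kmeans(points, k=2, init=[[0, 0], [10, 10]])
labels = [nearest(centers, p) for p in points]
```

Each iteration touches only `batch_size` points, which is where the saving in computation time comes from.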
Specifically, sampling follows the Uncertainty & Diversity criterion, that is, it selects samples about which the current model is most uncertain and which are as diverse as possible. The specific process is as follows: 1) score the new data Dnew with the current model; 2) extract the white samples (that is, samples not classified as malicious) about which the model is most uncertain to form Duncertain, where uncertainty is measured from the model score; 3) perform K-means clustering on Duncertain and take the most uncertain samples from each cluster to form the URL sample set to be labeled.
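The three sampling steps above can be sketched as follows. The patent does not define the uncertainty measure; this sketch assumes model scores are probabilities in [0, 1], that "white" means a score below 0.5, and that uncertainty is closeness to 0.5. Cluster labels are taken as input from any K-means implementation, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def uncertainty(score: float) -> float:
    """Assumed measure: the closer a score is to 0.5, the more uncertain."""
    return -abs(score - 0.5)


def most_uncertain_white(scores: Sequence[float], n: int) -> List[int]:
    """Step 2: indices of the n most uncertain samples scored as white (< 0.5)."""
    white = [i for i, s in enumerate(scores) if s < 0.5]
    return sorted(white, key=lambda i: uncertainty(scores[i]), reverse=True)[:n]


def pick_per_cluster(indices: Sequence[int], cluster_labels: Sequence[int],
                     scores: Sequence[float], per_cluster: int) -> List[int]:
    """Step 3: keep the most uncertain samples of every cluster (diversity)."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for i, c in zip(indices, cluster_labels):
        groups[c].append(i)
    chosen: List[int] = []
    for c in sorted(groups):
        ranked = sorted(groups[c], key=lambda i: uncertainty(scores[i]), reverse=True)
        chosen.extend(ranked[:per_cluster])
    return chosen


# Step 1 (assumed already done): model scores for the new data Dnew.
scores = [0.05, 0.45, 0.40, 0.90, 0.48, 0.10, 0.30]
d_uncertain = most_uncertain_white(scores, n=4)
# Cluster labels for d_uncertain, as if produced by K-means (assumed values).
cluster_labels = [0, 0, 1, 1]
to_label = pick_per_cluster(d_uncertain, cluster_labels, scores, per_cluster=1)
```

Picking per cluster rather than globally keeps the expert from labeling several near-identical URLs in one round.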
Furthermore, the URL sample set to be labeled is labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set. Samples in the URL sample set to be labeled that the expert cannot judge are left unlabeled and placed in the unlabeled URL sample set.
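The set operations of step S2 (union for the training set, subtraction for the test set) can be sketched with Python sets. The verdict encoding (`True` for malicious, `False` for not malicious, `None` for cannot judge) and the URL strings are assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def apply_expert_labels(
    train_malicious: Set[str],
    to_label_dataset: Set[str],
    sample_to_label: Iterable[str],
    expert_verdict: Dict[str, Optional[bool]],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (updated training set, updated test set, unlabeled sample set)."""
    sample = set(sample_to_label)
    labeled_malicious = {u for u in sample if expert_verdict.get(u) is True}
    # Everything not confirmed as malicious, including undecided samples,
    # stays in the unlabeled sample set.
    unlabeled_sample = sample - labeled_malicious
    updated_train = train_malicious | labeled_malicious   # set union
    updated_test = to_label_dataset - labeled_malicious   # set subtraction
    return updated_train, updated_test, unlabeled_sample


# Hypothetical URLs, for illustration only.
train = {"m1.test/a?cmd=x", "m2.test/b?cmd=y"}
dataset = {"u1.test", "u2.test", "u3.test", "u4.test", "u5.test"}
sample = ["u1.test", "u2.test", "u3.test"]
verdict = {"u1.test": True, "u2.test": False, "u3.test": None}

updated_train, updated_test, unlabeled_sample = apply_expert_labels(
    train, dataset, sample, verdict)
```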
Step S3: train the SVM with the updated malicious URL training sample set, use it to classify the updated URL test data set, and output an unlabeled URL data set.
As shown in fig. 2, the expert may label over multiple rounds, gradually expanding the L set, and the model's performance improves continuously as the L set is learned repeatedly. In the malicious URL detection method of the present invention, as shown in fig. 3, the expert labels over multiple rounds to expand the P set, and learning is updated at each iteration.
The malicious URL detection method can grow a new model on the basis of existing knowledge (namely the malicious URL training sample set), since existing knowledge yields black sample labels (namely the malicious URL data set) with high accuracy but low recall. The method can be divided into two stages. The first stage is step S1: samples from the malicious URL training sample set are mixed into the URL data set to be analyzed as spies, and multiple rounds of EM iteration are performed. The second stage comprises steps S2 and S3: by examining the score distribution of the spy samples, all samples in the URL data set to be analyzed whose model score is below the 10% quantile of the spy scores are marked and gathered into the updated URL test data set, and multiple rounds of EM iteration are performed on the updated URL test data set.
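The spy thresholding described above can be sketched as follows. The quantile definition (nearest rank), the score values, and the function names are assumptions; the text only states that samples scoring below the 10% quantile of the spy scores are marked. In the usual spy technique such samples are treated as reliable negatives, which is an interpretation rather than something the text states.

```python
import math
from typing import Dict, List, Sequence


def spy_threshold(spy_scores: Sequence[float], q: float = 0.10) -> float:
    """Nearest-rank q-quantile of the spy scores (assumed quantile definition)."""
    ordered = sorted(spy_scores)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def below_spy_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """URLs whose model score is strictly below the spy threshold."""
    return sorted(u for u, s in scores.items() if s < threshold)


# Model scores of the spies (known malicious URLs mixed into the data set).
spy_scores = [0.90, 0.80, 0.95, 0.70, 0.85, 0.60, 0.99, 0.75, 0.88, 0.30]
threshold = spy_threshold(spy_scores)

# Model scores of URLs in the data set to be analyzed (hypothetical values).
unlabeled_scores = {"a": 0.10, "b": 0.50, "c": 0.29, "d": 0.30}
marked = below_spy_threshold(unlabeled_scores, threshold)
```

Because 90% of the known malicious spies score above the threshold, URLs scoring below it are unlikely to be malicious under the current model.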
EM (expectation maximization) can be understood as an extension of MLE (Maximum Likelihood Estimation) to the case where hidden variables are present: the E step fills in the missing values, and the M step re-estimates the model based on the latest filling result, so that the final model is obtained after many rounds.
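To make the E step and M step concrete, here is a generic EM sketch for a two-component one-dimensional Gaussian mixture with unit variance. It is not the patent's model: the hidden variable (which component produced each point) merely stands in for the unknown label of an unlabeled URL, and the data, initial means, and fixed variance are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def em_two_gaussians(data: Sequence[float], init_means: Tuple[float, float],
                     n_iter: int = 20) -> Tuple[List[float], List[float]]:
    """EM for a 2-component 1-D Gaussian mixture with unit variance (assumed)."""
    means = list(init_means)
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E step: fill in the hidden membership of each point as a probability.
        resp = []
        for x in data:
            dens = [weights[k] * math.exp(-0.5 * (x - means[k]) ** 2)
                    for k in range(2)]
            total = dens[0] + dens[1]
            resp.append([d / total for d in dens])
        # M step: re-estimate parameters from the filled-in memberships.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            weights[k] = nk / len(data)
    return means, weights


data = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
means, weights = em_two_gaussians(data, init_means=(1.0, 9.0))
```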
Further, in this embodiment, the Active Learning classifier is a GBRT (gradient boosting regression tree) based classifier, so each execution of the malicious URL detection method produces a GBRT model.
Further, step S3 further includes: subtracting the unlabeled URL data set from the URL data set to be analyzed, thereby obtaining the final malicious URL data set.
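Step S3 and the final subtraction can be sketched as below. The classifier is passed in as a callable so that any model can be plugged in; `toy_classify` is a hypothetical stand-in that flags URLs sharing a query-parameter name with a training URL, and is not the SVM or GBRT model of the patent.

```python
from typing import Callable, Set, Tuple

# (training set, test set) -> (flagged as malicious, output as unlabeled)
Classifier = Callable[[Set[str], Set[str]], Tuple[Set[str], Set[str]]]


def final_malicious_set(to_analyze: Set[str], updated_train: Set[str],
                        updated_test: Set[str], classify: Classifier) -> Set[str]:
    """Classify the updated test set, then subtract the unlabeled output."""
    _flagged, unlabeled_output = classify(updated_train, updated_test)
    return to_analyze - unlabeled_output


def query_key(url: str) -> str:
    """Name of the first query parameter, e.g. 'cmd' for 'a.test/x?cmd=1'."""
    return url.split("?")[-1].split("=")[0]


def toy_classify(train: Set[str], test: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Hypothetical stand-in classifier, for illustration only."""
    keys = {query_key(u) for u in train}
    flagged = {u for u in test if query_key(u) in keys}
    return flagged, test - flagged


to_analyze = {"a.test/x?cmd=1", "a.test/y?id=2", "a.test/z?cmd=3", "a.test/w?page=4"}
updated_train = {"b.test/q?cmd=9", "a.test/x?cmd=1"}
updated_test = {"a.test/y?id=2", "a.test/z?cmd=3", "a.test/w?page=4"}

final_malicious = final_malicious_set(to_analyze, updated_train, updated_test,
                                      toy_classify)
```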
Further, fig. 4 is a functional block diagram of a malicious URL detection system according to a preferred embodiment of the present invention. Specifically, the malicious URL detection system includes:
The active learning module 100, configured to acquire a URL data set to be analyzed and a malicious URL training sample set, and to train an SVM (support vector machine) with the malicious URL training sample set and use it to classify the URL data set to be analyzed, obtaining a malicious URL data set and a URL data set to be labeled.
The active learning module 100 uses an active learning method: the malicious URL training sample set serves as the labeled data with which the URL data set to be analyzed is labeled. Preferably, while the SVM trained on the malicious URL training sample set classifies the URL data set to be analyzed, a data labeling expert supervises the process and iteratively refines it to ensure label accuracy. For example, in fig. 1, the URL data set to be analyzed is the original unlabeled data x1, x2, x3, ...; the SVM is the Active Learning classifier; the Active Learning classifier labels the original unlabeled data x1, x2, x3, ..., and a data labeling expert supervises and iteratively refines the labeling.
The active learning module 100 does not restrict the specific type of the Active Learning classifier in fig. 1. Supervised classification, in which the URL data set to be analyzed is directly subjected to binary classification based on the malicious URL training sample set, is the simplest and most direct approach.
The labeling module 200, configured to cluster the URL data set to be labeled with a clustering algorithm to obtain a URL sample set to be labeled; to label the URL sample set to be labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set; to merge the labeled malicious URL sample set with the malicious URL training sample set by set union to obtain an updated malicious URL training sample set; and to subtract the labeled malicious URL sample set from the URL data set to be labeled to obtain an updated URL test data set.
The clustering algorithm is a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm, an expectation-maximization clustering algorithm based on a Gaussian mixture model, an agglomerative hierarchical clustering algorithm, or a graph community detection method.
Preferably, in this embodiment, the labeling module 200 is further configured to cluster the URL data set to be labeled with the Mini Batch K-means algorithm, so as to obtain the URL sample set to be labeled.
The Mini Batch K-means algorithm is a clustering method that preserves clustering accuracy as far as possible while greatly reducing computation time. It uses mini-batches to cut computation while still attempting to optimize the objective function. A mini-batch is a subset of the data drawn at random at each training iteration; training on these randomly drawn subsets greatly reduces computation time and shortens the convergence time of the K-means algorithm.
Specifically, sampling follows the Uncertainty & Diversity criterion, that is, it selects samples about which the current model is most uncertain and which are as diverse as possible. The specific process is as follows: 1) score the new data Dnew with the current model; 2) extract the white samples (that is, samples not classified as malicious) about which the model is most uncertain to form Duncertain, where uncertainty is measured from the model score; 3) perform K-means clustering on Duncertain and take the most uncertain samples from each cluster to form the URL sample set to be labeled.
Furthermore, the URL sample set to be labeled is labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set. Samples in the URL sample set to be labeled that the expert cannot judge are left unlabeled and placed in the unlabeled URL sample set.
The output module 300, configured to train the SVM with the updated malicious URL training sample set, use it to classify the updated URL test data set, and output an unlabeled URL data set.
It is understood that the output module 300 is further configured to subtract the unlabeled URL data set from the URL data set to be analyzed, so as to obtain the final malicious URL data set.
The malicious URL detection method and system can effectively find potential malicious URL attacks. They can be deployed as a supplement to an existing system, and they can help network security engineers discover potential attack patterns so that newly found malicious URL attacks can be quickly added to the existing system. The malicious URL detection method and system are novel in design and highly practical.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (8)

1. A malicious URL detection method, characterized by comprising the following steps:
Step S1: acquiring a URL data set to be analyzed and a malicious URL training sample set; training an SVM (support vector machine) with the malicious URL training sample set and using it to classify the URL data set to be analyzed, obtaining a malicious URL data set and a URL data set to be labeled;
Step S2: clustering the URL data set to be labeled with a clustering algorithm to obtain a URL sample set to be labeled; labeling the URL sample set to be labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set; merging the labeled malicious URL sample set with the malicious URL training sample set by set union to obtain an updated malicious URL training sample set; subtracting the labeled malicious URL sample set from the URL data set to be labeled to obtain an updated URL test data set;
Step S3: training the SVM with the updated malicious URL training sample set, using it to classify the updated URL test data set, and outputting an unlabeled URL data set.
2. The malicious URL detection method according to claim 1, wherein the clustering algorithm is a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm, an expectation-maximization clustering algorithm based on a Gaussian mixture model, an agglomerative hierarchical clustering algorithm, or a graph community detection method.
3. The malicious URL detection method according to claim 2, wherein in step S2 the Mini Batch K-means algorithm is used to cluster the URL data set to be labeled, so as to obtain the URL sample set to be labeled.
4. The malicious URL detection method according to claim 1, wherein step S3 further comprises: subtracting the unlabeled URL data set from the URL data set to be analyzed, thereby obtaining the final malicious URL data set.
5. A malicious URL detection system, comprising:
an active learning module (100), used to acquire a URL data set to be analyzed and a malicious URL training sample set, and to train an SVM (support vector machine) with the malicious URL training sample set and use it to classify the URL data set to be analyzed, obtaining a malicious URL data set and a URL data set to be labeled;
a labeling module (200), used to cluster the URL data set to be labeled with a clustering algorithm to obtain a URL sample set to be labeled; to label the URL sample set to be labeled according to a judgment of whether each sample is malicious, so that it is divided into a labeled malicious URL sample set and an unlabeled URL sample set; to merge the labeled malicious URL sample set with the malicious URL training sample set by set union to obtain an updated malicious URL training sample set; and to subtract the labeled malicious URL sample set from the URL data set to be labeled to obtain an updated URL test data set;
and an output module (300), used to train the SVM with the updated malicious URL training sample set, use it to classify the updated URL test data set, and output an unlabeled URL data set.
6. The malicious URL detection system according to claim 5, wherein the clustering algorithm is a K-means clustering algorithm, a mean shift clustering algorithm, a density-based clustering algorithm, an expectation-maximization clustering algorithm based on a Gaussian mixture model, an agglomerative hierarchical clustering algorithm, or a graph community detection method.
7. The malicious URL detection system according to claim 6, wherein the labeling module (200) is further used to cluster the URL data set to be labeled with the Mini Batch K-means algorithm, so as to obtain the URL sample set to be labeled.
8. The malicious URL detection system according to claim 5, wherein the output module (300) is used to train the SVM with the updated malicious URL training sample set, use it to classify the updated URL test data set, and output the unlabeled URL data set.
CN201911207542.5A 2019-11-29 2019-11-29 Malicious URL detection method and system Pending CN110912917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911207542.5A CN110912917A (en) 2019-11-29 2019-11-29 Malicious URL detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911207542.5A CN110912917A (en) 2019-11-29 2019-11-29 Malicious URL detection method and system

Publications (1)

Publication Number Publication Date
CN110912917A true CN110912917A (en) 2020-03-24

Family

ID=69821092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911207542.5A Pending CN110912917A (en) 2019-11-29 2019-11-29 Malicious URL detection method and system

Country Status (1)

Country Link
CN (1) CN110912917A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523620A (en) * 2020-07-03 2020-08-11 北京每日优鲜电子商务有限公司 Dynamic adjustment method and commodity verification method for commodity identification model
CN111680742A (en) * 2020-06-04 2020-09-18 甘肃电力科学研究院 Attack data labeling method applied to new energy plant station network security field
CN112615861A (en) * 2020-12-17 2021-04-06 赛尔网络有限公司 Malicious domain name identification method and device, electronic equipment and storage medium
CN114553496A (en) * 2022-01-28 2022-05-27 中国科学院信息工程研究所 Malicious domain name detection method and device based on semi-supervised learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102176701A (en) * 2011-02-18 2011-09-07 哈尔滨工业大学 Active learning based network data anomaly detection method
CN103150369A (en) * 2013-03-07 2013-06-12 人民搜索网络股份公司 Method and device for identifying cheat web-pages
CN104992184A (en) * 2015-07-02 2015-10-21 东南大学 Multiclass image classification method based on semi-supervised extreme learning machine
CN109831460A (en) * 2019-03-27 2019-05-31 杭州师范大学 A kind of Web attack detection method based on coordinated training
WO2019109743A1 (en) * 2017-12-07 2019-06-13 阿里巴巴集团控股有限公司 Url attack detection method and apparatus, and electronic device
CN110413924A (en) * 2019-07-18 2019-11-05 广东石油化工学院 A kind of Web page classification method of semi-supervised multiple view study
US20190349399A1 (en) * 2017-10-31 2019-11-14 Guangdong University Of Technology Character string classification method and system, and character string classification device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YA-LIN ZHANG, LONGFEI LI, JUN ZHOU, ET AL: "POSTER: A PU Learning based System for Potential Malicious URL Detection", 《ACM》 *
LIU LU, PENG TAO, ZUO WANLI, DAI YAOKANG: "Clustering-based PU active text classification method", 《Journal of Software》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680742A (en) * 2020-06-04 2020-09-18 甘肃电力科学研究院 Attack data labeling method applied to new energy plant station network security field
CN111523620A (en) * 2020-07-03 2020-08-11 北京每日优鲜电子商务有限公司 Dynamic adjustment method and commodity verification method for commodity identification model
CN111523620B (en) * 2020-07-03 2020-10-20 北京每日优鲜电子商务有限公司 Dynamic adjustment method and commodity verification method for commodity identification model
CN112615861A (en) * 2020-12-17 2021-04-06 赛尔网络有限公司 Malicious domain name identification method and device, electronic equipment and storage medium
CN114553496A (en) * 2022-01-28 2022-05-27 中国科学院信息工程研究所 Malicious domain name detection method and device based on semi-supervised learning
CN114553496B (en) * 2022-01-28 2022-11-15 中国科学院信息工程研究所 Malicious domain name detection method and device based on semi-supervised learning

Similar Documents

Publication Publication Date Title
CN110912917A (en) Malicious URL detection method and system
CN107067025B (en) Text data automatic labeling method based on active learning
US7570816B2 (en) Systems and methods for detecting text
CN109871954B (en) Training sample generation method, abnormality detection method and apparatus
Kosmidis et al. Machine learning and images for malware detection and classification
CN105897517A (en) Network traffic abnormality detection method based on SVM (Support Vector Machine)
CN111126576B (en) Deep learning training method
CN107943856A (en) A kind of file classification method and system based on expansion marker samples
CN111259219B (en) Malicious webpage identification model establishment method, malicious webpage identification method and malicious webpage identification system
CN113489685B (en) Secondary feature extraction and malicious attack identification method based on kernel principal component analysis
CN111222471A (en) Zero sample training and related classification method based on self-supervision domain perception network
CN108446559A (en) A kind of recognition methods of APT tissue and device
Fang et al. Sparse similarity metric learning for kinship verification
CN103942749A (en) Hyperspectral ground feature classification method based on modified cluster hypothesis and semi-supervised extreme learning machine
US8699796B1 (en) Identifying sensitive expressions in images for languages with large alphabets
CN116051479A (en) Textile defect identification method integrating cross-domain migration and anomaly detection
Chu et al. Co-training based on semi-supervised ensemble classification approach for multi-label data stream
CN113609488A (en) Vulnerability detection method and system based on self-supervised learning and multichannel hypergraph neural network
Jiang et al. Dynamic proposal sampling for weakly supervised object detection
Cheng et al. Tracing retinal blood vessels by matrix-forest theorem of directed graphs
Ghanmi et al. Table detection in handwritten chemistry documents using conditional random fields
Moller et al. Active learning for the classification of species in underwater images from a fixed observatory
Ullman et al. Smart vulnerability assessment for scientific cyberinfrastructure: An unsupervised graph embedding approach
CN113343123A (en) Training method and detection method for generating confrontation multiple relation graph network
CN117516937A (en) Rolling bearing unknown fault detection method based on multi-mode feature fusion enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200324)