CN113378955A - Intrusion detection method based on active learning - Google Patents


Publication number
CN113378955A
Authority
CN
China
Prior art keywords
samples
sample
training
active learning
detection method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110695864.XA
Other languages
Chinese (zh)
Inventor
徐润
陈林森
胡兵轩
杨涵
陈挺
杨隽奎
郑智浩
周仲波
邓德茂
覃禹铭
王龙海
余云昊
李勇
江再能
董双
金基伟
任庭昊
代启灿
李瑶
王开波
唐剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Power Grid Co Ltd
Original Assignee
Guizhou Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Power Grid Co Ltd filed Critical Guizhou Power Grid Co Ltd
Priority to CN202110695864.XA priority Critical patent/CN113378955A/en
Publication of CN113378955A publication Critical patent/CN113378955A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The invention discloses an intrusion detection method based on active learning, comprising: collecting historical data from system logs and preprocessing it to obtain a labeled sample data set; constructing a detection classification model based on an active learning strategy and training it with a semi-supervised transductive support vector machine (TSVM) to form a multi-classifier detector; and performing cluster analysis with the K-Means clustering algorithm and outputting a detection result in combination with the trained detection classification model. Rather than relying on the classification result of a single classifier to decide how a sample is labeled, the proposed algorithm trains several classifiers and labels each sample according to their vote, which improves labeling accuracy.

Description

Intrusion detection method based on active learning
Technical Field
The invention relates to the technical field of classification detection, in particular to an intrusion detection method based on active learning.
Background
A transductive support vector machine (TSVM) is a maximum-margin classification method based on the low-density separation assumption. Like a traditional support vector machine, it takes the separating hyperplane with the largest margin as the optimal classification hyperplane, but it considers both unlabeled and labeled data when training the classification model.
Traditional machine learning trains on a given set of labeled samples and induces a model from them, which is called inductive learning. In practice, however, labeled samples are very limited, and labeling a large number of unlabeled samples is time-consuming, laborious and tedious. Active learning was proposed to reduce the labeling cost, keep the training set as small as possible, and thereby address the shortage of labeled samples while optimizing the classification model. In active learning, the learner actively selects the unlabeled samples that would most improve the classifier (that is, the most informative samples) and submits them to a user or domain expert for labeling. The newly labeled samples are then added to the training set and take part in the next round of training. In this way, higher classification accuracy can be obtained with a smaller training set, which lowers both the cost of labeling samples and the cost of training a high-performance classifier.
On the one hand, every type of algorithm has advantages and disadvantages: some algorithms detect a certain attack type well but perform poorly on other attack types. On the other hand, many studies focus on improving overall detection accuracy and do not perform well on the minority class (attack samples). In practice, attack samples are extremely imbalanced relative to normal samples, so attention should be paid to how well the intrusion detection classifier detects attack samples.
Intrusion detection with small samples therefore raises two problems. Case one: how to build a model when normal samples far outnumber attack samples. Case two: how to build a model from both kinds of samples when labeled samples are very scarce and unlabeled samples are abundant.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above-mentioned conventional problems.
Therefore, the invention provides an intrusion detection method based on active learning, which addresses the insufficient classification accuracy of current intrusion detection.
To solve the above technical problem, the invention provides the following technical solution: collecting historical data from system logs and preprocessing it to obtain a labeled sample data set; constructing a detection classification model based on an active learning strategy, and training it with a semi-supervised transductive support vector machine to form a multi-classifier detector; and performing cluster analysis with the K-Means clustering algorithm, and outputting a detection result in combination with the trained detection classification model.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the preprocessing comprises normalization; the active learning strategies include membership queries, stream-based selective sampling, and pool-based selective sampling.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: training the detection classification model in combination with the semi-supervised transductive support vector machine comprises defining a set of independent and identically distributed labeled samples,
$$\{(x_1, y_1), \ldots, (x_l, y_l)\},\quad x_i \in \mathbb{R}^n,\ y_i \in \{-1, +1\},\ i = 1, \ldots, l$$
the unlabeled samples being
$$\{x_{l+1}, \ldots, x_{l+u}\}$$
the learning process of the semi-supervised transductive support vector machine is the solution of the following optimization problem,
$$\min_{y_{l+1}, \ldots, y_{l+u},\ w,\ b,\ \xi_1, \ldots, \xi_l,\ \xi_{l+1}, \ldots, \xi_{l+u}}\ \frac{1}{2}\|w\|^2 + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{j=l+1}^{l+u} \xi_j$$
s.t.:
$$y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l; \qquad y_j (w \cdot x_j + b) \ge 1 - \xi_j,\ \xi_j \ge 0,\ j = l+1, \ldots, l+u$$
wherein $C_1$ and $C_2$ are set by the user to control the penalty for misclassified samples, $C_2$ is the influence factor of the unlabeled data in the training process, and $C_2 \xi_j$ is called the influence term of the $j$-th unlabeled sample in the objective function.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the training process comprises setting the parameters $C_1$ and $C_2$, training on the labeled samples by inductive learning to obtain an initial classifier, and setting the estimated number $N$ of positive samples among the unlabeled samples; calculating the decision function values of all unlabeled samples with the initial classifier; labeling the $N$ unlabeled samples with the largest decision function values as positive samples and the remaining unlabeled samples as negative samples, and setting a temporary influence factor $C_{temp}$.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the SVM model is retrained on all labeled samples; for the newly generated classifier, the labels of pairs of samples are exchanged so as to reduce the objective function as much as possible, and this is repeated until no pair of samples satisfies the exchange condition; $C_{temp}$ is then uniformly increased and the process is repeated; when $C_{temp} \ge C_2$, the algorithm terminates and returns the labels of all unlabeled samples.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the cluster analysis comprises clustering the labeled sample set with the K-Means clustering algorithm and extracting samples from each cluster in a given proportion to form $n$ sub-sample sets, where $n$ is an odd number greater than 1, which serve as training sets; training on the $n$ training sets to obtain $n$ initial classifiers $C_1, C_2, \ldots, C_n$; predicting each unlabeled sample with the $n$ initial classifiers and outputting $f_1, f_2, \ldots, f_n$; and labeling the unlabeled samples and deciding according to a termination condition whether to iterate further.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the clustering comprises defining the labeled sample set as $L = \{x_1, x_2, \ldots, x_l\}$ and the number of clusters as $K$; the iteration round is $r$, with initial value 0; setting the initial $K$ cluster centers as
$$m_1^{(r)}, m_2^{(r)}, \ldots, m_K^{(r)}$$
defining the set of samples of the $i$-th class as
$$S_i^{(r)},\quad i = 1, \ldots, K$$
for any sample $x_j$, $j = 1, \ldots, l$, if the distance from $x_j$ to the cluster center $m_i^{(r)}$ is the shortest, adding $x_j$ to class $S_i^{(r)}$:
$$S_i^{(r)} = \{x_j : \|x_j - m_i^{(r)}\| \le \|x_j - m_k^{(r)}\|,\ k = 1, \ldots, K\}$$
recalculating the $K$ cluster centers as follows:
$$m_i^{(r+1)} = \frac{1}{n_i} \sum_{x_j \in S_i^{(r)}} x_j$$
where $n_i = |S_i^{(r)}|$ is the number of samples in $S_i^{(r)}$.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the method further comprises
defining a clustering criterion function and calculating the clustering error as follows,
$$E(r) = \sum_{i=1}^{K} \sum_{x_j \in S_i^{(r)}} \|x_j - m_i^{(r+1)}\|^2$$
judging whether the stop condition is reached;
if $|E(r-1) - E(r)|$ is less than the preset error value, the finally obtained clusters and cluster centers are, respectively,
$$S_i^{(r)},\quad m_i^{(r+1)},\quad i = 1, \ldots, K$$
otherwise, $r$ is set to $r + 1$ and the assignment and update steps are repeated.
The invention has the following beneficial effects. First, rather than relying on the classification result of a single classifier to decide how a sample is labeled, the proposed algorithm trains several classifiers and labels each sample according to their vote, which improves labeling accuracy. Second, training several classifiers requires dividing the training set into several small sub-training sets, and how they are divided determines the performance of the trained classifiers; an improved clustering algorithm that takes the geometric characteristics and spatial distribution of the labeled samples into account is used to cluster the training set, and a given proportion of samples is extracted from the clusters to construct new training sets and train the classifiers, which improves classifier performance. Third, because each training set is small, the training time required by each classifier in each iteration is relatively small. Fourth, the final classifier is trained on the originally labeled samples together with the samples labeled during training, which improves both the training speed and the performance of the classifier.
Drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without inventive effort. Wherein:
Fig. 1 is a schematic flowchart of an intrusion detection method based on active learning according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of three scenarios of an active learning strategy of an intrusion detection method based on active learning according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the framework of the multi-classifier voting labeling strategy of an intrusion detection method based on active learning according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
The present invention is described in detail with reference to the drawings. For convenience of illustration, cross-sectional views showing device structures are partially enlarged and not drawn to general scale; the drawings are examples only and should not be construed as limiting the scope of the present invention. In addition, the three dimensions of length, width and depth should be included in actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Example 1
Referring to Figs. 1 to 3, a first embodiment of the present invention provides an intrusion detection method based on active learning, which specifically comprises:
S1: Collect historical data from system logs and preprocess it to obtain a labeled sample data set.
S2: Construct a detection classification model based on an active learning strategy, and train it with a semi-supervised transductive support vector machine to form a multi-classifier detector.
S3: Perform cluster analysis with the K-Means clustering algorithm, and output a detection result in combination with the trained detection classification model.
The preprocessing comprises normalization.
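The normalization step can be illustrated with a minimal sketch. The source does not say which normalization is used; min-max scaling of each feature column into [0, 1] is assumed here, and the function name `min_max_normalize` and the sample records are hypothetical.

```python
def min_max_normalize(rows):
    """Scale every feature column of `rows` into [0, 1] (assumed scheme)."""
    if not rows:
        return []
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        scaled = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # A constant column carries no information; map it to 0.0.
            scaled.append((value - low) / span if span else 0.0)
        normalized.append(scaled)
    return normalized


# Illustrative log-derived records: [connection duration, bytes sent]
records = [[0.0, 200.0], [5.0, 400.0], [10.0, 300.0]]
scaled = min_max_normalize(records)
```

Scaling features to a common range keeps attributes with large numeric ranges from dominating the distance computations used later by K-Means and the SVM.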
The active learning strategies comprise membership queries, stream-based selective sampling, and pool-based selective sampling.
Referring to Fig. 2, the three strategies work as follows.
Membership query: the learner constructs the queries itself, so a generated sample may not exist in the original sample set; the attribute values of every generated sample follow the learner's own criteria, and the main goal is to construct the queries that most improve the performance of the learner.
Stream-based selective sampling: unlabeled samples are submitted to a selection engine one by one in order, and the engine decides whether each sample is to be labeled; samples that are not selected are discarded. The method can be tuned to suit different streaming conditions, but it cannot compare unlabeled samples with one another, so an evaluation index and a corresponding threshold must be set according to some principle; a sample submitted to the selection engine is selected for labeling if its evaluation index exceeds the threshold.
Pool-based selective sampling: a pool of unlabeled samples is maintained, and the selection engine selects from the pool, according to some principle, the samples that need to be labeled.
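The difference between the stream-based and pool-based scenarios can be sketched as follows. The source leaves the evaluation index unspecified; this sketch assumes an uncertainty score derived from a classifier's decision value (closer to zero means more informative). The names `informativeness`, `stream_select` and `pool_select` are hypothetical.

```python
def informativeness(decision_value):
    """Assumed evaluation index: larger when the sample is nearer the boundary."""
    return 1.0 / (1.0 + abs(decision_value))


def stream_select(decision_value, threshold):
    """Stream-based: decide for one incoming sample, without seeing the others."""
    return informativeness(decision_value) > threshold


def pool_select(pool_decision_values, batch_size):
    """Pool-based: rank the whole pool and return the indices of the best samples."""
    ranked = sorted(
        range(len(pool_decision_values)),
        key=lambda i: informativeness(pool_decision_values[i]),
        reverse=True,
    )
    return ranked[:batch_size]


pool = [1.8, -0.1, 0.4, -2.3, 0.05]
chosen = pool_select(pool, batch_size=2)
```

The stream-based variant needs a fixed threshold because it sees one sample at a time, whereas the pool-based variant can compare samples directly and simply take the top of the ranking.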
Training the detection classification model in combination with the semi-supervised transductive support vector machine comprises the following:
a set of independent and identically distributed labeled samples is defined as
$$\{(x_1, y_1), \ldots, (x_l, y_l)\},\quad x_i \in \mathbb{R}^n,\ y_i \in \{-1, +1\},\ i = 1, \ldots, l$$
the unlabeled samples are
$$\{x_{l+1}, \ldots, x_{l+u}\}$$
the learning process of the semi-supervised transductive support vector machine is the solution of the following optimization problem,
$$\min_{y_{l+1}, \ldots, y_{l+u},\ w,\ b,\ \xi_1, \ldots, \xi_l,\ \xi_{l+1}, \ldots, \xi_{l+u}}\ \frac{1}{2}\|w\|^2 + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{j=l+1}^{l+u} \xi_j$$
s.t.:
$$y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l; \qquad y_j (w \cdot x_j + b) \ge 1 - \xi_j,\ \xi_j \ge 0,\ j = l+1, \ldots, l+u$$
wherein $C_1$ and $C_2$ are set by the user to control the penalty for misclassified samples, $C_2$ is the influence factor of the unlabeled data in the training process, and $C_2 \xi_j$ is called the influence term of the $j$-th unlabeled sample in the objective function.
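To make the roles of $C_1$ and $C_2$ concrete, the following sketch evaluates a TSVM objective for a fixed hyperplane and a fixed guess of the unlabeled labels. It assumes the standard TSVM form (margin term plus hinge slacks weighted by $C_1$ for labeled and $C_2$ for unlabeled samples) and a linear decision function; the function name `tsvm_objective` and the numbers are illustrative only.

```python
def tsvm_objective(w, b, labeled, unlabeled, c1, c2):
    """Evaluate 0.5*||w||^2 + C1*sum(labeled slacks) + C2*sum(unlabeled slacks).

    labeled / unlabeled are lists of (x, y) with y in {-1, +1}; for unlabeled
    samples y is the currently guessed label.
    """
    def decision(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    def slack(x, y):
        # Smallest xi >= 0 satisfying y * (w.x + b) >= 1 - xi
        return max(0.0, 1.0 - y * decision(x))

    margin_term = 0.5 * sum(wi * wi for wi in w)
    labeled_term = c1 * sum(slack(x, y) for x, y in labeled)
    unlabeled_term = c2 * sum(slack(x, y) for x, y in unlabeled)
    return margin_term + labeled_term + unlabeled_term


value = tsvm_objective(
    w=[1.0, 0.0],
    b=0.0,
    labeled=[([2.0, 0.0], 1), ([-0.5, 0.0], -1)],
    unlabeled=[([0.25, 1.0], 1)],
    c1=1.0,
    c2=0.5,
)
```

With $C_2 < C_1$, a margin violation by an unlabeled sample costs less than the same violation by a labeled one, which reflects the lower trust placed in guessed labels.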
Further, the training process comprises:
setting the parameters $C_1$ and $C_2$, training on the labeled samples by inductive learning to obtain an initial classifier, and setting the estimated number $N$ of positive samples among the unlabeled samples;
calculating the decision function values of all unlabeled samples with the initial classifier;
labeling the $N$ unlabeled samples with the largest decision function values as positive samples and the remaining unlabeled samples as negative samples, and setting a temporary influence factor $C_{temp}$;
retraining the SVM model on all labeled samples and, for the newly generated classifier, exchanging the labels of pairs of samples so as to reduce the objective function as much as possible, repeating this until no pair of samples satisfies the exchange condition;
uniformly increasing $C_{temp}$ and repeating the previous step; when $C_{temp} \ge C_2$, the algorithm terminates and returns the labels of all unlabeled samples.
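The three building blocks of this training procedure can be sketched without an SVM solver. The source does not state the exchange condition; the sketch assumes the switching condition from Joachims' TSVM (two samples with opposite labels, both with positive slack, whose slacks sum to more than 2), and it assumes a fixed additive step for $C_{temp}$. All function names are hypothetical.

```python
def initial_labels(decision_values, n_positive):
    """Label the n_positive samples with the largest decision values +1, the rest -1."""
    order = sorted(range(len(decision_values)),
                   key=lambda i: decision_values[i], reverse=True)
    labels = [-1] * len(decision_values)
    for i in order[:n_positive]:
        labels[i] = 1
    return labels


def find_swap_pair(labels, slacks):
    """Return (i, j) whose label exchange lowers the objective, or None.

    Assumed condition: labels[i] = +1, labels[j] = -1, both slacks positive,
    and slacks[i] + slacks[j] > 2.
    """
    for i, (yi, si) in enumerate(zip(labels, slacks)):
        if yi != 1 or si <= 0:
            continue
        for j, (yj, sj) in enumerate(zip(labels, slacks)):
            if yj == -1 and sj > 0 and si + sj > 2:
                return i, j
    return None


def c_temp_schedule(c_start, c2, step):
    """Yield C_temp values raised by a fixed step; stop once C_temp >= C2."""
    c_temp = c_start
    while c_temp < c2:
        yield c_temp
        c_temp += step


labels = initial_labels([0.9, -0.2, 0.4, -1.1], n_positive=2)
pair = find_swap_pair(labels, slacks=[1.5, 0.0, 0.0, 0.8])
schedule = list(c_temp_schedule(c_start=0.25, c2=1.0, step=0.25))
```

In a full implementation, each value of the schedule would drive one retraining of the SVM followed by repeated calls to `find_swap_pair` until it returns `None`.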
Referring to Fig. 3, to increase the training speed of the TSVM and the accuracy of sample labeling in each iteration, this embodiment labels unlabeled samples with a multi-classifier collaborative voting mechanism. On the one hand, this reduces the time complexity of iterative training; on the other hand, because several classifiers determine the class of a sample by voting, it improves the accuracy of sample labeling in each iteration. The method specifically comprises:
dividing the whole sample set into a labeled sample set L and an unlabeled sample set U;
clustering the labeled sample set with the K-Means clustering algorithm and extracting samples from each cluster in a given proportion to form n sub-sample sets, where n is an odd number greater than 1, which serve as training sets;
training on the n training sets to obtain n initial classifiers $C_1, C_2, \ldots, C_n$;
predicting each unlabeled sample with the n initial classifiers and outputting $f_1, f_2, \ldots, f_n$;
labeling the unlabeled samples, and deciding according to a termination condition whether to iterate further.
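The voting step can be sketched as a simple majority over the signs of the classifier outputs $f_1, \ldots, f_n$. The sketch assumes that a positive decision value means the positive class; the function name `majority_vote` is hypothetical. Requiring an odd number of classifiers, as the method does, rules out ties.

```python
def majority_vote(outputs):
    """Return +1 or -1 according to the majority sign of the classifier outputs."""
    if len(outputs) % 2 == 0:
        raise ValueError("the number of classifiers must be odd")
    positive_votes = sum(1 for f in outputs if f > 0)
    return 1 if positive_votes > len(outputs) // 2 else -1


label_a = majority_vote([0.8, 0.3, -0.4, 0.1, -0.9])  # three of five vote positive
label_b = majority_vote([-0.2, 0.7, -0.5])            # two of three vote negative
```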
Specifically, the clustering comprises:
defining the labeled sample set as $L = \{x_1, x_2, \ldots, x_l\}$ and the number of clusters as $K$;
the iteration round is $r$, with initial value 0;
setting the initial $K$ cluster centers as
$$m_1^{(r)}, m_2^{(r)}, \ldots, m_K^{(r)}$$
defining the set of samples of the $i$-th class as
$$S_i^{(r)},\quad i = 1, \ldots, K$$
for any sample $x_j$, $j = 1, \ldots, l$, if the distance from $x_j$ to the cluster center $m_i^{(r)}$ is the shortest, adding $x_j$ to class $S_i^{(r)}$:
$$S_i^{(r)} = \{x_j : \|x_j - m_i^{(r)}\| \le \|x_j - m_k^{(r)}\|,\ k = 1, \ldots, K\}$$
recalculating the $K$ cluster centers as follows:
$$m_i^{(r+1)} = \frac{1}{n_i} \sum_{x_j \in S_i^{(r)}} x_j$$
where $n_i = |S_i^{(r)}|$ is the number of samples in $S_i^{(r)}$;
defining a clustering criterion function and calculating the clustering error as follows,
$$E(r) = \sum_{i=1}^{K} \sum_{x_j \in S_i^{(r)}} \|x_j - m_i^{(r+1)}\|^2$$
judging whether the stop condition is reached;
if $|E(r-1) - E(r)|$ is less than the preset error value, the finally obtained clusters and cluster centers are, respectively,
$$S_i^{(r)},\quad m_i^{(r+1)},\quad i = 1, \ldots, K$$
otherwise, $r$ is set to $r + 1$ and the assignment and update steps are repeated.
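The clustering loop described above can be sketched in plain Python. This is a standard K-Means with the stop rule "change in clustering error below a preset value"; squared Euclidean distance, the `max_rounds` safeguard and the handling of empty clusters are assumptions not stated in the source, and the function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(samples, centers, tolerance=1e-6, max_rounds=100):
    """Cluster `samples` starting from `centers`; return (clusters, centers)."""
    previous_error = None
    clusters = []
    for _ in range(max_rounds):
        # Assignment step: each sample joins the class of its nearest center.
        clusters = [[] for _ in centers]
        for x in samples:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(x, centers[i]))
            clusters[nearest].append(x)
        # Update step: each center becomes the mean of its class
        # (an empty class keeps its old center, an assumption of this sketch).
        centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else old
            for members, old in zip(clusters, centers)
        ]
        # Clustering error: total squared distance to the updated centers.
        error = sum(squared_distance(x, c)
                    for members, c in zip(clusters, centers) for x in members)
        if previous_error is not None and abs(previous_error - error) < tolerance:
            break
        previous_error = error
    return clusters, centers


samples = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
clusters, centers = k_means(samples, centers=[[1.0, 1.0], [9.0, 1.0]])
```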
Example 2
Unlike the first embodiment, this embodiment explains how the intrusion detection method based on active learning handles sample labeling and iterative training, specifically as follows:
(1) Select the samples to be labeled.
Pairwise sample labeling in the TSVM has high labeling accuracy, but labeling is very slow. Region-based labeling can label many samples at once and is fast, but its labeling accuracy cannot be guaranteed. This embodiment therefore provides a voting-based labeling method using multiple classifiers: m samples in the boundary region on which more than half of the classifiers agree (that is, the majority decides) are selected for labeling; if the m samples satisfy the maximum classification hyperplane, they are labeled as the positive class, and if the m samples satisfy the minimum classification hyperplane, they are labeled as the negative class.
Define the classification hyperplane as $f(x) = w \cdot x + b$ and the unlabeled sample set as $U = \{x_{l+1}, x_{l+2}, \ldots, x_{l+u}\}$; the distance from a sample $x_i$ to the classification hyperplane is then
$$d(x_i) = \frac{|f(x_i)|}{\|w\|}$$
To select for labeling the unlabeled samples most likely to be support vectors and to speed up labeling, each iteration selects the unlabeled samples with the p largest values on the maximum classification hyperplane side and labels them as positive samples, and selects the unlabeled samples with the q smallest values on the minimum classification hyperplane side and labels them as negative samples.
Figure BDA0003127871930000084
Figure BDA0003127871930000085
The values of p and q determine how many samples are labeled in one iteration, that is, the learning speed of the transductive learning. When p and q are both 1, the method reduces to pairwise labeling; when p and q are greater than 1, each iteration labels p or q samples in the boundary region of the optimal classification hyperplane. The values of p and q can be tuned for the actual application scenario.
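The batch selection controlled by p and q can be sketched as follows. The exact selection formulas are not recoverable from the source, so this sketch follows the prose literally: among samples with positive decision values, take the p largest as positives; among samples with negative decision values, take the q smallest as negatives. The function name `select_for_labeling` is hypothetical.

```python
def select_for_labeling(decision_values, p, q):
    """Return (positive_indices, negative_indices) chosen in one iteration."""
    positives = [i for i, f in enumerate(decision_values) if f > 0]
    negatives = [i for i, f in enumerate(decision_values) if f < 0]
    positives.sort(key=lambda i: decision_values[i], reverse=True)
    negatives.sort(key=lambda i: decision_values[i])
    return positives[:p], negatives[:q]


values = [0.9, -1.4, 0.2, -0.3, 1.7, -0.8]
pos, neg = select_for_labeling(values, p=2, q=2)
pair_pos, pair_neg = select_for_labeling(values, p=1, q=1)  # pairwise labeling
```

Setting `p = q = 1` reproduces the pairwise case described above, while larger values label more samples per iteration at the cost of a higher risk of wrong labels.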
(2) Add the labeled samples to the corresponding classifiers and continue the iterative training.
After the samples are labeled, if the stopping condition of the algorithm has not been reached, the samples need to be added to the training sets for further iteration. Note that a sample is not added to all training sets, but only to the training set of a classifier whose output class label agrees with the label assigned to the sample.
For example, suppose that among the five classifiers A, B, C, D and E the majority output the positive class and E outputs the negative class. Adding the sample to the training set of classifier E is clearly inappropriate. Moreover, if the outputs of classifiers A and B both satisfy the maximum classification hyperplane, with A outputting 0.05 and B outputting 0.95, then A is much less likely than B to have labeled the sample correctly. It follows that the labeled sample should be added to the training set of the classifier whose predicted class matches the assigned label and whose output value is largest, which preserves both the diversity of the training sets and the accuracy of sample labeling.
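The routing rule from this example can be sketched as a small helper. It assumes the classifier outputs are signed decision values and picks, among the classifiers that agree with the voted label, the one with the largest absolute output. The function name `target_training_set` is hypothetical; the outputs for A, B and E mirror the example, while those for C and D are made up for illustration.

```python
def target_training_set(outputs, voted_label):
    """Return the index of the agreeing classifier with the largest output, or None."""
    agreeing = [k for k, f in enumerate(outputs) if f * voted_label > 0]
    if not agreeing:
        return None
    return max(agreeing, key=lambda k: abs(outputs[k]))


# Classifiers A..E; the values for C and D are illustrative assumptions.
outputs = [0.05, 0.95, 0.4, 0.6, -0.7]
chosen = target_training_set(outputs, voted_label=1)
```

Here the sample goes to the training set of classifier B (index 1), never to that of E, whose output disagrees with the vote.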
(3) Determine the iteration termination condition.
During the iterative process of sample labeling and model training, if the class (positive or negative) assigned to a sample in the current iteration differs from the class assigned to it in the previous iteration, the label needs to be reset, that is, the sample needs to be labeled again.
Specifically:
the sample is treated as an unlabeled sample, deleted from the corresponding training set, and carried into the next iteration;
if no sample needs to be reset in an iteration and no unlabeled sample satisfies the labeling condition, the iteration stops;
after the iteration stops, the n training subsets are merged, and the final classifier is trained on the merged set.
The algorithm used in this embodiment determines the class of each labeled sample with the collaborative voting mechanism of multiple classifiers, which improves labeling accuracy, and labels samples in batches, which improves labeling efficiency.
The specific steps of the algorithm are described as follows:
Algorithm: TSVM based on multi-classifier collaborative labeling.
Input: a labeled sample set L; an unlabeled sample set U; the number of classifiers n.
Output: the final TSVM classifier.
S1: Cluster the labeled sample set L with the K-Means algorithm, and extract samples from each cluster in a given proportion to form n sub-training sets, denoted $L_1, L_2, \ldots, L_n$;
S2: Train on the n sub-training sets with the SVM algorithm to obtain n initial classifiers $C_1, C_2, \ldots, C_n$;
S3: Input each unlabeled sample $x_i$ into the n initial classifiers $C_1, C_2, \ldots, C_n$ to obtain n outputs $f_1^i, f_2^i, \ldots, f_n^i$;
S4: For any unlabeled sample $x_j$, if the classification results of the n classifiers satisfy the maximum classification hyperplane, label it as the positive class; if the classification results of the n classifiers satisfy the minimum classification hyperplane, label it as the negative class;
S5: If the class currently assigned to $x_j$ is inconsistent with the class assigned previously, the label is reset and the sample is deleted from the corresponding training set;
if the currently assigned class is consistent with the previous stage,
Figure BDA0003127871930000103
is not consistent with the previous stage, then the sample is added to $L_j$;
if the sample was not labeled at an earlier stage, determine the $j$ that satisfies
Figure BDA0003127871930000104
and add the sample to $L_j$; otherwise, stop the iteration and jump to S8;
S6: Repeat S4 and S5 until all unlabeled samples are labeled;
S7: After the new training subsets are obtained, retrain on them to obtain new classifiers $C_1^{new}, C_2^{new}, \ldots, C_n^{new}$.
If a training set is unchanged from the previous round, the classifier from the previous round continues to be used for it; then jump to S3;
S8: Merge the training subsets into a final training set, and retrain on it to obtain the final classifier.
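Step S1 of the algorithm above, building n sub-training sets from the clusters, can be sketched as follows. The source only says that samples are extracted from each cluster "according to a certain proportion"; this sketch assumes independent random draws without replacement per subset, so different subsets may overlap. The function name `build_sub_training_sets`, the `ratio` parameter and the seed are hypothetical.

```python
import random


def build_sub_training_sets(clusters, n_subsets, ratio, seed=0):
    """Draw round(ratio * cluster size) samples (at least one) from every
    non-empty cluster for each subset."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        subset = []
        for members in clusters:
            if not members:
                continue  # an empty cluster contributes nothing
            take = max(1, round(ratio * len(members)))
            subset.extend(rng.sample(members, take))
        subsets.append(subset)
    return subsets


# Two illustrative clusters of sample identifiers
clusters = [list(range(10)), list(range(100, 104))]
subsets = build_sub_training_sets(clusters, n_subsets=3, ratio=0.5)
```

Drawing from every cluster keeps each sub-training set representative of the spatial distribution of the labeled samples, which is the stated motivation for clustering before splitting.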
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (8)

1. An intrusion detection method based on active learning, characterized by comprising:
acquiring historical data from system logs and preprocessing the historical data to obtain a labeled sample data set;
constructing a detection classification model based on an active learning strategy, and training the detection classification model in combination with a semi-supervised transductive support vector machine to form a detection multi-classifier;
performing cluster analysis using the K-Means clustering algorithm, and outputting a detection result in combination with the trained detection classification model.
2. The active learning-based intrusion detection method according to claim 1, wherein: the preprocessing comprises normalization;
the active learning strategies include membership queries, stream-based selective sampling, and pool-based selective sampling.
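The claim does not say which normalization is used. As a minimal sketch, assuming min-max scaling of each feature column to [0, 1] (the function name `min_max_normalize` and the handling of constant columns are illustrative choices, not taken from the patent):

```python
def min_max_normalize(rows):
    """Scale each feature column of `rows` to [0, 1] (min-max normalization).

    Assumption: the patent only says "normalization"; min-max is one common choice.
    """
    if not rows:
        return []
    n_features = len(rows[0])
    lows = [min(r[k] for r in rows) for k in range(n_features)]
    highs = [max(r[k] for r in rows) for k in range(n_features)]
    normalized = []
    for r in rows:
        scaled = []
        for k in range(n_features):
            span = highs[k] - lows[k]
            # A constant column carries no information; map it to 0.0.
            scaled.append((r[k] - lows[k]) / span if span else 0.0)
        normalized.append(scaled)
    return normalized
```

Scaling features to a common range matters here because both the SVM margin and the K-Means distances are sensitive to feature magnitude.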
3. The active learning-based intrusion detection method according to claim 1 or 2, wherein: training the detection classification model in combination with the semi-supervised transductive support vector machine comprises:
defining a set of independent and identically distributed labeled samples,
$\{(x_1, y_1), \ldots, (x_l, y_l)\} \in R^n \times R,\ i = 1, \ldots, l,\ y_i \in \{-1, +1\}$
and a set of unlabeled samples,
$\{x_{l+1}, \ldots, x_{l+u}\}$
the learning process of the semi-supervised transductive support vector machine is the solving of an optimization problem, as follows,
$\min\ (y_1, \ldots, y_n, w, b, \xi_1, \ldots, \xi_l, \xi_{l+1}, \ldots, \xi_{l+u})$
Figure FDA0003127871920000011
Figure FDA0003127871920000012
Figure FDA0003127871920000013
where $C_1$ and $C_2$ are set by the user to control the penalty for misclassified samples; $C_2$ is the influence factor of the unlabeled data in the training process, and $C_2\xi_j$ is called the influence term of the j-th unlabeled sample in the objective function.
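The objective and constraints of this claim survive only as figure references. For orientation, the widely used transductive SVM formulation that matches the variables and the roles of $C_1$, $C_2$ and $C_2\xi_j$ described above is sketched below. It is an assumption that the omitted figures correspond to this form; note also that the standard formulation optimizes only the labels of the unlabeled samples.

```latex
\begin{aligned}
\min_{y_{l+1},\ldots,y_{l+u},\; w,\; b,\; \xi_1,\ldots,\xi_{l+u}} \quad
  & \tfrac{1}{2}\lVert w \rVert^2
    + C_1 \sum_{i=1}^{l} \xi_i
    + C_2 \sum_{j=l+1}^{l+u} \xi_j \\
\text{s.t.} \quad
  & y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\ldots,l, \\
  & y_j\,(w \cdot x_j + b) \ge 1 - \xi_j, \quad \xi_j \ge 0, \quad j = l+1,\ldots,l+u.
\end{aligned}
```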
4. The active learning-based intrusion detection method according to claim 3, wherein: the training process comprises:
setting the parameters $C_1$ and $C_2$, training on the labeled samples by inductive learning to obtain an initial classifier, and setting the estimated number N of positive samples among the unlabeled samples;
calculating the decision function values of all unlabeled samples using the initial classifier;
labeling the N unlabeled samples with the largest decision function values as positive samples and the remaining unlabeled samples as negative samples, and setting $C_{temp}$ as a temporary influence factor.
5. The active learning-based intrusion detection method according to claim 4, further comprising:
retraining the SVM model on all labeled samples and, for the newly generated classifier, exchanging the labels of pairs of samples so as to reduce the objective function as much as possible, repeating this process until no pair of samples satisfies the exchange condition;
uniformly increasing $C_{temp}$; when $C_{temp} \ge C_2$, the algorithm terminates and returns the labels of all unlabeled samples.
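Two of the steps in claims 4 and 5 can be illustrated without an SVM solver: the initial labeling by decision value, and the schedule for the temporary influence factor. The sketch below is hypothetical (the names `label_top_n` and `c_temp_schedule`, the starting value, and the fixed additive step are assumptions); retraining and label swapping are deliberately left out.

```python
def label_top_n(decision_values, n_positive):
    """Label the n_positive samples with the largest decision values +1, the rest -1."""
    order = sorted(range(len(decision_values)),
                   key=lambda i: decision_values[i], reverse=True)
    labels = [-1] * len(decision_values)
    for i in order[:n_positive]:
        labels[i] = +1
    return labels


def c_temp_schedule(c_start, c2, step):
    """Yield C_temp values raised by a fixed step; stop once C_temp >= C2.

    Assumption: "uniformly increased" is read as a constant additive step.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    c_temp = c_start
    while c_temp < c2:
        yield c_temp
        c_temp += step
```

In a full implementation, each value produced by `c_temp_schedule` would drive one retrain-and-swap phase on the labeled and provisionally labeled samples.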
6. The active learning-based intrusion detection method according to claim 5, wherein: the cluster analysis comprises:
using the K-Means clustering algorithm, extracting samples from each cluster in a given proportion to form n sub-sample sets, where n is an odd number greater than 1, and taking the n sub-sample sets as training sets;
training on the n training sets to obtain n initial classifiers $C_1, C_2, \ldots, C_n$;
predicting each unlabeled sample with the n initial classifiers and outputting $f_1, f_2, \ldots, f_n$;
labeling the unlabeled samples, and determining whether to iterate further according to a termination condition.
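The sampling and labeling steps of this claim can be sketched as follows. This is an illustration under stated assumptions: the patent does not say how $f_1, \ldots, f_n$ are combined, and majority voting is assumed here only because n is required to be odd. The helper names `sample_subsets` and `majority_vote` are hypothetical.

```python
import random


def sample_subsets(clusters, n_subsets, fraction, seed=0):
    """Draw n_subsets training subsets, taking the same fraction from every cluster."""
    if n_subsets <= 1 or n_subsets % 2 == 0:
        raise ValueError("n_subsets must be an odd number greater than 1")
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        subset = []
        for members in clusters:
            if not members:
                continue  # nothing to draw from an empty cluster
            # Take at least one sample so that every cluster is represented.
            k = min(len(members), max(1, round(fraction * len(members))))
            subset.extend(rng.sample(members, k))
        subsets.append(subset)
    return subsets


def majority_vote(predictions):
    """Combine the +1/-1 outputs f_1..f_n of an odd number of classifiers.

    Assumption: majority voting; the source does not state the combination rule.
    """
    return 1 if sum(predictions) > 0 else -1
```

Drawing from every cluster keeps each training subset representative of the whole labeled set, while the random draws make the n classifiers differ enough for their disagreement to be informative.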
7. The active learning-based intrusion detection method according to claim 6, wherein: the clustering comprises:
defining the labeled sample set as $L = \{x_1, x_2, \ldots, x_l\}$ and the number of clusters as K;
setting the iteration round r with an initial value of 0;
setting the initial K cluster centers as
Figure FDA0003127871920000021
defining the sample set corresponding to the i-th class as
Figure FDA0003127871920000022
for any sample $x_j$, $j = 1, \ldots, l$, if the distance from $x_j$ to the cluster center
Figure FDA0003127871920000023
is the shortest, adding the sample $x_j$ to the class
Figure FDA0003127871920000024
Figure FDA0003127871920000025
recalculating the K cluster centers as follows:
Figure FDA0003127871920000026
where
Figure FDA0003127871920000027
8. The active learning-based intrusion detection method according to claim 7, further comprising:
defining a clustering criterion function and calculating the clustering error as follows,
Figure FDA0003127871920000031
judging whether the stop condition is reached:
if $|E(t-1) - E(t)|$ is less than the preset error value, the finally obtained clusters and cluster centers are, respectively,
Figure FDA0003127871920000032
otherwise, r is set to r + 1.
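The clustering loop of claims 7 and 8 can be sketched in plain Python. Since the criterion function is only available as a figure reference, the sketch assumes the usual sum of squared distances to the assigned centers; the function name `kmeans`, the `eps` default, and the `max_rounds` safety cap are illustrative additions.

```python
def kmeans(samples, centers, eps=1e-6, max_rounds=100):
    """Plain K-Means on lists of floats; stops when the error changes by less than eps."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    centers = [list(c) for c in centers]
    clusters = [[] for _ in centers]
    prev_error = None
    for _ in range(max_rounds):
        # Assignment step: each sample joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for x in samples:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
            clusters[nearest].append(x)
        # Update step: each center becomes the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
        # Assumed criterion: sum of squared distances to the assigned centers.
        error = sum(sq_dist(x, centers[i])
                    for i, members in enumerate(clusters) for x in members)
        if prev_error is not None and abs(prev_error - error) < eps:
            break
        prev_error = error
    return clusters, centers
```

For example, `kmeans([[0.0], [1.0], [10.0], [11.0]], [[0.0], [10.0]])` separates the two low values from the two high values and places the centers at their means.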
CN202110695864.XA 2021-06-23 2021-06-23 Intrusion detection method based on active learning Pending CN113378955A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110695864.XA CN113378955A (en) 2021-06-23 2021-06-23 Intrusion detection method based on active learning

Publications (1)

Publication Number Publication Date
CN113378955A (en) 2021-09-10

Family

ID=77578692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110695864.XA Pending CN113378955A (en) 2021-06-23 2021-06-23 Intrusion detection method based on active learning

Country Status (1)

Country Link
CN (1) CN113378955A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102176701A (en) * 2011-02-18 2011-09-07 哈尔滨工业大学 Active learning based network data anomaly detection method
CN104318242A (en) * 2014-10-08 2015-01-28 中国人民解放军空军工程大学 High-efficiency SVM active half-supervision learning algorithm
CN111191238A (en) * 2019-12-30 2020-05-22 厦门服云信息科技有限公司 Webshell detection method, terminal device and storage medium
CN112115467A (en) * 2020-09-04 2020-12-22 长沙理工大学 Intrusion detection method based on semi-supervised classification of ensemble learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
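Du Hongle et al.: "Transductive support vector machine algorithm with collaborative labeling", Journal of Chinese Computer Systems *
Du Hongle et al.: "TSVM algorithm based on clustering and collaborative labeling", Henan Science *
Wang Limei et al.: "Transductive support vector machine learning algorithm based on k-means clustering", Computer Engineering and Applications *
Zhao Jianhua et al.: "Network intrusion detection algorithm combining active learning and semi-supervised learning", Journal of Xihua University (Natural Science Edition) *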

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880268A (en) * 2022-12-28 2023-03-31 南京航空航天大学 Method, system, equipment and medium for detecting defective products in plastic hose production
CN115880268B (en) * 2022-12-28 2024-01-30 南京航空航天大学 Method, system, equipment and medium for detecting inferior goods in plastic hose production

Similar Documents

Publication Publication Date Title
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
Chong et al. Simultaneous image classification and annotation
Li et al. Confidence-based active learning
Torralba et al. Sharing visual features for multiclass and multiview object detection
CN109034205A (en) Image classification method based on the semi-supervised deep learning of direct-push
CN101447020B (en) Pornographic image recognizing method based on intuitionistic fuzzy
CN108228569B (en) Chinese microblog emotion analysis method based on collaborative learning under loose condition
US20210319215A1 (en) Method and system for person re-identification
CN110647907B (en) Multi-label image classification algorithm using multi-layer classification and dictionary learning
CN113326731A (en) Cross-domain pedestrian re-identification algorithm based on momentum network guidance
CN107392241A (en) A kind of image object sorting technique that sampling XGBoost is arranged based on weighting
CN113378913B (en) Semi-supervised node classification method based on self-supervised learning
Peng et al. Text classification in Asian languages without word segmentation
CN110942091A (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN105930792A (en) Human action classification method based on video local feature dictionary
Schinas et al. CERTH@ MediaEval 2012 Social Event Detection Task.
CN107291936A (en) The hypergraph hashing image retrieval of a kind of view-based access control model feature and sign label realizes that Lung neoplasm sign knows method for distinguishing
CN110008365B (en) Image processing method, device and equipment and readable storage medium
JP5754310B2 (en) Identification information providing program and identification information providing apparatus
CN112418331A (en) Clustering fusion-based semi-supervised learning pseudo label assignment method
WO2020024444A1 (en) Group performance grade recognition method and apparatus, and storage medium and computer device
CN113378955A (en) Intrusion detection method based on active learning
CN110765285A (en) Multimedia information content control method and system based on visual characteristics
CN113920573B (en) Face change decoupling relativity relationship verification method based on counterstudy
Li et al. Face recognition using improved pairwise coupling support vector machines

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210910