CN113378955A - Intrusion detection method based on active learning - Google Patents
- Publication number
- CN113378955A (application CN202110695864.XA)
- Authority
- CN
- China
- Prior art keywords
- samples
- sample
- training
- active learning
- detection method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F18/00—Pattern recognition; G06F18/20—Analysing
  - G06F18/23213—Non-hierarchical clustering techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
  - G06F18/214—Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
  - G06F18/2411—Classification techniques relating to the classification model, based on the proximity to a decision surface, e.g. support vector machines
Abstract
The invention discloses an intrusion detection method based on active learning, which includes collecting historical data from system logs and preprocessing it to obtain a labeled sample data set; constructing a detection classification model based on an active learning strategy, and training the detection classification model in combination with a semi-supervised transductive support vector machine to form a detection multi-classifier; and performing cluster analysis with the K-Means clustering algorithm, outputting detection results in combination with the trained detection classification model. The proposed algorithm does not rely on the classification result of a single classifier to determine sample labels; instead, it trains multiple classifiers and determines labels from their voting results, which markedly improves labeling accuracy.
Description
Technical Field
The invention relates to the technical field of classification detection, in particular to an intrusion detection method based on active learning.
Background
A transductive support vector machine (TSVM) is a maximum-margin classification method based on the low-density separation assumption. Much like a traditional support vector machine, it takes the classification hyperplane with the largest margin as the optimal classification hyperplane, but it trains the classification model on both unlabeled and labeled data.
Traditional machine learning trains on a given labeled sample set and induces a learning model, an approach called inductive learning. In practical applications, however, labeled samples are very limited, and labeling a large number of unlabeled samples is time-consuming, labor-intensive, and tedious. To reduce labeling cost and keep the training sample set as small as possible, active learning was proposed to address the shortage of labeled samples and to optimize the classification model. In active learning, the learner actively selects the unlabeled samples most beneficial to improving the classifier (i.e., the samples carrying the most information) and submits them to a user or domain expert for labeling; the newly labeled samples are then added to the training set and participate in the next round of training. Higher classification accuracy can thus be obtained with a smaller training set, reducing both the cost of labeling samples and the cost of training a high-performance classifier.
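The query/label/retrain cycle just described can be sketched in a few lines. This is a minimal illustration only: the one-dimensional threshold classifier, the oracle, and all sample values are toy assumptions, not part of the patent.

```python
# Minimal sketch of one pool-based active-learning round with uncertainty
# sampling.  The 1-D threshold classifier, the oracle, and the values are
# toy assumptions made only to show the query/label/retrain cycle.

def decision_value(x, threshold):
    """Toy linear classifier: signed distance of x from the threshold."""
    return x - threshold

def most_informative(pool, threshold):
    """The sample closest to the decision boundary carries the most
    information for the learner."""
    return min(pool, key=lambda x: abs(decision_value(x, threshold)))

def active_learning_round(labeled, pool, oracle, threshold):
    """Query one sample, let the 'domain expert' label it, and move it to
    the labeled set; a real system would then retrain the classifier."""
    query = most_informative(pool, threshold)
    labeled.append((query, oracle(query)))
    pool.remove(query)
    return query

labeled = [(0.0, -1), (10.0, +1)]
pool = [1.0, 4.9, 9.0]  # 4.9 lies nearest the boundary at 5.0
picked = active_learning_round(labeled, pool,
                               lambda x: +1 if x > 5.0 else -1, 5.0)
```

A real system would repeat this round until a labeling budget is exhausted or the classifier stops improving.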
On the one hand, every class of algorithm has strengths and weaknesses: an algorithm may detect one attack type well but perform poorly on other attack types. On the other hand, much research has focused on improving overall detection accuracy while performing poorly on small-sample classes (attack samples). In practice, however, since attack samples are extremely unbalanced relative to normal samples, attention should be paid to the intrusion detection classifier's ability to detect attack samples.
Two modeling problems for intrusion detection with small samples therefore arise. Case one: how to model when normal samples far outnumber attack samples. Case two: how to model when labeled samples are very scarce while unlabeled samples are abundant.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above-mentioned conventional problems.
Therefore, the invention provides an intrusion detection method based on active learning, which can address the accuracy problem of current intrusion detection classification.
In order to solve the above technical problems, the invention provides the following technical scheme: acquiring historical data from system logs and preprocessing it to obtain a labeled sample data set; constructing a detection classification model based on an active learning strategy, and training the detection classification model in combination with a semi-supervised transductive support vector machine to form a detection multi-classifier; and performing cluster analysis with the K-Means clustering algorithm, outputting detection results in combination with the trained detection classification model.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the preprocessing includes normalization; the active learning strategies include membership query, stream-based selective sampling, and pool-based selective sampling.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the detection classification model is trained in combination with the semi-supervised transductive support vector machine, wherein a group of independent and identically distributed labeled samples is defined as

{(x_1, y_1), …, (x_l, y_l)} ∈ R^n × R, i = 1, …, l, y_i ∈ {−1, +1}

the unlabeled samples as

{x_{l+1}, …, x_{l+u}}

and the learning process of the semi-supervised transductive support vector machine as the process of solving the optimization problem

min over (y_{l+1}, …, y_{l+u}, w, b, ξ_1, …, ξ_l, ξ_{l+1}, …, ξ_{l+u}) of

(1/2)‖w‖² + C_1 Σ_{i=1}^{l} ξ_i + C_2 Σ_{j=l+1}^{l+u} ξ_j

s.t. y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, l
     y_j(w·x_j + b) ≥ 1 − ξ_j, ξ_j ≥ 0, y_j ∈ {−1, +1}, j = l+1, …, l+u

where C_1 and C_2 are set by the user to control the penalty for misclassified samples, C_2 is the influence factor of unlabeled data in the training process, and C_2 ξ_j is called the influence term of the j-th unlabeled sample in the objective function.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the training process includes setting the parameters C_1 and C_2, training on the labeled samples in an inductive learning mode to obtain an initial classifier, and setting the estimated number N of positive samples among the unlabeled samples; calculating the decision function values of all unlabeled samples with the initial classifier; labeling the N unlabeled samples with the largest decision function values as positive samples, labeling the remaining unlabeled samples as negative samples, and setting C_temp as a temporary influence factor.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the SVM model is retrained on all labeled samples; for the newly generated classifier, the labels of pairs of samples are exchanged according to the principle of reducing the objective function as much as possible, and this process is repeated until no sample pair satisfies the exchange condition; the value of C_temp is uniformly increased, and when C_temp ≥ C_2 the algorithm terminates and returns the labels of all unlabeled samples.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the cluster analysis includes extracting, with the K-Means clustering algorithm, a certain number of samples from each cluster of the labeled sample set in a certain proportion to form n sub-sample sets, where n is an odd number greater than 1, which serve as training sets; training n initial classifiers C_1, C_2, …, C_n on the n training sets; predicting each unlabeled sample with the n initial classifiers and outputting f_1, f_2, …, f_n; and labeling the unlabeled samples and deciding whether to iterate further according to a termination condition.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the clustering includes defining the labeled sample set as L = {x_1, x_2, …, x_l} with the number of clusters K; the iteration round is r, with initial value 0; the initial K cluster centers are set as c_1, …, c_K and the set of samples of the i-th class is defined as S_i; for any sample x_j, j = 1, …, l, if x_j is nearest to cluster center c_i, the sample x_j is added to the i-th class;

the K cluster centers are then recalculated, specifically as

c_i = (1/|S_i|) Σ_{x∈S_i} x, i = 1, …, K
as a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: also comprises the following steps of (1) preparing,
a clustering criterion function is defined and a clustering error is calculated, as follows,
judging whether a stop condition is reached;
if the value of | E (t-1) -E (t) | is less than the preset error value, the finally obtained cluster and cluster center are respectively:otherwise, r is set to r + 1.
The beneficial effects of the invention are as follows: (1) the proposed algorithm does not rely on the classification result of a single classifier to determine sample labels, but trains multiple classifiers and determines labels from their voting results, which markedly improves labeling accuracy; (2) when training multiple classifiers, the training set must be divided into several small sub-training sets, and how this division is made determines the performance of the trained classifiers; an improved clustering algorithm is adopted that fully considers the geometric and spatial distribution characteristics of the labeled samples, clusters the training set, and extracts samples in a certain proportion from the clustering result to construct new training sets and train the classifiers, effectively improving classifier performance; (3) in each iteration, because each training set is small, the training time overhead of each classifier is relatively low; (4) the final classifier is trained on the originally labeled samples together with the samples labeled during the process, which speeds up training while also improving classifier performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort. Wherein:
fig. 1 is a schematic flowchart of an intrusion detection method based on active learning according to an embodiment of the present invention;
fig. 2 is a schematic diagram of three scenarios of an active learning strategy of an intrusion detection method based on active learning according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a multi-classifier voting strategy labeling idea framework of an intrusion detection method based on active learning according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not enlarged partially in general scale for convenience of illustration, and the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Example 1
Referring to fig. 1 to 3, a first embodiment of the present invention provides an intrusion detection method based on active learning, which specifically includes:
s1: and acquiring historical data by using the system log and preprocessing the historical data to obtain a tag sample data set.
S2: and constructing a detection classification model based on an active learning strategy, and training the detection classification model by combining a semi-supervised direct-pushing support vector machine to form a detection multi-classifier.
S3: and performing clustering analysis by using a K-Means clustering algorithm, and outputting a detection result by combining the trained detection classification model.
The preprocessing includes: normalization.
The active learning strategies include: membership query, stream-based selective sampling, and pool-based selective sampling.
Referring to fig. 2, membership query means that the learner constructs queries itself; a constructed sample may not exist in the original sample set, and its attribute values are chosen by the learner, the main objective being to construct the queries most useful for improving the learner's performance. Stream-based selective sampling means that unlabeled samples are submitted to a selection engine one by one in sequence, and the engine decides whether each sample is to be labeled, discarding those that are not; because the stream does not allow unlabeled samples to be compared against one another, an evaluation index and a corresponding threshold must be set according to some principle, and a sample submitted to the selection engine is labeled only if its evaluation index exceeds the threshold. Pool-based selective sampling means that a pool of unlabeled samples is maintained and, according to some principle, the selection engine selects the samples to be labeled from the pool.
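The threshold test in stream-based selection can be sketched in a few lines. The function name, the score function, and the threshold below are illustrative stand-ins for the evaluation index described above, not the patent's implementation.

```python
# Minimal sketch of threshold-based stream selection: samples arrive one
# at a time, and a sample is forwarded for labeling only when its score
# exceeds the threshold; everything else is discarded.

def stream_select(stream, score, threshold):
    """Return the samples a stream-based engine would forward for labeling."""
    return [x for x in stream if score(x) > threshold]

# Samples arrive in order; only the two scoring above 0.8 are kept.
selected = stream_select([0.2, 0.9, 0.5, 0.95], lambda x: x, 0.8)
```

Pool-based selection differs only in that the engine sees all candidates at once and can rank them instead of deciding one by one.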
The detection classification model is trained in combination with the semi-supervised transductive support vector machine as follows:

a set of independent and identically distributed labeled samples is defined as

{(x_1, y_1), …, (x_l, y_l)} ∈ R^n × R, i = 1, …, l, y_i ∈ {−1, +1}

and the unlabeled samples as

{x_{l+1}, …, x_{l+u}}

The learning process of the semi-supervised transductive support vector machine is the process of solving the optimization problem

min over (y_{l+1}, …, y_{l+u}, w, b, ξ_1, …, ξ_l, ξ_{l+1}, …, ξ_{l+u}) of

(1/2)‖w‖² + C_1 Σ_{i=1}^{l} ξ_i + C_2 Σ_{j=l+1}^{l+u} ξ_j

s.t. y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, l
     y_j(w·x_j + b) ≥ 1 − ξ_j, ξ_j ≥ 0, y_j ∈ {−1, +1}, j = l+1, …, l+u

where C_1 and C_2 are set by the user to control the penalty for misclassified samples, C_2 is the influence factor of unlabeled data in the training process, and C_2 ξ_j is called the influence term of the j-th unlabeled sample in the objective function.
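As a quick numeric check, the objective can be evaluated directly for a one-dimensional toy problem. The following sketch is illustrative only: the function name, the 1-D decision function f(x) = w·x + b, and all sample values are assumptions, not data from the patent.

```python
# One-dimensional numeric sketch of the TSVM objective: hinge-style slack
# on labeled and unlabeled points, weighted by C1 and C2 respectively.

def tsvm_objective(w, b, labeled, guessed, c1, c2):
    """labeled: list of (x, y) with known labels; guessed: list of (x, y)
    with tentative labels for unlabeled points.  Returns
    0.5*||w||^2 + C1 * sum(labeled slack) + C2 * sum(unlabeled slack)."""
    def slack(x, y):
        return max(0.0, 1.0 - y * (w * x + b))
    return (0.5 * w * w
            + c1 * sum(slack(x, y) for x, y in labeled)
            + c2 * sum(slack(x, y) for x, y in guessed))

# Both labeled points sit outside the margin (zero slack); the guessed
# point at x = 0.5 contributes slack 0.5, weighted by C2 = 0.1.
val = tsvm_objective(1.0, 0.0, [(2.0, +1), (-2.0, -1)], [(0.5, +1)], 1.0, 0.1)
```

The small C_2 weight illustrates why unlabeled points initially perturb the objective only slightly.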
Further, the training process includes:

setting the parameters C_1 and C_2, training on the labeled samples in an inductive learning mode to obtain an initial classifier, and setting the estimated number N of positive samples among the unlabeled samples;

calculating the decision function values of all unlabeled samples with the initial classifier;

labeling the N unlabeled samples with the largest decision function values as positive samples, labeling the remaining unlabeled samples as negative samples, and setting C_temp as a temporary influence factor;

retraining the SVM model on all labeled samples and, for the newly generated classifier, exchanging the labels of pairs of samples according to the principle of reducing the objective function as much as possible, repeating the process until no sample pair satisfies the exchange condition;

uniformly increasing the value of C_temp; when C_temp ≥ C_2, the algorithm terminates and returns the labels of all unlabeled samples.
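Two steps of this loop can be sketched in outline. This is a hedged illustration, not the patent's exact procedure: `initial_labels` implements the "top-N decision values become positive" step, and `anneal_c_temp` stands in for the uniform increase of C_temp up to C_2 (the doubling factor is an arbitrary illustrative choice).

```python
# Outline sketch (assumed simplification) of the TSVM outer loop: the
# top-N unlabeled samples by decision value are labeled positive, the rest
# negative, and the influence factor C_temp grows until it reaches C2.

def initial_labels(decision_values, n_positive):
    """decision_values: list of (sample_id, f(x)).  The N samples with the
    largest decision values are labeled +1, the rest -1."""
    ranked = sorted(decision_values, key=lambda sv: sv[1], reverse=True)
    return {sid: (+1 if rank < n_positive else -1)
            for rank, (sid, _) in enumerate(ranked)}

def anneal_c_temp(c_temp, c2, factor=2.0):
    """Increase C_temp step by step; the outer loop terminates once
    C_temp >= C2."""
    steps = 0
    while c_temp < c2:
        c_temp *= factor
        steps += 1
    return c_temp, steps

labels = initial_labels([("a", 0.9), ("b", -0.2), ("c", 0.4)], n_positive=2)
c_final, n_steps = anneal_c_temp(1e-3, 1.0)
```

Between annealing steps, a real implementation would retrain the SVM and perform the pairwise label exchanges described above.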
Referring to fig. 3, to increase the training speed of the TSVM and the labeling accuracy in each iteration, this embodiment labels unlabeled samples with a multi-classifier collaborative voting mechanism. On the one hand, this reduces the time complexity of iterative training; on the other hand, because multiple classifiers determine a sample's class by voting, the labeling accuracy in each iteration improves. The steps are as follows:
dividing the whole sample set into a labeled sample set L and an unlabeled sample set U;

extracting, with the K-Means clustering algorithm, a certain number of samples from each cluster in a certain proportion to form n sub-sample sets, where n is an odd number greater than 1, which serve as training sets;

training n initial classifiers C_1, C_2, …, C_n on the n training sets;

predicting each unlabeled sample with the n initial classifiers and outputting f_1, f_2, …, f_n;

labeling the unlabeled samples and deciding whether to iterate further according to a termination condition.
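The voting step above can be sketched as follows; the function name and the example votes are assumptions made for this illustration only.

```python
# Illustrative sketch of majority voting across n classifiers: a sample is
# labeled only when a strict majority agrees (minority obeys majority).

def vote(predictions):
    """Return +1 or -1 when a strict majority of the classifiers agree,
    and None when there is no majority (sample stays unlabeled)."""
    pos = sum(1 for p in predictions if p > 0)
    neg = len(predictions) - pos
    if pos > len(predictions) // 2:
        return +1
    if neg > len(predictions) // 2:
        return -1
    return None

# With n = 5 classifiers (odd, as required above), four positive votes win.
label = vote([+1, +1, -1, +1, +1])
```

Requiring n odd guarantees that a strict majority always exists for binary labels, so no sample is left in a tie.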
Specifically, the clustering includes:

defining the labeled sample set as L = {x_1, x_2, …, x_l} and the number of clusters as K;

setting the iteration round r, with initial value 0;

setting the initial K cluster centers as c_1, …, c_K and defining the set of samples of the i-th class as S_i; for any sample x_j, j = 1, …, l, if x_j is nearest to cluster center c_i, adding the sample x_j to the i-th class;

recalculating the K cluster centers, specifically as

c_i = (1/|S_i|) Σ_{x∈S_i} x, i = 1, …, K

defining a clustering criterion function and calculating the clustering error as

E(r) = Σ_{i=1}^{K} Σ_{x∈S_i} ‖x − c_i‖²

judging whether the stop condition is reached: if |E(r−1) − E(r)| is less than a preset error value, the finally obtained clusters and cluster centers are S_1, …, S_K and c_1, …, c_K respectively; otherwise setting r to r + 1.
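For concreteness, the clustering loop can be sketched for one-dimensional data. This is a simplified illustration (1-D samples, absolute-value distance, assumed function names), not the patent's implementation.

```python
# Sketch of the K-Means loop for 1-D data: assign each sample to its
# nearest center, recompute centers as cluster means, and stop when the
# clustering error E changes by less than eps between rounds.

def kmeans_1d(samples, centers, eps=1e-6, max_rounds=100):
    prev_error = None
    for _ in range(max_rounds):
        clusters = [[] for _ in centers]
        for x in samples:                      # nearest-center assignment
            i = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
            clusters[i].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        error = sum((x - centers[i]) ** 2      # clustering criterion E
                    for i, c in enumerate(clusters) for x in c)
        if prev_error is not None and abs(prev_error - error) < eps:
            break                              # |E(r-1) - E(r)| < eps
        prev_error = error
    return clusters, centers

clusters, centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.4, 8.6], [0.0, 10.0])
```

With two well-separated groups, the loop converges in two rounds to centers near 1.0 and 9.0.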
Example 2
Different from the first embodiment, this embodiment provides a verification explanation of how the intrusion detection method based on active learning addresses sample labeling and iterative training, specifically as follows:
(1) Selecting the samples that need labeling.

For example: pairwise sample labeling in the TSVM achieves high labeling accuracy but very low labeling speed, while region-based labeling can label many samples at once and is fast, but its accuracy cannot be guaranteed. This embodiment therefore provides a voting-decision labeling method based on multiple classifiers: m samples in the boundary region on whose class more than half of the classifiers agree (minority obeys majority) are selected for labeling; if the m samples satisfy the maximum classification hyperplane they are labeled as the positive class, and if they satisfy the minimum classification hyperplane they are labeled as the negative class.
Defining the classification hyperplane as f(x) and the unlabeled sample set as U = {x_{l+1}, x_{l+2}, …, x_{l+u}}, the distance from a sample x_i to the classification hyperplane is expressed as

d(x_i) = |f(x_i)| / ‖w‖
in order to select the unlabeled samples most likely to be the support vectors for labeling and improve the labeling speed, in each iteration process, the unlabeled samples meeting the first p maximum values of the maximum classification hyperplane are selected and labeled as positive samples, and the unlabeled samples meeting the last q minimum values of the minimum classification hyperplane are selected and labeled as negative samples.
The values of p and q determine the number of samples labeled in one iteration, namely the learning speed of the direct-push learning, and when the values of p and q are 1, a pair labeling method is adopted; when the values of p and q are larger than 1, one iteration is expressed to label p or q samples in the optimal classification hyperplane boundary region, and the values of p and q can be optimized according to the actual application scene.
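The p/q labeling rule can be sketched as follows; the helper name and the decision values are illustrative assumptions.

```python
# Sketch of the p/q labeling rule: the p samples with the largest decision
# values become positive, the q samples with the smallest become negative.

def label_p_q(decision_values, p, q):
    """decision_values: dict mapping sample id -> f(x).  Returns the ids
    labeled positive and negative this iteration."""
    ranked = sorted(decision_values, key=decision_values.get, reverse=True)
    positives = ranked[:p]
    negatives = ranked[len(ranked) - q:] if q else []
    return positives, negatives

# With p = q = 2, "a" and "e" become positive, "d" and "b" negative.
pos, neg = label_p_q({"a": 2.1, "b": -1.7, "c": 0.3, "d": -0.2, "e": 1.1}, 2, 2)
```

Larger p and q label more samples per iteration at the risk of more labeling errors, which is exactly the speed/accuracy trade-off described above.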
(2) Adding the labeled samples to the corresponding classifier and continuing iterative training.

After samples are labeled, if the stopping condition of the algorithm has not been reached, they must be added to the training set for the next iteration. Note that a sample is not added to all training sets, but only to the training set corresponding to the classifier whose output class label is consistent with the sample's label.
For example: A. b, C, D, E the results of the five classifiers output are: the positive type, the positive type and the negative type, wherein the samples are added into the training set corresponding to the training classifier E obviously, the samples are not suitable, in addition, if the output results of the A classifier and the B classifier simultaneously meet the maximum classification hyperplane, the output result of the A is 0.05, and the output result of the B is 0.95, the probability that the A labels the samples correctly is very low relative to the B, according to the analysis, the fact that the labeled samples are added into the training set corresponding to the classifier with the labeling category which is the same as the output result and the maximum output value can be found, and the training difference and the sample labeling accuracy can be guaranteed.
(3) Determining the iteration termination condition.
In the iterative process of sample labeling and model training, if a sample's label in the current iteration (positive or negative) differs from its label in the previous iteration, the label must be reset, i.e., the sample must be labeled again.
The method specifically comprises the following steps:
treating the sample as an unlabeled sample again, deleting it from the corresponding training set, and entering the next iteration;

if no sample needs resetting in an iteration and no unlabeled sample satisfies the labeling condition, stopping the iteration;

after the iteration stops, merging the n training subsets and then training to obtain the final classifier.
The algorithm applied in this embodiment determines the class of a labeled sample through the collaborative voting mechanism of multiple classifiers, improving labeling accuracy, and labels samples in batches, improving labeling efficiency.
The specific steps of the algorithm are described as follows:
the algorithm is as follows: and a TSVM algorithm based on multi-classifier collaborative annotation.
Inputting: a set of labeled samples L; a label-free sample set U; the number of classifiers n.
And (3) outputting: the final classifier TSVM.
S1: the labeled sample set L is clustered with the K-Means algorithm, and samples are extracted from each cluster in a certain proportion to form n sub-training sets, recorded as L_1, L_2, …, L_n;

S2: the n training subsets are trained with the SVM algorithm to obtain n initial classifiers C_1, C_2, …, C_n;

S3: each unlabeled sample is input into the n initial classifiers C_1, C_2, …, C_n to obtain n output results f_1, …, f_n;

S4: for any unlabeled sample x_j, if the classification results of the n classifiers satisfy the maximum classification hyperplane, it is labeled as the positive class; if they satisfy the minimum classification hyperplane, it is labeled as the negative class;
S5: if the current label of x_j is inconsistent with its label in the previous round, the label is reset and the sample is deleted from the corresponding training set;

if the current label is consistent with the previous round but the assigned training set is not, the sample is added to L_j;

if the sample was not labeled in an earlier round, the classifier index j satisfying the assignment rule is determined and the sample is added to L_j; otherwise, the iteration stops and the process jumps to S8;
s6: repeatedly executing S4 and S5 until all unlabeled samples are labeled;
s7: after obtaining new training subsets, retraining these new sub-training sets and obtaining new classifiers: c1 new,C2 new,L,Cn new;
If there is a case that the previous round and the training set of this round are not changed, the corresponding training needs to continue using the classifier of the previous round, and then the process jumps to S3;
S8: merging the training subsets into the final training set, and retraining on this sample set to obtain the final classifier.
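The steps S1–S8 above can be sketched in miniature. The sketch below is an illustrative simplification, not the patented method itself: a nearest-centroid classifier stands in for the SVM of S2/S7, a stratified per-class draw stands in for the per-cluster extraction of S1, and unanimous agreement among the n classifiers stands in for the hyperplane conditions of S4. All function names are hypothetical.

```python
import random

def fit(samples):
    # Stand-in for the SVM training of S2/S7: store one centroid per class.
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    # Assign the class whose centroid is nearest.
    return min(model, key=lambda y: abs(x - model[y]))

def collaborative_label(labeled, unlabeled, n=3, rounds=10, seed=0):
    rng = random.Random(seed)
    # S1 (simplified): draw a fraction of each class to form n sub-training sets.
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append((x, y))
    subsets = [[s for grp in by_class.values()
                for s in rng.sample(grp, max(1, 2 * len(grp) // 3))]
               for _ in range(n)]
    for _ in range(rounds):                           # S3-S7 iteration
        models = [fit(sub) for sub in subsets]        # S2/S7: (re)train
        agreed = []
        for x in unlabeled:
            votes = [predict(m, x) for m in models]   # S3: n outputs
            if len(set(votes)) == 1:                  # S4 (simplified): unanimity
                agreed.append((x, votes[0]))
        for i, (x, y) in enumerate(agreed):           # S5: grow the subsets
            subsets[i % n].append((x, y))
        unlabeled = [x for x in unlabeled
                     if x not in {s for s, _ in agreed}]
        if not unlabeled:                             # S6: everything labeled
            break
    # S8: merge the subsets and train the final classifier.
    return fit([s for sub in subsets for s in sub])
```

On a toy one-dimensional set with two well-separated classes, the collaborative loop labels the whole pool in one round and the merged final classifier separates the classes correctly.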
It should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made without departing from the spirit and scope of the technical solutions of the present invention, and such modifications shall be covered by the claims of the present invention.
Claims (8)
1. An intrusion detection method based on active learning, characterized by comprising the steps of:
acquiring historical data from system logs and preprocessing it to obtain a labeled sample data set;
constructing a detection classification model based on an active learning strategy, and training the detection classification model in combination with a semi-supervised transductive support vector machine (TSVM) to form a detection multi-classifier;
and performing cluster analysis with the K-Means clustering algorithm, and outputting the detection result in combination with the trained detection classification model.
2. The active learning-based intrusion detection method according to claim 1, wherein the preprocessing comprises normalization;
the active learning strategies comprise membership queries, stream-based selective sampling, and pool-based selective sampling.
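Of the three query strategies named in claim 2, pool-based selective sampling fits intrusion detection most naturally, since network traffic is easily collected into an unlabeled pool. A minimal sketch, assuming uncertainty sampling by distance to the decision boundary; the function name and batch size are illustrative, not from the patent:

```python
def pool_based_query(decision_values, batch_size=2):
    # Rank the unlabeled pool by uncertainty: |f(x)| close to 0 means the
    # sample lies near the separating hyperplane, so its label is most
    # informative. Return the indices to hand to the human annotator.
    ranked = sorted(range(len(decision_values)),
                    key=lambda i: abs(decision_values[i]))
    return ranked[:batch_size]
```

Given decision values [2.0, -0.1, 0.05, -1.5], the two samples nearest the hyperplane (indices 2 and 1) are selected for labeling.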
3. The active learning-based intrusion detection method according to claim 1 or 2, wherein training the detection classification model in combination with the semi-supervised transductive support vector machine comprises:
defining a set of independently and identically distributed labeled samples as
{(x1, y1), …, (xl, yl)}, xi ∈ Rⁿ, yi ∈ {−1, +1}, i = 1, …, l;
the unlabeled samples are
{xl+1, …, xl+u};
the learning process of the semi-supervised transductive support vector machine is the process of solving the following optimization problem:
min over (yl+1, …, yl+u, w, b, ξ1, …, ξl, ξl+1, …, ξl+u) of
(1/2)||w||² + C1·(ξ1 + … + ξl) + C2·(ξl+1 + … + ξl+u),
subject to yi(w·xi + b) ≥ 1 − ξi and ξi ≥ 0 for i = 1, …, l+u,
wherein C1 and C2 are set by the user to control the penalty for misclassified samples, C2 being the influence factor of the unlabeled data in the training process; C2·ξj is called the influence term of the jth unlabeled sample in the objective function.
4. The active learning-based intrusion detection method according to claim 3, wherein the training process comprises:
setting the parameters C1 and C2, training on the labeled samples in an inductive learning mode to obtain an initial classifier, and setting the estimated number N of positive samples among the unlabeled samples;
calculating the decision function values of all unlabeled samples with the initial classifier;
labeling the N unlabeled samples with the largest decision function values as positive samples and the remaining unlabeled samples as negative samples, and setting Ctemp as a temporary influence factor.
5. The active learning-based intrusion detection method according to claim 4, further comprising:
retraining the SVM model on all labeled samples; for the newly generated classifier, exchanging the labels of pairs of samples according to the principle of reducing the objective function as far as possible, and repeating this process until no pair of samples satisfies the exchange condition;
uniformly increasing the value of Ctemp; when Ctemp ≥ C2, the algorithm terminates and returns the labels of all unlabeled samples.
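Claims 4 and 5 describe the classical TSVM training loop. The fragment below sketches two of its pieces under stated assumptions: the initial labeling of claim 4 is shown directly, while the pair-exchange test of claim 5 uses the Joachims-style condition (both slack variables positive and summing to more than 2), which the patent itself does not spell out. Function names are hypothetical.

```python
def assign_initial_labels(decision_values, num_positive):
    # Claim 4: mark the N unlabeled samples with the largest decision
    # function values as positive, the rest as negative.
    order = sorted(range(len(decision_values)),
                   key=lambda j: -decision_values[j])
    labels = [-1] * len(decision_values)
    for j in order[:num_positive]:
        labels[j] = 1
    return labels

def switchable_pairs(labels, slacks):
    # Claim 5's exchange rule, assuming the Joachims TSVM condition:
    # swapping a (positive, negative) pair lowers the objective when both
    # slack variables are positive and their sum exceeds 2.
    pairs = []
    for i, (yi, si) in enumerate(zip(labels, slacks)):
        for j, (yj, sj) in enumerate(zip(labels, slacks)):
            if yi == 1 and yj == -1 and si > 0 and sj > 0 and si + sj > 2:
                pairs.append((i, j))
    return pairs
```

In a full TSVM these two steps alternate with SVM retraining while Ctemp is raised toward C2, which is the outer loop claims 4 and 5 describe.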
6. The active learning-based intrusion detection method according to claim 5, wherein the cluster analysis comprises:
extracting a certain number of samples from each cluster in a certain proportion with the K-Means clustering algorithm to form n sub-sample sets, wherein n is an odd number greater than 1, and taking the n sub-sample sets as training sets;
training on the n training sets to obtain n initial classifiers C1, C2, …, Cn;
predicting each unlabeled sample with the n initial classifiers and outputting f1, f2, …, fn;
labeling the unlabeled samples, and determining whether to iterate further according to a termination condition.
7. The active learning-based intrusion detection method according to claim 6, wherein the clustering comprises:
defining the labeled sample set as L = {x1, x2, …, xl} and the number of clusters as K;
the iteration round is r, and the initial value is 0;
defining the set corresponding to the ith cluster as Ci(r); for any sample xj, j = 1, …, l, if the distance from xj to the cluster center ci(r) is the shortest, adding xj to the ith cluster;
recalculating the K cluster centers as the mean of each cluster, specifically ci(r+1) = (1/|Ci(r)|) · Σ over xj ∈ Ci(r) of xj.
8. The active learning-based intrusion detection method according to claim 7, further comprising:
defining a clustering criterion function and calculating the clustering error, specifically E = Σ over i = 1, …, K of Σ over xj ∈ Ci of ||xj − ci||²;
judging whether the stop condition is reached.
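Claims 7 and 8 together describe one round of K-Means: nearest-center assignment, center recomputation, and a stop test on the criterion function. A minimal one-dimensional sketch; initializing the centers from the first k points is a simplification, and the error formula is the standard sum of squared distances assumed for claim 8:

```python
def kmeans(points, k, rounds=20):
    # Minimal 1-D K-Means: assign each sample to the nearest center
    # (claim 7), recompute centers as cluster means, and stop when the
    # centers no longer change (claim 8's stop condition).
    centers = points[:k]                      # simple initialization
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for x in points:                      # nearest-center assignment
            i = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            clusters[i].append(x)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:            # stop condition reached
            break
        centers = new_centers
    # Clustering error: sum of squared distances to the assigned center.
    error = sum(min((x - c) ** 2 for c in centers) for x in points)
    return centers, error
```

On the points [0.0, 1.0, 10.0, 11.0] with K = 2, the loop converges to centers 0.5 and 10.5 with a clustering error of 1.0.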
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110695864.XA | 2021-06-23 | 2021-06-23 | Intrusion detection method based on active learning
Publications (1)
Publication Number | Publication Date
---|---
CN113378955A | 2021-09-10
Family
ID=77578692
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202110695864.XA (Pending) | Intrusion detection method based on active learning | 2021-06-23 | 2021-06-23
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113378955A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN102176701A (en) * | 2011-02-18 | 2011-09-07 | Harbin Institute of Technology | Active learning based network data anomaly detection method
CN104318242A (en) * | 2014-10-08 | 2015-01-28 | Air Force Engineering University of PLA | High-efficiency SVM active semi-supervised learning algorithm
CN111191238A (en) * | 2019-12-30 | 2020-05-22 | Xiamen Fuyun Information Technology Co., Ltd. | Webshell detection method, terminal device and storage medium
CN112115467A (en) * | 2020-09-04 | 2020-12-22 | Changsha University of Science and Technology | Intrusion detection method based on semi-supervised classification of ensemble learning
Non-Patent Citations (4)
Title |
---|
Du Hongle et al., "A transductive support vector machine algorithm with collaborative annotation", Journal of Chinese Computer Systems * |
Du Hongle et al., "A TSVM algorithm based on clustering and collaborative annotation", Henan Science * |
Wang Limei et al., "A transductive support vector machine learning algorithm based on k-means clustering", Computer Engineering and Applications * |
Zhao Jianhua et al., "A network intrusion detection algorithm combining active learning and semi-supervised learning", Journal of Xihua University (Natural Science Edition) * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115880268A (en) * | 2022-12-28 | 2023-03-31 | Nanjing University of Aeronautics and Astronautics | Method, system, equipment and medium for detecting defective products in plastic hose production |
CN115880268B (en) * | 2022-12-28 | 2024-01-30 | Nanjing University of Aeronautics and Astronautics | Method, system, equipment and medium for detecting defective products in plastic hose production |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210910 |