CN113378955A - Intrusion detection method based on active learning - Google Patents
- Publication number
- CN113378955A (application CN202110695864.XA)
- Authority
- CN
- China
- Prior art keywords
- samples
- sample
- training
- active learning
- detection method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F18/00—Pattern recognition; G06F18/20—Analysing
  - G06F18/23213—Non-hierarchical clustering techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
  - G06F18/214—Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
  - G06F18/2411—Classification techniques relating to the classification model, based on the proximity to a decision surface, e.g. support vector machines
Abstract
The invention discloses an intrusion detection method based on active learning, which includes collecting historical data from system logs and preprocessing it to obtain a labeled sample data set; constructing a detection classification model based on an active learning strategy, and training the detection classification model in combination with a semi-supervised transductive support vector machine to form a detection multi-classifier; and performing cluster analysis with the K-Means clustering algorithm, outputting detection results in combination with the trained detection classification model. The proposed algorithm does not rely on the classification result of a single classifier to determine sample labels; instead, it trains multiple classifiers and determines labels from their voting results, which markedly improves labeling accuracy.
Description
Technical Field
The invention relates to the technical field of classification detection, in particular to an intrusion detection method based on active learning.
Background
A transductive support vector machine (TSVM) is a maximum-margin classification method based on the low-density separation assumption. Much like a traditional support vector machine, it takes the classification hyperplane with the largest margin as the optimal classification hyperplane, but it trains the classification model on both unlabeled and labeled data.
Traditional machine learning trains on a given labeled sample set and induces a learning model, an approach called inductive learning. In practical applications, however, labeled samples are very limited, and labeling a large number of unlabeled samples is time-consuming, labor-intensive, and tedious. To reduce labeling cost and keep the training sample set as small as possible, active learning was proposed to address the shortage of labeled samples and to optimize the classification model. In active learning, the learner actively selects the unlabeled samples most beneficial to improving the classifier (i.e., the samples carrying the most information) and submits them to a user or domain expert for labeling; the newly labeled samples are then added to the training set and participate in the next round of training. Higher classification accuracy can thus be obtained with a smaller training set, reducing both the cost of labeling samples and the cost of training a high-performance classifier.
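The query/label/retrain cycle just described can be sketched in a few lines. This is a minimal illustration only: the one-dimensional threshold classifier, the oracle, and all sample values are toy assumptions, not part of the patent.

```python
# Minimal sketch of one pool-based active-learning round with uncertainty
# sampling.  The 1-D threshold classifier, the oracle, and the values are
# toy assumptions made only to show the query/label/retrain cycle.

def decision_value(x, threshold):
    """Toy linear classifier: signed distance of x from the threshold."""
    return x - threshold

def most_informative(pool, threshold):
    """The sample closest to the decision boundary carries the most
    information for the learner."""
    return min(pool, key=lambda x: abs(decision_value(x, threshold)))

def active_learning_round(labeled, pool, oracle, threshold):
    """Query one sample, let the 'domain expert' label it, and move it to
    the labeled set; a real system would then retrain the classifier."""
    query = most_informative(pool, threshold)
    labeled.append((query, oracle(query)))
    pool.remove(query)
    return query

labeled = [(0.0, -1), (10.0, +1)]
pool = [1.0, 4.9, 9.0]  # 4.9 lies nearest the boundary at 5.0
picked = active_learning_round(labeled, pool,
                               lambda x: +1 if x > 5.0 else -1, 5.0)
```

A real system would repeat this round until a labeling budget is exhausted or the classifier stops improving.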
On the one hand, every class of algorithm has strengths and weaknesses: an algorithm may detect one attack type well but perform poorly on other attack types. On the other hand, much research has focused on improving overall detection accuracy while performing poorly on small-sample classes (attack samples). In practice, however, since attack samples are extremely unbalanced relative to normal samples, attention should be paid to the intrusion detection classifier's ability to detect attack samples.
Two modeling problems for intrusion detection with small samples therefore arise. Case one: how to model when normal samples far outnumber attack samples. Case two: how to model when labeled samples are very scarce while unlabeled samples are abundant.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above-mentioned conventional problems.
Therefore, the invention provides an intrusion detection method based on active learning, which can address the accuracy problem of current intrusion detection classification.
In order to solve the above technical problems, the invention provides the following technical scheme: acquiring historical data from system logs and preprocessing it to obtain a labeled sample data set; constructing a detection classification model based on an active learning strategy, and training the detection classification model in combination with a semi-supervised transductive support vector machine to form a detection multi-classifier; and performing cluster analysis with the K-Means clustering algorithm, outputting detection results in combination with the trained detection classification model.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the preprocessing includes normalization; the active learning strategies include membership query, stream-based selective sampling, and pool-based selective sampling.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the detection classification model is trained in combination with the semi-supervised transductive support vector machine, wherein a group of independent and identically distributed labeled samples is defined as

{(x_1, y_1), …, (x_l, y_l)} ∈ R^n × R, i = 1, …, l, y_i ∈ {−1, +1}

the unlabeled samples as

{x_{l+1}, …, x_{l+u}}

and the learning process of the semi-supervised transductive support vector machine as the process of solving the optimization problem

min over (y_{l+1}, …, y_{l+u}, w, b, ξ_1, …, ξ_l, ξ_{l+1}, …, ξ_{l+u}) of

(1/2)‖w‖² + C_1 Σ_{i=1}^{l} ξ_i + C_2 Σ_{j=l+1}^{l+u} ξ_j

s.t. y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, l
     y_j(w·x_j + b) ≥ 1 − ξ_j, ξ_j ≥ 0, y_j ∈ {−1, +1}, j = l+1, …, l+u

where C_1 and C_2 are set by the user to control the penalty for misclassified samples, C_2 is the influence factor of unlabeled data in the training process, and C_2 ξ_j is called the influence term of the j-th unlabeled sample in the objective function.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the training process includes setting the parameters C_1 and C_2, training on the labeled samples in an inductive learning mode to obtain an initial classifier, and setting the estimated number N of positive samples among the unlabeled samples; calculating the decision function values of all unlabeled samples with the initial classifier; labeling the N unlabeled samples with the largest decision function values as positive samples, labeling the remaining unlabeled samples as negative samples, and setting C_temp as a temporary influence factor.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the SVM model is retrained on all labeled samples; for the newly generated classifier, the labels of pairs of samples are exchanged according to the principle of reducing the objective function as much as possible, and this process is repeated until no sample pair satisfies the exchange condition; the value of C_temp is uniformly increased, and when C_temp ≥ C_2 the algorithm terminates and returns the labels of all unlabeled samples.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the cluster analysis includes extracting, with the K-Means clustering algorithm, a certain number of samples from each cluster of the labeled sample set in a certain proportion to form n sub-sample sets, where n is an odd number greater than 1, which serve as training sets; training n initial classifiers C_1, C_2, …, C_n on the n training sets; predicting each unlabeled sample with the n initial classifiers and outputting f_1, f_2, …, f_n; and labeling the unlabeled samples and deciding whether to iterate further according to a termination condition.
As a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: the clustering includes defining the labeled sample set as L = {x_1, x_2, …, x_l} with the number of clusters K; the iteration round is r, with initial value 0; the initial K cluster centers are set as c_1, …, c_K and the set of samples of the i-th class is defined as S_i; for any sample x_j, j = 1, …, l, if x_j is nearest to cluster center c_i, the sample x_j is added to the i-th class;

the K cluster centers are then recalculated, specifically as

c_i = (1/|S_i|) Σ_{x∈S_i} x, i = 1, …, K
as a preferred embodiment of the intrusion detection method based on active learning according to the present invention, wherein: also comprises the following steps of (1) preparing,
a clustering criterion function is defined and a clustering error is calculated, as follows,
judging whether a stop condition is reached;
if the value of | E (t-1) -E (t) | is less than the preset error value, the finally obtained cluster and cluster center are respectively:otherwise, r is set to r + 1.
The beneficial effects of the invention are as follows: (1) the proposed algorithm does not rely on the classification result of a single classifier to determine sample labels, but trains multiple classifiers and determines labels from their voting results, which markedly improves labeling accuracy; (2) when training multiple classifiers, the training set must be divided into several small sub-training sets, and how this division is made determines the performance of the trained classifiers; an improved clustering algorithm is adopted that fully considers the geometric and spatial distribution characteristics of the labeled samples, clusters the training set, and extracts samples in a certain proportion from the clustering result to construct new training sets and train the classifiers, effectively improving classifier performance; (3) in each iteration, because each training set is small, the training time overhead of each classifier is relatively low; (4) the final classifier is trained on the originally labeled samples together with the samples labeled during the process, which speeds up training while also improving classifier performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort. Wherein:
fig. 1 is a schematic flowchart of an intrusion detection method based on active learning according to an embodiment of the present invention;
fig. 2 is a schematic diagram of three scenarios of an active learning strategy of an intrusion detection method based on active learning according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a multi-classifier voting strategy labeling idea framework of an intrusion detection method based on active learning according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not enlarged partially in general scale for convenience of illustration, and the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Example 1
Referring to fig. 1 to 3, a first embodiment of the present invention provides an intrusion detection method based on active learning, which specifically includes:
s1: and acquiring historical data by using the system log and preprocessing the historical data to obtain a tag sample data set.
S2: and constructing a detection classification model based on an active learning strategy, and training the detection classification model by combining a semi-supervised direct-pushing support vector machine to form a detection multi-classifier.
S3: and performing clustering analysis by using a K-Means clustering algorithm, and outputting a detection result by combining the trained detection classification model.
The preprocessing includes: normalization.
The active learning strategies include: membership query, stream-based selective sampling, and pool-based selective sampling.
Referring to fig. 2, membership query means that the learner constructs queries itself; a constructed sample may not exist in the original sample set, and its attribute values are chosen by the learner, the main objective being to construct the queries most useful for improving the learner's performance. Stream-based selective sampling means that unlabeled samples are submitted to a selection engine one by one in sequence, and the engine decides whether each sample is to be labeled, discarding those that are not; because the stream does not allow unlabeled samples to be compared against one another, an evaluation index and a corresponding threshold must be set according to some principle, and a sample submitted to the selection engine is labeled only if its evaluation index exceeds the threshold. Pool-based selective sampling means that a pool of unlabeled samples is maintained and, according to some principle, the selection engine selects the samples to be labeled from the pool.
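The threshold test in stream-based selection can be sketched in a few lines. The function name, the score function, and the threshold below are illustrative stand-ins for the evaluation index described above, not the patent's implementation.

```python
# Minimal sketch of threshold-based stream selection: samples arrive one
# at a time, and a sample is forwarded for labeling only when its score
# exceeds the threshold; everything else is discarded.

def stream_select(stream, score, threshold):
    """Return the samples a stream-based engine would forward for labeling."""
    return [x for x in stream if score(x) > threshold]

# Samples arrive in order; only the two scoring above 0.8 are kept.
selected = stream_select([0.2, 0.9, 0.5, 0.95], lambda x: x, 0.8)
```

Pool-based selection differs only in that the engine sees all candidates at once and can rank them instead of deciding one by one.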
The detection classification model is trained in combination with the semi-supervised transductive support vector machine as follows:

a set of independent and identically distributed labeled samples is defined as

{(x_1, y_1), …, (x_l, y_l)} ∈ R^n × R, i = 1, …, l, y_i ∈ {−1, +1}

and the unlabeled samples as

{x_{l+1}, …, x_{l+u}}

The learning process of the semi-supervised transductive support vector machine is the process of solving the optimization problem

min over (y_{l+1}, …, y_{l+u}, w, b, ξ_1, …, ξ_l, ξ_{l+1}, …, ξ_{l+u}) of

(1/2)‖w‖² + C_1 Σ_{i=1}^{l} ξ_i + C_2 Σ_{j=l+1}^{l+u} ξ_j

s.t. y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, l
     y_j(w·x_j + b) ≥ 1 − ξ_j, ξ_j ≥ 0, y_j ∈ {−1, +1}, j = l+1, …, l+u

where C_1 and C_2 are set by the user to control the penalty for misclassified samples, C_2 is the influence factor of unlabeled data in the training process, and C_2 ξ_j is called the influence term of the j-th unlabeled sample in the objective function.
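As a quick numeric check, the objective can be evaluated directly for a one-dimensional toy problem. The following sketch is illustrative only: the function name, the 1-D decision function f(x) = w·x + b, and all sample values are assumptions, not data from the patent.

```python
# One-dimensional numeric sketch of the TSVM objective: hinge-style slack
# on labeled and unlabeled points, weighted by C1 and C2 respectively.

def tsvm_objective(w, b, labeled, guessed, c1, c2):
    """labeled: list of (x, y) with known labels; guessed: list of (x, y)
    with tentative labels for unlabeled points.  Returns
    0.5*||w||^2 + C1 * sum(labeled slack) + C2 * sum(unlabeled slack)."""
    def slack(x, y):
        return max(0.0, 1.0 - y * (w * x + b))
    return (0.5 * w * w
            + c1 * sum(slack(x, y) for x, y in labeled)
            + c2 * sum(slack(x, y) for x, y in guessed))

# Both labeled points sit outside the margin (zero slack); the guessed
# point at x = 0.5 contributes slack 0.5, weighted by C2 = 0.1.
val = tsvm_objective(1.0, 0.0, [(2.0, +1), (-2.0, -1)], [(0.5, +1)], 1.0, 0.1)
```

The small C_2 weight illustrates why unlabeled points initially perturb the objective only slightly.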
Further, the training process includes:

setting the parameters C_1 and C_2, training on the labeled samples in an inductive learning mode to obtain an initial classifier, and setting the estimated number N of positive samples among the unlabeled samples;

calculating the decision function values of all unlabeled samples with the initial classifier;

labeling the N unlabeled samples with the largest decision function values as positive samples, labeling the remaining unlabeled samples as negative samples, and setting C_temp as a temporary influence factor;

retraining the SVM model on all labeled samples and, for the newly generated classifier, exchanging the labels of pairs of samples according to the principle of reducing the objective function as much as possible, repeating the process until no sample pair satisfies the exchange condition;

uniformly increasing the value of C_temp; when C_temp ≥ C_2, the algorithm terminates and returns the labels of all unlabeled samples.
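Two steps of this loop can be sketched in outline. This is a hedged illustration, not the patent's exact procedure: `initial_labels` implements the "top-N decision values become positive" step, and `anneal_c_temp` stands in for the uniform increase of C_temp up to C_2 (the doubling factor is an arbitrary illustrative choice).

```python
# Outline sketch (assumed simplification) of the TSVM outer loop: the
# top-N unlabeled samples by decision value are labeled positive, the rest
# negative, and the influence factor C_temp grows until it reaches C2.

def initial_labels(decision_values, n_positive):
    """decision_values: list of (sample_id, f(x)).  The N samples with the
    largest decision values are labeled +1, the rest -1."""
    ranked = sorted(decision_values, key=lambda sv: sv[1], reverse=True)
    return {sid: (+1 if rank < n_positive else -1)
            for rank, (sid, _) in enumerate(ranked)}

def anneal_c_temp(c_temp, c2, factor=2.0):
    """Increase C_temp step by step; the outer loop terminates once
    C_temp >= C2."""
    steps = 0
    while c_temp < c2:
        c_temp *= factor
        steps += 1
    return c_temp, steps

labels = initial_labels([("a", 0.9), ("b", -0.2), ("c", 0.4)], n_positive=2)
c_final, n_steps = anneal_c_temp(1e-3, 1.0)
```

Between annealing steps, a real implementation would retrain the SVM and perform the pairwise label exchanges described above.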
Referring to fig. 3, to increase the training speed of the TSVM and the labeling accuracy in each iteration, this embodiment labels unlabeled samples with a multi-classifier collaborative voting mechanism. On the one hand, this reduces the time complexity of iterative training; on the other hand, because multiple classifiers determine a sample's class by voting, the labeling accuracy in each iteration improves. The steps are as follows:
dividing the whole sample set into a labeled sample set L and an unlabeled sample set U;

extracting, with the K-Means clustering algorithm, a certain number of samples from each cluster in a certain proportion to form n sub-sample sets, where n is an odd number greater than 1, which serve as training sets;

training n initial classifiers C_1, C_2, …, C_n on the n training sets;

predicting each unlabeled sample with the n initial classifiers and outputting f_1, f_2, …, f_n;

labeling the unlabeled samples and deciding whether to iterate further according to a termination condition.
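The voting step above can be sketched as follows; the function name and the example votes are assumptions made for this illustration only.

```python
# Illustrative sketch of majority voting across n classifiers: a sample is
# labeled only when a strict majority agrees (minority obeys majority).

def vote(predictions):
    """Return +1 or -1 when a strict majority of the classifiers agree,
    and None when there is no majority (sample stays unlabeled)."""
    pos = sum(1 for p in predictions if p > 0)
    neg = len(predictions) - pos
    if pos > len(predictions) // 2:
        return +1
    if neg > len(predictions) // 2:
        return -1
    return None

# With n = 5 classifiers (odd, as required above), four positive votes win.
label = vote([+1, +1, -1, +1, +1])
```

Requiring n odd guarantees that a strict majority always exists for binary labels, so no sample is left in a tie.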
Specifically, the clustering includes:

defining the labeled sample set as L = {x_1, x_2, …, x_l} and the number of clusters as K;

setting the iteration round r, with initial value 0;

setting the initial K cluster centers as c_1, …, c_K and defining the set of samples of the i-th class as S_i; for any sample x_j, j = 1, …, l, if x_j is nearest to cluster center c_i, adding the sample x_j to the i-th class;

recalculating the K cluster centers, specifically as

c_i = (1/|S_i|) Σ_{x∈S_i} x, i = 1, …, K

defining a clustering criterion function and calculating the clustering error as

E(r) = Σ_{i=1}^{K} Σ_{x∈S_i} ‖x − c_i‖²

judging whether the stop condition is reached: if |E(r−1) − E(r)| is less than a preset error value, the finally obtained clusters and cluster centers are S_1, …, S_K and c_1, …, c_K respectively; otherwise setting r to r + 1.
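For concreteness, the clustering loop can be sketched for one-dimensional data. This is a simplified illustration (1-D samples, absolute-value distance, assumed function names), not the patent's implementation.

```python
# Sketch of the K-Means loop for 1-D data: assign each sample to its
# nearest center, recompute centers as cluster means, and stop when the
# clustering error E changes by less than eps between rounds.

def kmeans_1d(samples, centers, eps=1e-6, max_rounds=100):
    prev_error = None
    for _ in range(max_rounds):
        clusters = [[] for _ in centers]
        for x in samples:                      # nearest-center assignment
            i = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
            clusters[i].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        error = sum((x - centers[i]) ** 2      # clustering criterion E
                    for i, c in enumerate(clusters) for x in c)
        if prev_error is not None and abs(prev_error - error) < eps:
            break                              # |E(r-1) - E(r)| < eps
        prev_error = error
    return clusters, centers

clusters, centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.4, 8.6], [0.0, 10.0])
```

With two well-separated groups, the loop converges in two rounds to centers near 1.0 and 9.0.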
Example 2
Different from the first embodiment, this embodiment provides a verification explanation of how the intrusion detection method based on active learning addresses sample labeling and iterative training, specifically as follows:
(1) Selecting the samples that need labeling.

For example: pairwise sample labeling in the TSVM achieves high labeling accuracy but very low labeling speed, while region-based labeling can label many samples at once and is fast, but its accuracy cannot be guaranteed. This embodiment therefore provides a voting-decision labeling method based on multiple classifiers: m samples in the boundary region on whose class more than half of the classifiers agree (minority obeys majority) are selected for labeling; if the m samples satisfy the maximum classification hyperplane they are labeled as the positive class, and if they satisfy the minimum classification hyperplane they are labeled as the negative class.
Defining the classification hyperplane as f(x) and the unlabeled sample set as U = {x_{l+1}, x_{l+2}, …, x_{l+u}}, the distance from a sample x_i to the classification hyperplane is expressed as

d(x_i) = |f(x_i)| / ‖w‖
in order to select the unlabeled samples most likely to be the support vectors for labeling and improve the labeling speed, in each iteration process, the unlabeled samples meeting the first p maximum values of the maximum classification hyperplane are selected and labeled as positive samples, and the unlabeled samples meeting the last q minimum values of the minimum classification hyperplane are selected and labeled as negative samples.
The values of p and q determine the number of samples labeled in one iteration, namely the learning speed of the direct-push learning, and when the values of p and q are 1, a pair labeling method is adopted; when the values of p and q are larger than 1, one iteration is expressed to label p or q samples in the optimal classification hyperplane boundary region, and the values of p and q can be optimized according to the actual application scene.
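The p/q labeling rule can be sketched as follows; the helper name and the decision values are illustrative assumptions.

```python
# Sketch of the p/q labeling rule: the p samples with the largest decision
# values become positive, the q samples with the smallest become negative.

def label_p_q(decision_values, p, q):
    """decision_values: dict mapping sample id -> f(x).  Returns the ids
    labeled positive and negative this iteration."""
    ranked = sorted(decision_values, key=decision_values.get, reverse=True)
    positives = ranked[:p]
    negatives = ranked[len(ranked) - q:] if q else []
    return positives, negatives

# With p = q = 2, "a" and "e" become positive, "d" and "b" negative.
pos, neg = label_p_q({"a": 2.1, "b": -1.7, "c": 0.3, "d": -0.2, "e": 1.1}, 2, 2)
```

Larger p and q label more samples per iteration at the risk of more labeling errors, which is exactly the speed/accuracy trade-off described above.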
(2) Adding the labeled samples to the corresponding classifier and continuing iterative training.

After samples are labeled, if the stopping condition of the algorithm has not been reached, they must be added to the training set for the next iteration. Note that a sample is not added to all training sets, but only to the training set corresponding to the classifier whose output class label is consistent with the sample's label.
For example: A. b, C, D, E the results of the five classifiers output are: the positive type, the positive type and the negative type, wherein the samples are added into the training set corresponding to the training classifier E obviously, the samples are not suitable, in addition, if the output results of the A classifier and the B classifier simultaneously meet the maximum classification hyperplane, the output result of the A is 0.05, and the output result of the B is 0.95, the probability that the A labels the samples correctly is very low relative to the B, according to the analysis, the fact that the labeled samples are added into the training set corresponding to the classifier with the labeling category which is the same as the output result and the maximum output value can be found, and the training difference and the sample labeling accuracy can be guaranteed.
(3) Determining the iteration termination condition.
In the iterative process of sample labeling and model training, if a sample's label in the current iteration (positive or negative) differs from its label in the previous iteration, the label must be reset, i.e., the sample must be labeled again.
The method specifically comprises the following steps:
treating the sample as an unlabeled sample again, deleting it from the corresponding training set, and entering the next iteration;

if no sample needs resetting in an iteration and no unlabeled sample satisfies the labeling condition, stopping the iteration;

after the iteration stops, merging the n training subsets and then training to obtain the final classifier.
The algorithm applied in this embodiment determines the class of a labeled sample through the collaborative voting mechanism of multiple classifiers, improving labeling accuracy, and labels samples in batches, improving labeling efficiency.
The specific steps of the algorithm are described as follows:
the algorithm is as follows: and a TSVM algorithm based on multi-classifier collaborative annotation.
Inputting: a set of labeled samples L; a label-free sample set U; the number of classifiers n.
And (3) outputting: the final classifier TSVM.
S1: the labeled sample set L is clustered with the K-Means algorithm, and samples are extracted from each cluster in a certain proportion to form n sub-training sets, recorded as L_1, L_2, …, L_n;

S2: the n training subsets are trained with the SVM algorithm to obtain n initial classifiers C_1, C_2, …, C_n;

S3: each unlabeled sample is input into the n initial classifiers C_1, C_2, …, C_n to obtain n output results f_1, …, f_n;

S4: for any unlabeled sample x_j, if the classification results of the n classifiers satisfy the maximum classification hyperplane, it is labeled as the positive class; if they satisfy the minimum classification hyperplane, it is labeled as the negative class;
S5: if the current label of x_j is inconsistent with its label in the previous round, the label is reset and the sample is deleted from the corresponding training set;

if the current label is consistent with the previous round but the assigned training set is not, the sample is added to L_j;

if the sample was not labeled in an earlier round, the classifier index j satisfying the assignment rule is determined and the sample is added to L_j; otherwise, the iteration stops and the process jumps to S8;
s6: repeatedly executing S4 and S5 until all unlabeled samples are labeled;
s7: after obtaining new training subsets, retraining these new sub-training sets and obtaining new classifiers: c1 new,C2 new,L,Cn new;
If there is a case that the previous round and the training set of this round are not changed, the corresponding training needs to continue using the classifier of the previous round, and then the process jumps to S3;
S8: merging the training subsets into the final training set, and retraining on this sample set to obtain the final classifier.
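The steps S1–S8 above can be sketched in miniature. The sketch below is an illustrative simplification, not the patented method itself: a nearest-centroid classifier stands in for the SVM of S2/S7, a stratified per-class draw stands in for the per-cluster extraction of S1, and unanimous agreement among the n classifiers stands in for the hyperplane conditions of S4. All function names are hypothetical.

```python
import random

def fit(samples):
    # Stand-in for the SVM training of S2/S7: store one centroid per class.
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    # Assign the class whose centroid is nearest.
    return min(model, key=lambda y: abs(x - model[y]))

def collaborative_label(labeled, unlabeled, n=3, rounds=10, seed=0):
    rng = random.Random(seed)
    # S1 (simplified): draw a fraction of each class to form n sub-training sets.
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append((x, y))
    subsets = [[s for grp in by_class.values()
                for s in rng.sample(grp, max(1, 2 * len(grp) // 3))]
               for _ in range(n)]
    for _ in range(rounds):                           # S3-S7 iteration
        models = [fit(sub) for sub in subsets]        # S2/S7: (re)train
        agreed = []
        for x in unlabeled:
            votes = [predict(m, x) for m in models]   # S3: n outputs
            if len(set(votes)) == 1:                  # S4 (simplified): unanimity
                agreed.append((x, votes[0]))
        for i, (x, y) in enumerate(agreed):           # S5: grow the subsets
            subsets[i % n].append((x, y))
        unlabeled = [x for x in unlabeled
                     if x not in {s for s, _ in agreed}]
        if not unlabeled:                             # S6: everything labeled
            break
    # S8: merge the subsets and train the final classifier.
    return fit([s for sub in subsets for s in sub])
```

On a toy one-dimensional set with two well-separated classes, the collaborative loop labels the whole pool in one round and the merged final classifier separates the classes correctly.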
It should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made without departing from the spirit and scope of the technical solutions of the present invention, and such modifications shall be covered by the claims of the present invention.
Claims (8)
1. An intrusion detection method based on active learning, characterized by comprising the steps of:
acquiring historical data from system logs and preprocessing it to obtain a labeled sample data set;
constructing a detection classification model based on an active learning strategy, and training the detection classification model in combination with a semi-supervised transductive support vector machine (TSVM) to form a detection multi-classifier;
and performing cluster analysis with the K-Means clustering algorithm, and outputting the detection result in combination with the trained detection classification model.
2. The active learning-based intrusion detection method according to claim 1, wherein the preprocessing comprises normalization;
the active learning strategies comprise membership queries, stream-based selective sampling, and pool-based selective sampling.
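Of the three query strategies named in claim 2, pool-based selective sampling fits intrusion detection most naturally, since network traffic is easily collected into an unlabeled pool. A minimal sketch, assuming uncertainty sampling by distance to the decision boundary; the function name and batch size are illustrative, not from the patent:

```python
def pool_based_query(decision_values, batch_size=2):
    # Rank the unlabeled pool by uncertainty: |f(x)| close to 0 means the
    # sample lies near the separating hyperplane, so its label is most
    # informative. Return the indices to hand to the human annotator.
    ranked = sorted(range(len(decision_values)),
                    key=lambda i: abs(decision_values[i]))
    return ranked[:batch_size]
```

Given decision values [2.0, -0.1, 0.05, -1.5], the two samples nearest the hyperplane (indices 2 and 1) are selected for labeling.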
3. The active learning-based intrusion detection method according to claim 1 or 2, wherein training the detection classification model in combination with the semi-supervised transductive support vector machine comprises:
defining a set of independently and identically distributed labeled samples as
{(x1, y1), …, (xl, yl)}, xi ∈ Rⁿ, yi ∈ {−1, +1}, i = 1, …, l;
the unlabeled samples are
{xl+1, …, xl+u};
the learning process of the semi-supervised transductive support vector machine is the process of solving the following optimization problem:
min over (yl+1, …, yl+u, w, b, ξ1, …, ξl, ξl+1, …, ξl+u) of
(1/2)||w||² + C1·(ξ1 + … + ξl) + C2·(ξl+1 + … + ξl+u),
subject to yi(w·xi + b) ≥ 1 − ξi and ξi ≥ 0 for i = 1, …, l+u,
wherein C1 and C2 are set by the user to control the penalty for misclassified samples, C2 being the influence factor of the unlabeled data in the training process; C2·ξj is called the influence term of the jth unlabeled sample in the objective function.
4. The active learning-based intrusion detection method according to claim 3, wherein the training process comprises:
setting the parameters C1 and C2, training on the labeled samples in an inductive learning mode to obtain an initial classifier, and setting the estimated number N of positive samples among the unlabeled samples;
calculating the decision function values of all unlabeled samples with the initial classifier;
labeling the N unlabeled samples with the largest decision function values as positive samples and the remaining unlabeled samples as negative samples, and setting Ctemp as a temporary influence factor.
5. The active learning-based intrusion detection method according to claim 4, further comprising:
retraining the SVM model on all labeled samples; for the newly generated classifier, exchanging the labels of pairs of samples according to the principle of reducing the objective function as far as possible, and repeating this process until no pair of samples satisfies the exchange condition;
uniformly increasing the value of Ctemp; when Ctemp ≥ C2, the algorithm terminates and returns the labels of all unlabeled samples.
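Claims 4 and 5 describe the classical TSVM training loop. The fragment below sketches two of its pieces under stated assumptions: the initial labeling of claim 4 is shown directly, while the pair-exchange test of claim 5 uses the Joachims-style condition (both slack variables positive and summing to more than 2), which the patent itself does not spell out. Function names are hypothetical.

```python
def assign_initial_labels(decision_values, num_positive):
    # Claim 4: mark the N unlabeled samples with the largest decision
    # function values as positive, the rest as negative.
    order = sorted(range(len(decision_values)),
                   key=lambda j: -decision_values[j])
    labels = [-1] * len(decision_values)
    for j in order[:num_positive]:
        labels[j] = 1
    return labels

def switchable_pairs(labels, slacks):
    # Claim 5's exchange rule, assuming the Joachims TSVM condition:
    # swapping a (positive, negative) pair lowers the objective when both
    # slack variables are positive and their sum exceeds 2.
    pairs = []
    for i, (yi, si) in enumerate(zip(labels, slacks)):
        for j, (yj, sj) in enumerate(zip(labels, slacks)):
            if yi == 1 and yj == -1 and si > 0 and sj > 0 and si + sj > 2:
                pairs.append((i, j))
    return pairs
```

In a full TSVM these two steps alternate with SVM retraining while Ctemp is raised toward C2, which is the outer loop claims 4 and 5 describe.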
6. The active learning-based intrusion detection method according to claim 5, wherein the cluster analysis comprises:
extracting a certain number of samples from each cluster in a certain proportion with the K-Means clustering algorithm to form n sub-sample sets, wherein n is an odd number greater than 1, and taking the n sub-sample sets as training sets;
training on the n training sets to obtain n initial classifiers C1, C2, …, Cn;
predicting each unlabeled sample with the n initial classifiers and outputting f1, f2, …, fn;
labeling the unlabeled samples, and determining whether to iterate further according to a termination condition.
7. The active learning-based intrusion detection method according to claim 6, wherein the clustering comprises:
defining the labeled sample set as L = {x1, x2, …, xl} and the number of clusters as K;
the iteration round is r, and the initial value is 0;
defining the set corresponding to the ith cluster as Ci(r); for any sample xj, j = 1, …, l, if the distance from xj to the cluster center ci(r) is the shortest, adding xj to the ith cluster;
recalculating the K cluster centers as the mean of each cluster, specifically ci(r+1) = (1/|Ci(r)|) · Σ over xj ∈ Ci(r) of xj.
8. The active learning-based intrusion detection method according to claim 7, further comprising:
defining a clustering criterion function and calculating the clustering error, specifically E = Σ over i = 1, …, K of Σ over xj ∈ Ci of ||xj − ci||²;
judging whether the stop condition is reached.
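Claims 7 and 8 together describe one round of K-Means: nearest-center assignment, center recomputation, and a stop test on the criterion function. A minimal one-dimensional sketch; initializing the centers from the first k points is a simplification, and the error formula is the standard sum of squared distances assumed for claim 8:

```python
def kmeans(points, k, rounds=20):
    # Minimal 1-D K-Means: assign each sample to the nearest center
    # (claim 7), recompute centers as cluster means, and stop when the
    # centers no longer change (claim 8's stop condition).
    centers = points[:k]                      # simple initialization
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for x in points:                      # nearest-center assignment
            i = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            clusters[i].append(x)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:            # stop condition reached
            break
        centers = new_centers
    # Clustering error: sum of squared distances to the assigned center.
    error = sum(min((x - c) ** 2 for c in centers) for x in points)
    return centers, error
```

On the points [0.0, 1.0, 10.0, 11.0] with K = 2, the loop converges to centers 0.5 and 10.5 with a clustering error of 1.0.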
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110695864.XA | 2021-06-23 | 2021-06-23 | Intrusion detection method based on active learning
Publications (1)
Publication Number | Publication Date
---|---
CN113378955A | 2021-09-10
Family
ID=77578692
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202110695864.XA (Pending) | Intrusion detection method based on active learning | 2021-06-23 | 2021-06-23
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113378955A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN102176701A (en) * | 2011-02-18 | 2011-09-07 | Harbin Institute of Technology | Active learning based network data anomaly detection method
CN104318242A (en) * | 2014-10-08 | 2015-01-28 | Air Force Engineering University of PLA | High-efficiency SVM active semi-supervised learning algorithm
CN111191238A (en) * | 2019-12-30 | 2020-05-22 | Xiamen Fuyun Information Technology Co., Ltd. | Webshell detection method, terminal device and storage medium
CN112115467A (en) * | 2020-09-04 | 2020-12-22 | Changsha University of Science and Technology | Intrusion detection method based on semi-supervised classification of ensemble learning
Non-Patent Citations (4)
Title |
---|
Du Hongle et al., "A transductive support vector machine algorithm with collaborative annotation", Journal of Chinese Computer Systems * |
Du Hongle et al., "A TSVM algorithm based on clustering and collaborative annotation", Henan Science * |
Wang Limei et al., "A transductive support vector machine learning algorithm based on k-means clustering", Computer Engineering and Applications * |
Zhao Jianhua et al., "A network intrusion detection algorithm combining active learning and semi-supervised learning", Journal of Xihua University (Natural Science Edition) * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115880268A (en) * | 2022-12-28 | 2023-03-31 | Nanjing University of Aeronautics and Astronautics | Method, system, equipment and medium for detecting defective products in plastic hose production |
CN115880268B (en) * | 2022-12-28 | 2024-01-30 | Nanjing University of Aeronautics and Astronautics | Method, system, equipment and medium for detecting defective products in plastic hose production |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210910 |