CN112437053B - Intrusion detection method and device - Google Patents


Info

Publication number
CN112437053B
Authority
CN
China
Prior art keywords: data set, feature, characteristic data, sets, matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011248506.6A
Other languages
Chinese (zh)
Other versions
CN112437053A (en)
Inventor
周献飞
徐楷
焦建林
董宁
韩盟
徐浩
陈奕倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Beijing Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Beijing Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Beijing Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202011248506.6A priority Critical patent/CN112437053B/en
Publication of CN112437053A publication Critical patent/CN112437053A/en
Application granted granted Critical
Publication of CN112437053B publication Critical patent/CN112437053B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L 63/1416: Event detection, e.g. attack signature detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/24155: Bayesian classification
    • G06F 18/243: Classification techniques relating to the number of classes
    • G06F 18/24323: Tree-organised classifiers


Abstract

The invention discloses an intrusion detection method and device. The method comprises: acquiring a first feature data set; performing dimension-reduction processing on the first feature data set to obtain a second feature data set, the dimension of the second feature data set being smaller than that of the first; and training an intrusion detection model with the second feature data set to obtain a trained intrusion detection model, which is used for intrusion detection on data to be detected. The invention addresses the technical problem of low data-detection accuracy in the related art.

Description

Intrusion detection method and device
Technical Field
The present invention relates to the field of data processing, and in particular, to an intrusion detection method and apparatus.
Background
With the development of the internet, data connections and traffic keep increasing, and so do malicious intrusions into computers and other devices and the threats they bring, which makes intrusion detection of the data necessary. When faced with large amounts of high-dimensional data, conventional intrusion detection systems commonly suffer from the curse of dimensionality, which lowers the accuracy of data detection. In addition, existing intrusion detection systems cannot identify unknown attacks during detection and therefore fail to report them, further lowering accuracy. As a result, the data-detection accuracy of existing intrusion detection systems is low.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
Embodiments of the invention provide an intrusion detection method and device that at least solve the technical problem of low data-detection accuracy in the related art.
According to one aspect of an embodiment of the invention, an intrusion detection method is provided, comprising: acquiring a first feature data set; performing dimension-reduction processing on the first feature data set to obtain a second feature data set, the dimension of the second feature data set being smaller than that of the first; and training an intrusion detection model with the second feature data set to obtain a trained intrusion detection model, which is used for intrusion detection on data to be detected. The dimension-reduction processing comprises: dividing the first feature data set with a cross-validation method to generate multiple data sets, any two of which are mutually exclusive; performing feature screening on the multiple data sets with a random forest model to obtain multiple target feature sets, each comprising a plurality of target features; and performing dimension reduction on the multiple target feature sets to obtain the second feature data set. Training the intrusion detection model with the second feature data set comprises: performing misuse detection on the second feature data set with a plurality of preset models of different types to obtain a third feature data set, the feature data in the third feature data set representing non-attack or normal data; and, based on the third feature data set, iteratively training a plurality of base classifiers with an ensemble learning algorithm to obtain the trained intrusion detection model.
Optionally, performing the dimension-reduction processing on the first feature data set to obtain the second feature data set comprises: dividing the first feature data set with a cross-validation method to generate multiple data sets, any two of which are mutually exclusive; performing feature screening on the multiple data sets with a random forest model to obtain multiple target feature sets, each comprising a plurality of target features; and performing dimension-reduction processing on the multiple target feature sets to obtain the second feature data set.
Optionally, performing feature screening on the multiple data sets with the random forest model to obtain the multiple target feature sets comprises: predicting on the multiple data sets with the random forest model to obtain a score value for each original feature in the multiple original feature sets, the score value representing the importance of that feature; obtaining a score mean for each original feature from its score values across the multiple original feature sets; and determining the multiple target feature sets based on the score mean of each original feature.
Optionally, determining the multiple target feature sets based on the score mean of each original feature comprises: sorting the original features in ascending order of score mean; and taking the first preset number of features at the front of the sorted list to obtain the plurality of target features.
Optionally, performing dimension-reduction processing on the multiple target feature sets to obtain the second feature data set comprises: constructing a first matrix from the multiple target feature sets; obtaining the covariance matrix of the first matrix; determining a second matrix based on the covariance matrix; and taking the product of the first matrix and the second matrix to obtain the second feature data set.
Optionally, determining the second matrix based on the covariance matrix comprises: obtaining the eigenvalues and eigenvectors of the covariance matrix; sorting the eigenvectors by eigenvalue to generate a third matrix; and taking a second preset number of leading rows of the third matrix to generate the second matrix.
Optionally, before obtaining the covariance matrix of the first matrix, the method further comprises: zero-averaging the first matrix to obtain a fourth matrix; and obtaining the covariance matrix of the fourth matrix.
Optionally, before taking the product of the first matrix and the second matrix, the method further comprises: centering the first matrix to obtain a fifth matrix; and taking the product of the fifth matrix and the second matrix to obtain the second feature data set.
Optionally, before performing feature screening on the multiple data sets with the random forest model to obtain the multiple target feature sets, the method further comprises: randomly dividing the multiple data sets several times to obtain multiple training sets and test sets; training the random forest model with the training sets; testing the trained random forest model with the test sets to obtain its total score; and determining, based on the total score, whether training of the random forest model is complete.
Optionally, training the intrusion detection model with the second feature data set to obtain the trained intrusion detection model comprises: performing misuse detection on the second feature data set to obtain a third feature data set, the feature data in the third feature data set representing non-attack or normal data; and, based on the third feature data set, iteratively training a plurality of base classifiers with an ensemble learning algorithm to obtain the trained intrusion detection model.
Optionally, performing misuse detection on the second feature data set to obtain the third feature data set comprises: predicting with a plurality of preset models of different types on the second feature data set and determining the detection rate of each; determining the preset model with the maximum detection rate as the target model; performing misuse detection on the second feature data set with the target model to obtain a detection result for the second feature data set; and obtaining the third feature data set based on that detection result.
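As a hedged sketch (not the patent's own code), the model-selection step above might look as follows in scikit-learn, treating the detection rate as recall on the attack class; the synthetic data and all variable names are illustrative assumptions:

```python
# Hypothetical sketch: pick the misuse-detection model with the maximum
# detection rate among several preset model types, as described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(),
    "naive_bayes": GaussianNB(),
}

# Detection rate for an intrusion detector is the recall on the attack class.
rates = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    rates[name] = recall_score(y_te, model.predict(X_te))

target_name = max(rates, key=rates.get)   # model with the maximum detection rate
target_model = candidates[target_name]
```

The chosen `target_model` would then perform the misuse detection on the full second feature data set.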
Optionally, the plurality of preset models of different types include: decision tree models, support vector machine models, and naive Bayes models.
Optionally, after acquiring the first feature data set, the method further comprises: formatting the first feature data set to obtain a processed first feature data set in which all variables have the same type.
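An illustrative sketch of this formatting step (the column names are made up for illustration; the patent does not prescribe an encoding): categorical columns are one-hot encoded so that every variable in the processed feature set shares one numeric type.

```python
# Encode categorical columns so all variables have the same (numeric) type.
import pandas as pd

raw = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp"],      # categorical
    "service": ["http", "dns", "ftp"],   # categorical
    "sbytes": [496, 1762, 1068],         # already numeric
})

processed = pd.get_dummies(raw, columns=["proto", "service"], dtype=float)
processed["sbytes"] = processed["sbytes"].astype(float)

# Every column in the processed first feature data set now has the same type.
assert processed.dtypes.nunique() == 1
```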
According to another aspect of an embodiment of the invention, an intrusion detection apparatus is also provided, comprising: an acquisition module for acquiring a first feature data set; a processing module for performing dimension-reduction processing on the first feature data set to obtain a second feature data set whose dimension is smaller than that of the first; and a training module for training an intrusion detection model with the second feature data set to obtain a trained intrusion detection model, which is used for intrusion detection on data to be detected. The training module comprises: a detection unit for performing misuse detection on the second feature data set with a plurality of preset models of different types to obtain a third feature data set, the feature data in the third feature data set representing non-attack or normal data; and a second training unit for iteratively training a plurality of base classifiers with an ensemble learning algorithm, based on the third feature data set, to obtain the trained intrusion detection model.
According to another aspect of an embodiment of the invention, a computer-readable storage medium is also provided, comprising a stored program which, when run, controls the device on which the storage medium resides to execute the intrusion detection method described above.
According to another aspect of an embodiment of the invention, a processor is also provided, configured to run a program which, when running, executes the intrusion detection method described above.
In embodiments of the invention, the first feature data set is acquired first; dimension-reduction processing is then performed on it to obtain the second feature data set, whose dimension is smaller than that of the first; finally, the second feature data set is used to train the intrusion detection model, and the trained model performs intrusion detection on the data to be detected. Performing dimension reduction on the first feature data set avoids the curse of dimensionality and thereby improves the accuracy of data detection.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of an intrusion detection method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another intrusion detection method according to an embodiment of the invention;
FIG. 3 is a schematic diagram of an intrusion detection device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another intrusion detection device according to an embodiment of the present invention.
Detailed Description
To help those skilled in the art better understand the present invention, the technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without inventive effort shall fall within the scope of the invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an embodiment of the present invention, a method embodiment of intrusion detection is provided. It should be noted that the steps shown in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps may be performed in an order different from that shown or described herein.
Fig. 1 is a flowchart of an intrusion detection method according to an embodiment of the present invention, as shown in fig. 1, the method including the steps of:
step S102, a first feature data set is acquired.
The first feature data set in the above step is a data set for network intrusion detection, and may be at least one of the following: a dataset based on network traffic, a grid-based dataset, an internet-traffic-based dataset, a virtual-private-network-based dataset, an Android-application-based dataset, an internet-of-things (IoT)-traffic-based dataset, or a dataset based on internet-connected devices. The network-traffic-based dataset may be, among others, the DARPA 1998 dataset, the KDD Cup 1999 dataset, the NSL-KDD dataset, or the UNSW-NB15 dataset.
And step S104, performing dimension reduction processing on the first characteristic data set to obtain a second characteristic data set.
Wherein the second feature data set has a smaller dimension than the first feature data set.
In an alternative embodiment, reducing the dimension of the feature data retains the maximum amount of information in the feature data set while avoiding the curse of dimensionality; lowering the dimensionality also reduces the amount of computation on the feature data exponentially and thus lowers its computational complexity.
And step S106, training the intrusion detection model by using the second characteristic data set to obtain a trained intrusion detection model.
The trained intrusion detection model is used for intrusion detection of data to be detected.
In an alternative embodiment, the dimension-reduced feature data set is used to train the intrusion detection model in real time, so that the model can detect unknown data attacks in a timely manner, predict intrusion data more accurately, and thereby reduce the false alarm rate of the intrusion detection model.
According to this embodiment of the invention, the first feature data set is acquired first; dimension-reduction processing is then performed on it to obtain the second feature data set, whose dimension is smaller than that of the first; finally, the second feature data set is used to train the intrusion detection model, and the trained model performs intrusion detection on the data to be detected. Performing dimension reduction on the first feature data set avoids the curse of dimensionality and thereby improves the accuracy of data detection.
Optionally, performing the dimension-reduction processing on the first feature data set to obtain the second feature data set comprises: dividing the first feature data set with a cross-validation method to generate multiple data sets, any two of which are mutually exclusive; performing feature screening on the multiple data sets with a random forest model to obtain multiple target feature sets, each comprising a plurality of target features; and performing dimension-reduction processing on the multiple target feature sets to obtain the second feature data set.
The cross-validation method in the above step, also called rotation estimation, is a practical way of cutting a data sample into smaller subsets.
The random forest in the above step is a classifier comprising multiple decision trees; its output class is the mode of the classes output by the individual trees.
In an alternative embodiment, the cross-validation method divides the processed data set into mutually exclusive training subsets, generating multiple training-set/test-set pairs, which avoids the errors a single test could introduce. The divided data sets are then evaluated with a random forest: for each training/test pair a contribution rate is obtained for each feature, the contribution rates are averaged, and several correlated features with small average contribution rates are selected. PCA (Principal Component Analysis) is then used to construct a smaller set of mutually uncorrelated new features that replaces those original features, so that the new features reflect the information in the original features to the greatest extent while the information between indices does not overlap. Replacing the selected original features with the newly obtained features yields a new data set that has fewer feature dimensions while ensuring each remaining dimension carries more information.
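The screening-plus-PCA pipeline described above can be sketched as follows. This is an assumed scikit-learn implementation, with synthetic data and arbitrary choices of m = 8 low-importance features and 3 principal components; none of these names or numbers come from the patent.

```python
# Sketch: mutually exclusive folds -> averaged RF importances -> PCA on the
# low-importance features, keeping the high-importance features unchanged.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

importances = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    rf.fit(X[train_idx], y[train_idx])
    importances.append(rf.feature_importances_)
mean_importance = np.mean(importances, axis=0)

# Replace the m least-important features with a few principal components.
m = 8
low = np.argsort(mean_importance)[:m]
high = np.setdiff1d(np.arange(X.shape[1]), low)
reduced = PCA(n_components=3).fit_transform(X[:, low])
X_new = np.hstack([X[:, high], reduced])   # 20 - 8 + 3 = 15 features
```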
Optionally, performing feature screening on the multiple data sets with the random forest model to obtain the multiple target feature sets comprises: predicting on the multiple data sets with the random forest model to obtain a score value for each original feature in the multiple original feature sets, the score value representing the importance of that feature; obtaining a score mean for each original feature from its score values across the multiple original feature sets; and determining the multiple target feature sets based on the score mean of each original feature.
In an alternative embodiment, a random forest model may be used to predict on the multiple data sets, obtaining the contribution each feature makes on each tree in the forest; the contributions are then averaged and compared across features. The Gini index or the out-of-bag (OOB) error rate is typically used as the evaluation index. By comparing contributions, features with larger contribution values can be kept as features of the target feature set, and features with smaller contribution values can be removed.
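A minimal illustration of these two evaluation indices, assuming scikit-learn (the data are synthetic): `feature_importances_` is the Gini-based mean decrease in impurity, while `oob_score_` gives the out-of-bag accuracy of the fitted forest.

```python
# Gini-based feature contributions and the out-of-bag (OOB) estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=12, random_state=1)
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            bootstrap=True, random_state=1)
rf.fit(X, y)

gini_scores = rf.feature_importances_      # one contribution value per feature
oob_accuracy = rf.oob_score_               # out-of-bag accuracy estimate

# Features with larger contribution values would stay in the target set.
keep = gini_scores.argsort()[::-1][:6]
```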
Illustratively, the Gini index may be used to derive the score mean of each original feature. The importance score of a variable is denoted by VIM (Variable Importance Measure). Suppose there are c features x_1, x_2, x_3, ..., x_c; for each feature x_j we compute its Gini importance score VIM_j^{(Gini)}, i.e. the average change in node-splitting impurity caused by feature j across all decision trees.
The Gini index of node m is calculated as:

GI_m = 1 - \sum_{k=1}^{K} p_k^2

where K denotes the number of classes and p_k the sample weight of class k at node m. The importance of feature x_j at node m, i.e. the change in Gini index before and after branching at node m, is:

VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r

where GI_l and GI_r are the Gini indices of the two new nodes after branching. If M is the set of nodes of decision tree i at which feature x_j appears, then the importance of x_j in the i-th tree is:

VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)}

Assuming that there are n trees, then:

VIM_j^{(Gini)} = \sum_{i=1}^{n} VIM_{ij}^{(Gini)}

Finally, all the obtained importance scores are normalized, as shown in Table 1:

VIM_j = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{c} VIM_{j'}^{(Gini)}}

The denominator is the sum of all feature gains and the numerator is the Gini index of feature j.
TABLE 1

| Feature | Mean contribution rate | Feature | Mean contribution rate | Feature | Mean contribution rate |
|---------|------------------------|---------|------------------------|---------|------------------------|
| dur     | 0.06789 | dloss  | 0.0095  | trans_depth       | 0.00208 |
| proto   | 0.0168  | sinpkt | 0.01326 | response_body_len | 0.00425 |
| service | 0.02767 | dinpkt | 0.02406 | ct_srv_src        | 0.02814 |
| state   | 0.01578 | sjit   | 0.00723 | ct_state_ttl      | 0.05433 |
| spkts   | 0.00879 | djit   | 0.00784 | ct_dst_ltm        | 0.01183 |
| dpkts   | 0.04487 | swin   | 0.01366 | ct_src_dport_ltm  | 0.01409 |
| sbytes  | 0.07697 | stcpb  | 0.00479 | ct_dst_sport_ltm  | 0.04049 |
| dbytes  | 0.01584 | dtcpb  | 0.00481 | ct_dst_src_ltm    | 0.09199 |
| rate    | 0.01298 | dwin   | 0.00075 | is_ftp_login      | 0.00014 |
| sttl    | 0.01925 | tcprtt | 0.04222 | ct_ftp_cmd        | 0.0001  |
| dttl    | 0.0884  | synack | 0.02442 | ct_flw_http_mthd  | 0.00231 |
| sload   | 0.01486 | ackdat | 0.01299 | ct_src_ltm        | 0.00852 |
| dload   | 0.01313 | smean  | 0.02917 | ct_srv_dst        | 0.05406 |
| sloss   | 0.02778 | dmean  | 0.04264 | is_sm_ips_ports   | 0.00416 |
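The Gini formulas above can be checked numerically with a small helper; `gini` is a hypothetical function written for this sketch, and the class proportions are invented values.

```python
# Worked numeric check of the Gini-index formulas (illustrative only).
def gini(class_weights):
    """GI = 1 - sum_k p_k**2 for class proportions p_k of a node."""
    return 1.0 - sum(p * p for p in class_weights)

gi_parent = gini([0.5, 0.5])   # maximally impure two-class node
gi_left = gini([0.9, 0.1])     # nearly pure child
gi_right = gini([0.1, 0.9])    # nearly pure child

# VIM_jm = GI_m - GI_l - GI_r: change in impurity from splitting node m,
# following the formula as stated in the text above.
vim_jm = gi_parent - gi_left - gi_right
```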
Optionally, determining the multiple target feature sets based on the score mean of each original feature comprises: sorting the original features in ascending order of score mean; and taking the first preset number of features at the front of the sorted list to obtain the plurality of target features.
The first preset number in the above step may be set by the user, and the plurality of target features are the features on which dimension-reduction processing is to be performed.
In an alternative embodiment, the feature importance score VIM_j may be computed on each of the small data sets described above and averaged, and the m features with the smallest importance scores selected for dimension reduction.
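The ascending sort and selection of the first preset number of features might be implemented as follows; the score values and the preset number are invented for illustration.

```python
# Sort features by score mean in ascending order and take the foremost
# (lowest-scoring) preset number of features for dimension reduction.
import numpy as np

score_mean = np.array([0.068, 0.017, 0.028, 0.0001, 0.054, 0.004])
first_preset_number = 3

order = np.argsort(score_mean)             # ascending order of score mean
target_idx = order[:first_preset_number]   # foremost = smallest scores
```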
Optionally, performing dimension-reduction processing on the multiple target feature sets to obtain the second feature data set comprises: constructing a first matrix from the multiple target feature sets; obtaining the covariance matrix of the first matrix; determining a second matrix based on the covariance matrix; and taking the product of the first matrix and the second matrix to obtain the second feature data set.
In an alternative embodiment, the first matrix may be a matrix x and the second matrix a matrix p. The dimension-reduction process for m-dimensional data is: form the raw data into an m-row, a-column matrix x by columns; zero-average each row of x, i.e. subtract the mean of that row; compute the covariance matrix; compute its eigenvalues and corresponding eigenvectors r; arrange the eigenvectors r into a matrix by rows, from top to bottom in order of decreasing eigenvalue; take the first k rows to form the matrix p; and multiply the matrix formed by the k eigenvectors with the centered data matrix to obtain the data reduced to u dimensions. The compression error can be written as:

error = \frac{\sum_{i=1}^{a} \lVert x^{(i)} - \hat{x}^{(i)} \rVert^2}{\sum_{i=1}^{a} \lVert x^{(i)} \rVert^2}

where u is the number of retained features and \hat{x}^{(i)} is the reconstruction of sample x^{(i)} from those u features. A threshold value x, such as 0.01, is then chosen; if error < x, reducing the dimension to u is considered acceptable. Replacing the original m-dimensional features with the new u-dimensional features finally yields a new data set of Y = (x - m + u) features, i.e. the second feature data set, which is used for intrusion detection.
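One way to pick the reduced dimension u against the error threshold, sketched with scikit-learn's PCA (an assumption; the patent does not prescribe a library), is to keep the smallest u whose retained-variance ratio leaves the compression error below the threshold:

```python
# Choose the smallest u such that the compression error stays below x = 0.01.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))  # correlated features

pca = PCA().fit(data)
x = 0.01
retained = np.cumsum(pca.explained_variance_ratio_)
u = int(np.searchsorted(retained, 1 - x) + 1)   # first u with error < x

reduced = PCA(n_components=u).fit_transform(data)
```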
Optionally, determining the second matrix based on the covariance matrix comprises: obtaining the eigenvalues and eigenvectors of the covariance matrix; sorting the eigenvectors by eigenvalue to generate a third matrix; and taking a second preset number of leading rows of the third matrix to generate the second matrix.
In an alternative embodiment, the eigenvectors of the covariance matrix may be denoted r, the third matrix may be the matrix q, and the second matrix may be the matrix p. The eigenvalues of the covariance matrix and the corresponding eigenvectors r are obtained; the eigenvectors r are arranged into a matrix q by rows, from top to bottom in order of decreasing eigenvalue; and the first k rows of q are taken to form the matrix p.
Optionally, before acquiring the covariance matrix of the first matrix, the method further comprises: zero-equalizing the first matrix to obtain a fourth matrix; and obtaining a covariance matrix of the fourth matrix.
The zero-averaging in the above steps means that each variable subtracts its mean value. This is in effect a shifting process: after shifting, the center of all the data lies at the origin. Zero-averaging eliminates errors caused by different dimensions, inherent variation, or large differences in magnitude.
In an alternative embodiment, zero-averaging may be performed on each line of data in the first matrix, that is, the average value of the line is subtracted from each line of data to obtain a fourth matrix, and a covariance matrix of the fourth matrix may be obtained.
Optionally, before obtaining the product of the first matrix and the second matrix, the method further comprises: centering the first matrix to obtain a fifth matrix; and obtaining the product of the fifth matrix and the second matrix to obtain a second characteristic data set.
The centering in the above steps has the same effect as the zero-averaging: it eliminates errors caused by different dimensions, inherent variation, or large differences in magnitude.
In an alternative embodiment, zero-averaging may be performed on each data in the first matrix, that is, the average value of all the data is subtracted from each data to obtain a fifth matrix, and the product of the fifth matrix and the second matrix may be obtained to obtain the second feature data set.
Optionally, before feature screening is performed on the multiple sets of data sets by using the random forest model to obtain multiple sets of target feature sets, the method further includes: randomly dividing a plurality of groups of data sets for a plurality of times to obtain a plurality of groups of training sets and test sets; training the random forest model by utilizing a plurality of groups of training sets; testing the trained random forest model by using the test set to obtain the total score of the trained random forest model; it is determined whether training of the random forest model is complete based on the total score.
In an alternative embodiment, the multiple sets of data may be split into multiple small data sets using k-fold cross validation. The k-fold cross validation method randomly divides the sample data into k parts; each time, k−1 parts are selected as the training set and the remaining 1 part serves as the test set. After the first division is completed, k−1 parts can be randomly selected again for training, so that k training data sets and k test data sets are obtained.
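For illustration only, the k-fold division above may be sketched as follows; this index-based helper and its seed are assumptions, not part of the original disclosure:

```python
import numpy as np

def k_fold_split(n_samples, k, seed=0):
    """Randomly divide sample indices into k mutually exclusive parts and
    yield (train_indices, test_indices) pairs, as in the k-fold cross
    validation described above.  Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        # the remaining k-1 parts form the training set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

Each of the k pairs uses a different part as the test set, and every pair covers the full sample set with no overlap between its training and test indices.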
The process of training the random forest model with the multiple sets of training sets may be as follows: select n samples from the sample set by sampling with replacement as a training set, and generate a decision tree from the sampled set; at each node of the decision tree, randomly select d features without repetition and use them to divide the sample set, finding the optimal dividing feature. The sampling, feature selection and optimal-division steps are repeated m times, where m is the number of decision trees in the random forest. The test samples are then predicted with the trained random forest, and the predicted result is determined by voting.
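For illustration only, the random-forest training just described maps onto scikit-learn's RandomForestClassifier roughly as follows; the library choice, synthetic data set and parameter values are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data stands in for a training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,   # m: number of decision trees in the forest
    max_features=3,    # d: features drawn without repetition at each node
    bootstrap=True,    # each tree sees a sample drawn with replacement
    random_state=0,
)
forest.fit(X, y)
pred = forest.predict(X)   # prediction by majority vote over the 50 trees
```

Each tree is grown on its own bootstrap sample and considers only `max_features` candidate features per split, matching the d-feature division step in the text.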
On {A_2, A_3, A_4, …, A_k}, a random forest model M_1 is built and verified on the data set A_1; the predicted values are compared with the true values, and a score a_1 is calculated under a chosen evaluation standard.

On {A_1, A_3, A_4, …, A_k}, a model is built and verified on the data set A_2; the predicted values are compared with the true values, and a score a_2 is calculated under the same evaluation standard.

On {A_1, A_2, A_3, …, A_{k−1}}, a model is built and verified on the data set A_k; the predicted values are compared with the true values, and a score a_k is calculated under the same evaluation standard.

The mean (a_1 + a_2 + … + a_k) / k is taken as the composite score of the model M_1.

Here A_1, A_2, A_3, …, A_k respectively denote the k data sets obtained by the k-fold cross validation method, and M_1 denotes the trained random forest model. In each of the obtained scores a_1, a_2, …, a_k, each feature has a different importance.
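For illustration only, the per-fold scoring scheme above may be sketched as follows; accuracy stands in for the evaluation standard, and the data set and parameters are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def composite_score(X, y, k=5, seed=0):
    """Build a random forest on k-1 folds, score it on the held-out
    fold A_i (accuracy as the illustrative evaluation standard), and
    return (a_1 + ... + a_k) / k as the composite score."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = RandomForestClassifier(n_estimators=30, random_state=seed)
        model.fit(X[train], y[train])
        scores.append(model.score(X[folds[i]], y[folds[i]]))  # a_i
    return sum(scores) / k

X, y = make_classification(n_samples=150, n_features=8, random_state=1)
score = composite_score(X, y, k=5)
```

The returned value plays the role of the composite score used to judge whether training of the random forest model is complete.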
Optionally, training the intrusion detection model using the second feature data set, and obtaining the trained intrusion detection model includes: performing misuse detection on the second characteristic data set to obtain a third characteristic data set, wherein characteristic data contained in the third characteristic data set are used for representing non-attack data or normal data; and performing iterative training on the plurality of base classifiers by using an ensemble learning algorithm based on the third feature data set to obtain a trained intrusion detection model.
Misuse detection in the above steps is a method for detecting computer attacks. In misuse detection, known attacks can simply be added to the model; the false alarm rate of the detection is low and the detection efficiency is high.
In an alternative embodiment, misuse detection may be performed on the second feature data set. After the misuse detection there is generally some non-attack data left over; this data may be extracted from the second feature data set to generate a new feature data set, namely the third feature data set.
In an alternative embodiment, the process of performing iterative training on the plurality of base classifiers using the ensemble learning algorithm may be as follows. First, the weight distribution of the training data is initialized:

D_1 = (w_{1,1}, w_{1,2}, …, w_{1,N}), w_{1,i} = 1/N, i = 1, 2, …, N.

Then, for m = 1, 2, …, M, a basic classifier G_m(x) is learned from the training data set with the weight distribution D_m, and the classification error rate of G_m(x) on the training set is calculated:

e_m = Σ_{i=1}^{N} w_{m,i} I(G_m(x_i) ≠ y_i).

The coefficient of G_m(x) is calculated as

a_m = (1/2) log((1 − e_m) / e_m),

and the weight distribution is updated (Z_m is a normalization factor that makes D_{m+1} a probability distribution):

w_{m+1,i} = (w_{m,i} / Z_m) exp(−a_m y_i G_m(x_i)),
Z_m = Σ_{i=1}^{N} w_{m,i} exp(−a_m y_i G_m(x_i)).

Finally, a linear combination of the basis classifiers is constructed to obtain the final classifier:

f(x) = Σ_{m=1}^{M} a_m G_m(x),
G(x) = sign(f(x)).
Finally, the results may be predicted using a final classifier, wherein the final classifier is a trained intrusion detection model.
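For illustration only, the iterative training above can be sketched with scikit-learn's AdaBoostClassifier, which implements the same weight update and weighted linear combination; the library choice, synthetic data set and parameter values are assumptions, not part of the original disclosure:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic two-class data stands in for the third feature data set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Each boosting round fits a base classifier G_m on the reweighted data,
# computes its coefficient a_m, and raises the weights of misclassified
# samples, matching the update scheme written out above.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
final_pred = clf.predict(X)  # sign of the weighted combination of the G_m
```

The fitted classifier plays the role of the trained intrusion detection model used to predict results.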
Optionally, performing misuse detection on the second feature data set, and obtaining the third feature data set includes: predicting a plurality of different types of preset models by using the second characteristic data set, and determining the detection rate of the plurality of different types of preset models; determining a preset model corresponding to the maximum detection rate as a target model; performing misuse detection on the second characteristic data set by using the target model to obtain a detection result of the second characteristic data set; and obtaining a third characteristic data set based on the detection result of the second characteristic data set.
Optionally, the plurality of different types of preset models include: decision tree models, support vector machine models, and naive bayes models.
When a naive Bayes model is employed, the classification model samples are assumed to be

(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m)),

i.e. m samples, each sample having n features, with the feature outputs having K categories, defined as C_1, C_2, …, C_K. The naive Bayes prior distribution P(Y = C_k) (k = 1, 2, …, K) is obtained from sample learning, then the conditional probability distribution P(X = x | Y = C_k) = P(X_1 = x_1, X_2 = x_2, …, X_n = x_n | Y = C_k) is learned. The joint distribution P(X, Y) of X and Y is then obtained using the Bayes formula:

P(X, Y = C_k) = P(Y = C_k) P(X = x | Y = C_k)
             = P(Y = C_k) P(X_1 = x_1, X_2 = x_2, …, X_n = x_n | Y = C_k)
             = P(Y = C_k) P(X_1 = x_1 | Y = C_k) P(X_2 = x_2 | Y = C_k) … P(X_n = x_n | Y = C_k).

P(Y = C_k) is estimated by maximum likelihood as the frequency with which category C_k occurs in the training set, and the category corresponding to the maximum conditional probability is found; this is the naive Bayes prediction.
When the support vector machine model is adopted, the classification function

f(x) = sign( Σ_{i=1}^{l} a_i y_i K(x_i, x) + b )

is adopted, where l represents the number of training samples, x represents the vector of the instance to be classified, x_i and y_i represent the attribute vector and class identification of the i-th sample, K(x_i, x) represents a kernel function, and a_i and b represent model parameters. The a_i are obtained through quadratic programming, and w and b are further obtained to give the classification model g(x) = w·x + b. When g(x) > 0 and g(x) < 0, x belongs to the respective different categories, and the plane with the largest distance from the two categories of objects is selected.
When a decision tree model is used, attributes are selected according to the Gini index, the information gain or the information gain ratio, and branches are built downward according to the attributes until all samples on a node belong to the same class, or the number of samples in a node falls below a given value. Overfitting is prevented by pre-pruning, post-pruning or a combination of the two to obtain the final model.
In an alternative embodiment, the three obtained models are used to predict the second feature data set, and the model with the highest test detection rate is selected as the target model, reducing the rate of missed reports in the detection process. The non-attack and normal data from this process are then extracted as a new data set, namely the third feature data set.
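For illustration only, the selection of the target model among the three preset models may be sketched as follows; held-out accuracy stands in for the detection rate, and the data set and library choice are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the second feature data set.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(),
    "naive_bayes": GaussianNB(),
}
# Fit each preset model and record its detection rate on the held-out split.
rates = {name: model.fit(X_tr, y_tr).score(X_te, y_te)
         for name, model in candidates.items()}
target_name = max(rates, key=rates.get)  # model with the maximum rate
target_model = candidates[target_name]
```

The selected `target_model` would then perform the misuse detection on the second feature data set.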
Optionally, after acquiring the first feature data set, the method further comprises: and formatting the first characteristic data set to obtain a processed first characteristic data set, wherein the types of variables contained in the processed first characteristic data set are the same.
In an alternative embodiment, the formatting process may be a digitizing process. Since the first feature data set contains both numeric and character-type variables, the first feature data set needs to be uniformly digitized to facilitate its subsequent processing.
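For illustration only, the digitizing step may be sketched as follows; the column names are hypothetical, loosely styled after common intrusion-detection features, and pandas is an assumed library choice:

```python
import pandas as pd

# Hypothetical slice of a first feature data set mixing numeric and
# character-type variables.
raw = pd.DataFrame({
    "duration": [0, 2, 1],
    "protocol_type": ["tcp", "udp", "tcp"],  # character-type variable
    "flag": ["SF", "S0", "SF"],              # character-type variable
})

digitized = raw.copy()
for col in digitized.columns:
    if digitized[col].dtype == object:
        # map each distinct string to an integer code, in order of appearance
        digitized[col] = pd.factorize(digitized[col])[0]
```

After this pass, every variable in the data set shares a numeric type, as the formatting step requires.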
The left side of Table 2 shows the accuracy of several machine learning algorithms used for misuse detection after feature processing; by comparison, decision trees can be selected for misuse detection. The right side of Table 2 shows the accuracy without feature processing. Using decision trees for misuse detection after feature processing, together with the accuracy of the test set after a further round of detection using the ensemble algorithm, the accuracy of the intrusion detection method in this embodiment can be obtained by combining the experimental results.
TABLE 2
[Table 2: accuracy of the algorithms with feature processing (left) and without feature processing (right); rendered as an image in the original publication.]
A preferred embodiment of the present invention will be described in detail with reference to fig. 2 to 3. As shown in fig. 2, the method may include the following steps:
Step S201, dividing a data set to obtain a plurality of mutually exclusive training test sets;
Alternatively, the data set may first be uniformly digitized for subsequent use.
Alternatively, the processed data set can be divided into mutually exclusive subsets by cross validation, generating multiple groups of training sets and test sets, so that errors caused by a single test can be avoided.
As shown in fig. 3, the data set is divided into a training test set 1, a training test set 2, and a training test set N.
Step S202, testing is carried out on the divided data sets using random forests, and a set of feature contribution rates is obtained for each training test set;
as shown in fig. 3, a random forest 1 is used to test training test set 1, obtaining the contribution-rate ranking of feature 1 to feature x in training test set 1; a random forest 2 is used to test training test set 2, obtaining the contribution-rate ranking of feature 1 to feature x in training test set 2; and a random forest N is used to test training test set N, obtaining the contribution-rate ranking of feature 1 to feature x in training test set N.
Step S203, averaging the feature contribution rates, and selecting several correlated features with smaller average contribution rates as the target features;
As shown in fig. 3, the feature contribution rates of each training test set are averaged, and several correlated features with smaller average contribution rates are selected; the features in the first m rows of each training test set's feature ranking may be selected as the target features.
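For illustration only, steps S201 to S203 can be sketched with scikit-learn, where `feature_importances_` plays the role of the contribution rate; the data set, fold count and the value of m are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Synthetic data stands in for the first feature data set.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

importances = []
for train, _ in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    rf = RandomForestClassifier(n_estimators=30, random_state=0)
    rf.fit(X[train], y[train])
    importances.append(rf.feature_importances_)  # per-split contribution rates

# Average each feature's contribution rate over the splits, then take the
# m features at the front of the ascending ranking as targets for PCA.
mean_importance = np.mean(importances, axis=0)
m = 3
target_features = np.argsort(mean_importance)[:m]
```

The remaining (higher-contribution) features would be kept as-is, while the `target_features` are recombined by PCA in step S204.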
Step S204, PCA is used to recombine the selected features into a smaller set of mutually uncorrelated new comprehensive indexes to replace the original indexes, and the newly obtained multidimensional features replace the original multidimensional features to obtain a new data set;
as shown in fig. 3, feature 1 through feature Y in each training test set may be combined using PCA to obtain a new data set.
The new data set reduces feature dimensions and ensures that each dimension feature contains more information.
Step S205, performing misuse detection based on the new data set;
when selecting the algorithm, several algorithms, such as decision trees, support vector machines and naive Bayes, can each be tested, and the algorithm with the highest test accuracy is finally selected. This ensures that the first-layer detection has a lower missed-report rate.
As shown in fig. 3, decision trees, support vector machines and naive Bayes may each be used to test the new data sets; the result with the highest accuracy is added to the intrusion rule base, and misuse detection is performed based on the data set.
Step S206, after misuse test, extracting non-attack data and normal data in the characteristic rule base to be used as a new data set;
step S207, performing anomaly detection based on the data set after misuse test to obtain a strong classifier;
optionally, the anomaly detection trains a plurality of base classifiers based on the data set obtained after misuse detection, iterates the base classifiers using the AdaBoost (Adaptive Boosting) algorithm among the ensemble learning algorithms, gives a higher weight in each round to misclassified samples, and finally combines them into a strong classifier to perform anomaly detection.
And step S208, performing intrusion detection by using a strong classifier, and extracting non-attack data and normal data in the characteristic rule base to be used as a final data set.
As shown in fig. 3, after anomaly detection, some attack data which is not in the intrusion rule base can be extracted to combine with normal data to form a new data set.
Example 2
According to the embodiment of the present invention, there is further provided an intrusion detection device, which can execute the intrusion detection method in the above embodiment, and the specific implementation manner and the preferred application scenario are the same as those of the above embodiment, and are not described herein.
Fig. 4 is a schematic diagram of an intrusion detection device according to an embodiment of the present invention, as shown in fig. 4, the device includes:
an acquisition module 42 for acquiring a first feature data set;
a processing module 44, configured to perform a dimension reduction process on the first feature data set to obtain a second feature data set, where a dimension of the second feature data set is smaller than that of the first feature data set;
the training module 46 is configured to train the intrusion detection model by using the second feature data set to obtain a trained intrusion detection model, where the trained intrusion detection model is used for intrusion detection on data to be detected.
Optionally, the processing module includes: the dividing unit is used for dividing the first characteristic data set by using a cross verification method to generate a plurality of groups of data sets, wherein any two groups of data sets have a mutual exclusion relation; the screening unit is used for carrying out feature screening on a plurality of groups of data sets by utilizing a random forest model to obtain a plurality of groups of target feature sets, wherein each group of target feature sets comprises: a plurality of target features; and the processing unit is used for performing dimension reduction processing on the multiple groups of target feature sets to obtain a second feature data set.
Optionally, the screening unit comprises: the prediction subunit is used for predicting the plurality of groups of data sets by utilizing the random forest model to obtain a grading value of each original feature contained in the plurality of groups of original feature sets, wherein the grading value is used for representing the importance degree of each original feature; the first acquisition subunit is used for obtaining the score mean value of each original feature based on the score value of each original feature contained in the plurality of groups of original feature sets; a first determining subunit, configured to determine a plurality of target feature sets based on the score average of each original feature.
Optionally, the determining subunit further sorts the plurality of original features in ascending order according to the score average of each original feature, and obtains a first preset number of original features from the front of the sorted features, so as to obtain the plurality of target features.
Optionally, the processing unit comprises: a construction subunit configured to construct a first matrix based on the plurality of sets of target feature sets; a second obtaining subunit, configured to obtain a covariance matrix of the first matrix; a second determination subunit configured to determine a second matrix based on the covariance matrix; the second obtaining subunit is further configured to obtain a product of the first matrix and the second matrix, to obtain a second feature data set.
Optionally, the second determining subunit is further configured to obtain the eigenvalues and eigenvectors of the covariance matrix, sort the eigenvectors according to the magnitudes of the eigenvalues to generate a third matrix, and take a second preset number of rows from the front of the third matrix to generate the second matrix.
Optionally, the processing unit comprises: the first processing subunit is used for carrying out zero-mean processing on the first matrix to obtain a fourth matrix; and the third acquisition subunit is used for acquiring the covariance matrix of the fourth matrix.
Optionally, the processing unit comprises: the second processing subunit is used for carrying out centering processing on the first matrix to obtain a fifth matrix; and the fourth acquisition subunit is used for acquiring the product of the fifth matrix and the second matrix to obtain a second characteristic data set.
Optionally, the processing module includes: the dividing unit is also used for randomly dividing the multiple groups of data sets for multiple times to obtain multiple groups of training sets and test sets; the first training unit is used for training the random forest model by utilizing a plurality of groups of training sets; the test unit is used for testing the trained random forest model by using the test set to obtain the total score of the trained random forest model; and the determining unit is used for determining whether training of the random forest model is completed or not based on the total score.
Optionally, the training module includes: the detection unit is used for carrying out misuse detection on the second characteristic data set to obtain a third characteristic data set, wherein characteristic data contained in the third characteristic data set are used for representing non-attack data or normal data; and the second training unit is used for carrying out iterative training on the plurality of base classifiers by utilizing an integrated learning algorithm based on the third characteristic data set to obtain a trained intrusion detection model.
Optionally, the detection unit includes: the prediction subunit is used for predicting a plurality of different types of preset models by using the second characteristic data set and determining the detection rate of the plurality of different types of preset models; the third determining subunit is used for determining a preset model corresponding to the maximum detection rate as a target model; the detection subunit is used for carrying out misuse detection on the second characteristic data set by utilizing the target model to obtain a detection result of the second characteristic data set; and a fifth obtaining subunit, configured to obtain a third feature data set based on the detection result of the second feature data set.
Optionally, the plurality of different types of preset models in the detection unit include: decision tree models, support vector machine models, and naive bayes models.
Optionally, the processing module is further configured to perform formatting processing on the first feature data set to obtain a processed first feature data set, where the types of variables included in the processed first feature data set are the same.
Example 3
According to an embodiment of the present invention, there is also provided a computer-readable storage medium, including a stored program, where the device in which the computer-readable storage medium is controlled to execute the intrusion detection method in embodiment 1 described above when the program runs.
Example 4
According to an embodiment of the present invention, there is also provided a processor for running a program, where the program executes the intrusion detection method in embodiment 1.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology content may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, for example, may be a logic function division, and may be implemented in another manner, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims (14)

1. An intrusion detection method, comprising:
acquiring a first characteristic data set;
performing dimension reduction processing on the first characteristic data set to obtain a second characteristic data set, wherein the dimension of the second characteristic data set is smaller than that of the first characteristic data set;
training an intrusion detection model by using the second characteristic data set to obtain a trained intrusion detection model, wherein the trained intrusion detection model is used for intrusion detection of data to be detected;
the step of performing dimension reduction processing on the first characteristic data set to obtain a second characteristic data set comprises the following steps: dividing the first characteristic data set by using a cross verification method to generate a plurality of groups of data sets, wherein any two groups of data sets have a mutual exclusion relation; and performing feature screening on the multiple groups of data sets by using a random forest model to obtain multiple groups of target feature sets, wherein each group of target feature sets comprises: a plurality of target features; performing dimension reduction on the multiple groups of target feature sets to obtain a second feature data set;
The training the intrusion detection model by using the second feature data set, and obtaining the trained intrusion detection model includes: performing misuse detection on the second characteristic data set by using a plurality of different types of preset models to obtain a third characteristic data set, wherein characteristic data contained in the third characteristic data set are used for representing non-attack data or normal data; and performing iterative training on the plurality of base classifiers by using an ensemble learning algorithm based on the third feature data set to obtain the trained intrusion detection model.
2. The method of claim 1, wherein feature screening the plurality of sets of data using the random forest model to obtain a plurality of sets of target features comprises:
predicting the multiple groups of data sets by utilizing the random forest model to obtain a grading value of each original feature contained in the multiple groups of original feature sets, wherein the grading value is used for representing the importance degree of each original feature;
obtaining a scoring mean value of each original feature based on scoring values of each original feature contained in the plurality of groups of original feature sets;
and determining the multiple target feature sets based on the scoring mean of each original feature.
3. The method of claim 2, wherein determining the plurality of sets of target features based on the scored mean for each original feature comprises:
according to the grading average value of each original feature, ascending order is carried out on the plurality of original features;
and acquiring a first preset number of original features at the forefront in the sorted plurality of original features, and obtaining the plurality of target features.
4. The method of claim 1, wherein performing a dimension reduction process on the plurality of sets of target feature sets to obtain the second feature data set comprises:
constructing a first matrix based on the multiple sets of target feature sets;
acquiring a covariance matrix of the first matrix;
determining a second matrix based on the covariance matrix;
and obtaining the product of the first matrix and the second matrix to obtain the second characteristic data set.
5. The method of claim 4, wherein determining a second matrix based on the covariance matrix comprises:
acquiring eigenvalues and eigenvectors of the covariance matrix;
sorting the feature vectors according to the magnitude of the feature values to generate a third matrix;
and acquiring a second preset number of forefront row matrixes in the third matrix, and generating the second matrix.
6. The method of claim 4, wherein prior to obtaining the covariance matrix of the first matrix, the method further comprises:
zero-equalizing the first matrix to obtain a fourth matrix;
and acquiring the covariance matrix of the fourth matrix.
7. The method of claim 4, wherein prior to obtaining the product of the first matrix and the second matrix to obtain the second feature data set, the method further comprises:
centering the first matrix to obtain a fifth matrix;
and obtaining the product of the fifth matrix and the second matrix to obtain the second characteristic data set.
8. The method of claim 1, wherein prior to feature screening the plurality of sets of data using the random forest model to obtain a plurality of sets of target features, the method further comprises:
randomly dividing the multiple groups of data sets for multiple times to obtain multiple groups of training sets and test sets;
training the random forest model by utilizing the multiple groups of training sets;
testing the trained random forest model by using the test set to obtain the total score of the trained random forest model;
Determining whether training of the random forest model is complete based on the total score.
9. The method of claim 1, wherein misuse detection of the second feature data set to obtain a third feature data set comprises:
predicting a plurality of different types of preset models by using the second characteristic data set, and determining the detection rate of the plurality of different types of preset models;
determining a preset model corresponding to the maximum detection rate as a target model;
performing misuse detection on the second characteristic data set by using the target model to obtain a detection result of the second characteristic data set;
and obtaining the third characteristic data set based on the detection result of the second characteristic data set.
10. The method of claim 9, wherein the plurality of different types of preset models comprises: decision tree models, support vector machine models, and naive bayes models.
11. The method of claim 1, wherein after acquiring the first feature data set, the method further comprises:
and formatting the first characteristic data set to obtain a processed first characteristic data set, wherein the types of variables contained in the processed first characteristic data set are the same.
12. An intrusion detection device, comprising:
an acquisition module configured to acquire a first feature data set;
a processing module configured to perform dimension reduction processing on the first feature data set to obtain a second feature data set, wherein the dimension of the second feature data set is smaller than that of the first feature data set; and
a training module configured to train an intrusion detection model using the second feature data set to obtain a trained intrusion detection model, wherein the trained intrusion detection model is used for intrusion detection of data to be detected;
wherein the processing module performing dimension reduction processing on the first feature data set to obtain the second feature data set comprises: dividing the first feature data set using a cross-validation method to generate a plurality of sets of data, wherein any two sets of data are mutually exclusive; performing feature screening on the plurality of sets of data using a random forest model to obtain a plurality of sets of target features, wherein each set of target features comprises a plurality of target features; and performing dimension reduction on the plurality of sets of target features to obtain the second feature data set; and
wherein the training module comprises: a detection unit configured to perform misuse detection on the second feature data set using a plurality of preset models of different types to obtain a third feature data set, wherein feature data contained in the third feature data set represent non-attack data, or normal data; and a second training unit configured to iteratively train a plurality of base classifiers using an ensemble learning algorithm based on the third feature data set, to obtain the trained intrusion detection model.
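The second training unit's "iterative training of a plurality of base classifiers using an ensemble learning algorithm" can be illustrated with boosting, where base classifiers are fitted one after another on reweighted data. The choice of AdaBoost here is an assumption for illustration; the claim does not name a specific ensemble algorithm.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Stand-in for the third feature data set and its labels.
X3, y3 = make_classification(n_samples=300, n_features=10, random_state=2)

# AdaBoost fits up to n_estimators base classifiers iteratively,
# reweighting the training samples after each round.
detector = AdaBoostClassifier(n_estimators=50, random_state=2)
detector.fit(X3, y3)

trained_bases = len(detector.estimators_)   # number of base classifiers fitted
```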
13. A computer-readable storage medium comprising a stored program, wherein, when run, the program controls a device in which the computer-readable storage medium is located to perform the intrusion detection method according to any one of claims 1 to 11.
14. A processor configured to run a program, wherein the program, when run, performs the intrusion detection method according to any one of claims 1 to 11.
CN202011248506.6A 2020-11-10 2020-11-10 Intrusion detection method and device Active CN112437053B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011248506.6A CN112437053B (en) 2020-11-10 2020-11-10 Intrusion detection method and device


Publications (2)

Publication Number Publication Date
CN112437053A CN112437053A (en) 2021-03-02
CN112437053B true CN112437053B (en) 2023-06-30

Family

ID=74699400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011248506.6A Active CN112437053B (en) 2020-11-10 2020-11-10 Intrusion detection method and device

Country Status (1)

Country Link
CN (1) CN112437053B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113645182B (en) * 2021-06-21 2023-07-14 上海电力大学 Denial of service attack random forest detection method based on secondary feature screening
CN113542276B (en) * 2021-07-16 2023-01-24 江苏商贸职业学院 Method and system for detecting intrusion target of hybrid network
CN113836527B (en) * 2021-11-23 2022-02-18 北京微步在线科技有限公司 Intrusion event detection model construction method and device and intrusion event detection method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101399672A (en) * 2008-10-17 2009-04-01 章毅 Intrusion detection method based on fusion of multiple neural networks
CN106878995A (en) * 2017-04-27 2017-06-20 重庆邮电大学 Wireless sensor network anomaly type discrimination method based on sensed data
CN106951778A (en) * 2017-03-13 2017-07-14 步步高电子商务有限责任公司 Intrusion detection method oriented to complex streaming data event analysis
CN107395590A (en) * 2017-07-19 2017-11-24 福州大学 Intrusion detection method based on PCA and random forest classification
CN108712404A (en) * 2018-05-04 2018-10-26 重庆邮电大学 Internet of Things intrusion detection method based on machine learning
CN109818798A (en) * 2019-02-19 2019-05-28 上海海事大学 Wireless sensor network intrusion detection system and method fusing KPCA and ELM
CN110809009A (en) * 2019-12-12 2020-02-18 江苏亨通工控安全研究院有限公司 Two-stage intrusion detection system applied to industrial control networks
CN110825068A (en) * 2019-09-29 2020-02-21 惠州蓄能发电有限公司 Industrial control system anomaly detection method based on PCA-CNN

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10367841B2 (en) * 2016-12-16 2019-07-30 Patternex, Inc. Method and system for learning representations for log data in cybersecurity


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Application of PCA-LSTM in network intrusion detection; Zhang Baohua et al.; Value Engineering; 2020-05-28 (No. 15); full text *
Research on an intrusion detection algorithm based on PCA and random forest classification; Lin Weining et al.; Netinfo Security; 2017-11-10 (No. 11); full text *
Network intrusion detection model based on random forest and XGBoost; Chen Zhuo et al.; Journal of Signal Processing; 2020-07-31 (No. 07); pages 2-5 *

Also Published As

Publication number Publication date
CN112437053A (en) 2021-03-02

Similar Documents

Publication Publication Date Title
CN112437053B (en) Intrusion detection method and device
CN113657545B (en) User service data processing method, device, equipment and storage medium
CN111027069B (en) Malicious software family detection method, storage medium and computing device
CN110704840A (en) Convolutional neural network CNN-based malicious software detection method
US7444279B2 (en) Question answering system and question answering processing method
CN112464638B (en) Text clustering method based on improved spectral clustering algorithm
US8738534B2 (en) Method for providing with a score an object, and decision-support system
CN106060008B Network intrusion anomaly detection method
CN109657011B (en) Data mining system for screening terrorist attack event crime groups
CN111695597B (en) Credit fraud group identification method and system based on improved isolated forest algorithm
CN109190698B (en) Classification and identification system and method for network digital virtual assets
CN115577357A (en) Android malicious software detection method based on stacking integration technology
CN112183652A (en) Edge end bias detection method under federated machine learning environment
CN110929525A (en) Network loan risk behavior analysis and detection method, device, equipment and storage medium
CN115600194A (en) Intrusion detection method, storage medium and device based on XGboost and LGBM
Utami et al. Hoax information detection system using apriori algorithm and random forest algorithm in twitter
Sitorus et al. Sensing trending topics in twitter for greater Jakarta area
CN113807073B (en) Text content anomaly detection method, device and storage medium
CN117035983A (en) Method and device for determining credit risk level, storage medium and electronic equipment
CN117408699A (en) Telecom fraud recognition method based on bank card data
CN115936773A (en) Internet financial black product identification method and system
Wålinder Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis
CN112529319A (en) Grading method and device based on multi-dimensional features, computer equipment and storage medium
CN111581640A (en) Malicious software detection method, device and equipment and storage medium
Holm Machine learning and spending patterns: A study on the possibility of identifying riskily spending behaviour

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant