CN112437053A - Intrusion detection method and device - Google Patents

Intrusion detection method and device

Info

Publication number
CN112437053A
Authority
CN
China
Prior art keywords
data set
feature
sets
matrix
characteristic data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion.)
Granted
Application number
CN202011248506.6A
Other languages
Chinese (zh)
Other versions
CN112437053B (en)
Inventor
周献飞
徐楷
焦建林
董宁
韩盟
徐浩
陈奕倩
Current Assignee
State Grid Corp of China SGCC
State Grid Beijing Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Beijing Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by State Grid Corp of China SGCC and State Grid Beijing Electric Power Co Ltd
Priority to CN202011248506.6A
Publication of CN112437053A
Application granted
Publication of CN112437053B
Legal status: Active
Anticipated expiration

Classifications

    • H04L 63/1416: Network architectures or network communication protocols for network security; detecting or protecting against malicious traffic by monitoring network traffic; event detection, e.g. attack signature detection
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2411: Pattern recognition; classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/24155: Pattern recognition; classification techniques based on parametric or probabilistic models; Bayesian classification
    • G06F 18/24323: Pattern recognition; classification techniques relating to the number of classes; tree-organised classifiers

Abstract

The invention discloses an intrusion detection method and device. The method comprises: acquiring a first feature data set; performing dimensionality reduction on the first feature data set to obtain a second feature data set, wherein the dimensionality of the second feature data set is smaller than that of the first feature data set; and training an intrusion detection model with the second feature data set to obtain a trained intrusion detection model, which is used to perform intrusion detection on data to be detected. The invention solves the technical problem of low data detection accuracy in the related art.

Description

Intrusion detection method and device
Technical Field
The invention relates to the field of data processing, in particular to an intrusion detection method and device.
Background
With the development of the internet, ever more data is connected and exchanged, and the accompanying malicious intrusions increasingly threaten computers and devices of all kinds, so intrusion detection on data is required. When an existing intrusion detection system encounters a large amount of high-dimensional data, it typically runs into the curse of dimensionality, which lowers the accuracy of data detection. In addition, existing intrusion detection systems cannot identify unknown attacks during detection and therefore fail to report them, further lowering accuracy. The data detection accuracy of existing intrusion detection systems is therefore low.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides an intrusion detection method and device, which at least solve the technical problem of low accuracy of data detection in the related technology.
According to an aspect of an embodiment of the present invention, there is provided an intrusion detection method, including: acquiring a first characteristic data set; performing dimensionality reduction processing on the first characteristic data set to obtain a second characteristic data set, wherein the dimensionality of the second characteristic data set is smaller than that of the first characteristic data set; and training the intrusion detection model by utilizing the second characteristic data set to obtain a trained intrusion detection model, wherein the trained intrusion detection model is used for carrying out intrusion detection on data to be detected.
Optionally, the performing dimension reduction processing on the first feature data set to obtain a second feature data set includes: dividing the first characteristic data set by using a cross verification method to generate a plurality of groups of data sets, wherein any two groups of data sets have a mutual exclusion relationship; and carrying out feature screening on the multiple groups of data sets by utilizing a random forest model to obtain multiple groups of target feature sets, wherein each group of target feature sets comprises: a plurality of target features; and performing dimension reduction processing on the multiple groups of target feature sets to obtain a second feature data set.
Optionally, the performing feature screening on the multiple sets of data sets by using a random forest model to obtain multiple sets of target feature sets includes: predicting the multiple groups of data sets by using a random forest model to obtain a score value of each original characteristic contained in the multiple groups of original characteristic sets, wherein the score value is used for representing the importance degree of each original characteristic; obtaining a score mean value of each original feature based on the score value of each original feature contained in the multiple groups of original feature sets; and determining a plurality of groups of target feature sets based on the score mean of each original feature.
Optionally, determining the plurality of sets of target feature sets based on the score mean of each raw feature comprises: sorting the original features in ascending order according to the score mean of each original feature; and taking the first preset number of features from the front of the sorted list to obtain the plurality of target features.
Optionally, the performing dimension reduction processing on the multiple sets of target feature sets to obtain a second feature data set includes: constructing a first matrix based on the multiple groups of target feature sets; acquiring a covariance matrix of the first matrix; determining a second matrix based on the covariance matrix; and acquiring the product of the first matrix and the second matrix to obtain a second characteristic data set.
Optionally, determining the second matrix based on the covariance matrix comprises: obtaining an eigenvalue and an eigenvector of a covariance matrix; sorting the eigenvectors according to the magnitude of the eigenvalues to generate a third matrix; and acquiring a second preset number of row matrixes at the forefront in the third matrix to generate a second matrix.
Optionally, before obtaining the covariance matrix of the first matrix, the method further includes: carrying out zero equalization processing on the first matrix to obtain a fourth matrix; and acquiring a covariance matrix of the fourth matrix.
Optionally, before obtaining the product of the first matrix and the second matrix to obtain the second feature data set, the method further includes: performing centralization processing on the first matrix to obtain a fifth matrix; and acquiring the product of the fifth matrix and the second matrix to obtain a second characteristic data set.
Optionally, before feature screening is performed on multiple sets of data sets by using a random forest model to obtain multiple sets of target feature sets, the method further includes: dividing a plurality of groups of data sets randomly for a plurality of times to obtain a plurality of groups of training sets and test sets; training the random forest model by using a plurality of groups of training sets; testing the trained random forest model by using the test set to obtain the total score of the trained random forest model; determining whether training of the random forest model is completed based on the total score.
Optionally, the method for training the intrusion detection model by using the second feature data set includes: carrying out misuse detection on the second characteristic data set to obtain a third characteristic data set, wherein the characteristic data contained in the third characteristic data set is used for representing non-attack data or normal data; and performing iterative training on the plurality of base classifiers by using an ensemble learning algorithm based on the third feature data set to obtain a trained intrusion detection model.
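The patent does not name the ensemble learning algorithm used for the iterative training of the base classifiers. As one hedged illustration only, iterative re-weighting of weak base classifiers in this style can be sketched as a minimal AdaBoost over decision stumps; all function names here are illustrative, not from the patent:

```python
import numpy as np

def train_adaboost(X, y, n_rounds=10):
    """Iteratively train weighted decision stumps (base classifiers).

    y must be in {-1, +1}. Returns a list of (feature, threshold,
    polarity, alpha) tuples forming the ensemble.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)            # sample weights, updated each round
    ensemble = []
    for _ in range(n_rounds):
        best = None                     # (err, feat, thr, pol, pred)
        for feat in range(d):
            for thr in np.unique(X[:, feat]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, feat] - thr) > 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, feat, thr, pol, pred)
        err, feat, thr, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this classifier
        w *= np.exp(-alpha * y * pred)          # re-weight the samples
        w /= w.sum()
        ensemble.append((feat, thr, pol, alpha))
    return ensemble

def predict_adaboost(ensemble, X):
    """Weighted vote of all base classifiers."""
    score = np.zeros(len(X))
    for feat, thr, pol, alpha in ensemble:
        score += alpha * np.where(pol * (X[:, feat] - thr) > 0, 1, -1)
    return np.where(score >= 0, 1, -1)
```

Any boosting or bagging scheme with per-round re-weighting would fit the "iterative training on a plurality of base classifiers" described above equally well.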
Optionally, performing misuse detection on the second feature data set to obtain a third feature data set includes: evaluating a plurality of preset models of different types with the second feature data set, and determining the detection rate of each of the different types of preset models; determining the preset model corresponding to the maximum detection rate as the target model; performing misuse detection on the second feature data set with the target model to obtain a detection result of the second feature data set; and obtaining the third feature data set based on the detection result of the second feature data set.
Optionally, the plurality of different types of preset models includes: decision tree models, support vector machine models and naive Bayes models.
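A minimal sketch of the model-selection step just described, keeping whichever candidate achieves the highest detection rate on a validation split, could look as follows. `select_target_model` and `ConstantModel` are hypothetical names; in practice the decision-tree, support-vector-machine, and naive-Bayes candidates named above would be passed in as any objects exposing a `predict` method:

```python
import numpy as np

class ConstantModel:
    """Stand-in candidate with the same predict() surface as a decision
    tree, SVM, or naive Bayes classifier (illustrative only)."""
    def __init__(self, label):
        self.label = label
    def predict(self, X):
        return np.full(len(X), self.label)

def select_target_model(models, X_val, y_val):
    """Keep the candidate whose detection rate (share of attack samples,
    label 1, that are flagged as attacks) is highest on the validation split."""
    best_name, best_rate = None, -1.0
    for name, model in models.items():
        pred = model.predict(X_val)
        attacks = y_val == 1
        rate = float((pred[attacks] == 1).mean()) if attacks.any() else 0.0
        if rate > best_rate:
            best_name, best_rate = name, rate
    return best_name, best_rate
```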
Optionally, after acquiring the first feature data set, the method further comprises: and formatting the first characteristic data set to obtain a processed first characteristic data set, wherein the types of variables contained in the processed first characteristic data set are the same.
According to another aspect of the embodiments of the present invention, there is also provided an intrusion detection apparatus, including: an obtaining module, configured to obtain a first feature data set; the processing module is used for performing dimensionality reduction processing on the first characteristic data set to obtain a second characteristic data set, wherein the dimensionality of the second characteristic data set is smaller than that of the first characteristic data set; and the training module is used for training the intrusion detection model by utilizing the second characteristic data set to obtain a trained intrusion detection model, wherein the trained intrusion detection model is used for carrying out intrusion detection on the data to be detected.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, which includes a stored program, wherein when the program runs, the apparatus on which the computer-readable storage medium is located is controlled to execute the intrusion detection method.
According to another aspect of the embodiments of the present invention, there is also provided a processor, configured to execute a program, where the program executes the intrusion detection method described above.
In the embodiment of the invention, a first feature data set is obtained first; dimensionality reduction is then performed on the first feature data set to obtain a second feature data set whose dimensionality is smaller than that of the first; finally, the intrusion detection model is trained with the second feature data set to obtain a trained intrusion detection model, which is used to perform intrusion detection on data to be detected. Performing dimensionality reduction on the first feature data set avoids the curse of dimensionality and thus improves the accuracy of data detection. In addition, training the intrusion detection model in real time on the acquired feature data sets lets the model detect unknown data attacks in time rather than failing to report them, further improving detection accuracy, thereby solving the technical problem of low data detection accuracy in the related art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of an intrusion detection method according to an embodiment of the invention;
FIG. 2 is a flow chart of another intrusion detection method according to an embodiment of the invention;
FIG. 3 is a schematic diagram of an intrusion detection device according to an embodiment of the invention;
fig. 4 is a schematic diagram of another intrusion detection device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present invention, there is provided a method embodiment of intrusion detection, it being noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
Fig. 1 is a flowchart of an intrusion detection method according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:
step S102, a first characteristic data set is obtained.
The first feature data set in the above step is a data set for network intrusion detection, and may be at least one of the following: a dataset based on network traffic, a grid-based dataset, a dataset based on internet traffic, a dataset based on virtual private networks, a dataset based on android applications, a dataset based on internet of things (IoT) traffic, or a dataset based on internet-connected devices. For example, the dataset based on network traffic may be the DARPA 1998 dataset, the KDD Cup 1999 dataset, the NSL-KDD dataset, or the UNSW-NB15 dataset.
And step S104, performing dimension reduction processing on the first characteristic data set to obtain a second characteristic data set.
Wherein the dimensionality of the second feature data set is smaller than that of the first feature data set.
In an alternative embodiment, reducing the dimensionality of the feature data avoids the curse of dimensionality while retaining the maximum amount of information in the feature data set; lowering the dimensionality reduces the amount of computation exponentially and thereby reduces the complexity of processing the feature data.
And step S106, training the intrusion detection model by using the second characteristic data set to obtain the trained intrusion detection model.
The trained intrusion detection model is used for carrying out intrusion detection on data to be detected.
In an optional embodiment, the intrusion detection model is trained in real time by using the feature data set after the dimension reduction, so that the intrusion detection model can detect unknown data attacks in time, the intrusion data can be more accurately predicted, and the effect of reducing the false alarm rate of the intrusion detection model is achieved.
Through the above embodiment of the invention, a first feature data set is obtained; its dimensionality is reduced to obtain a second feature data set of smaller dimensionality; and the intrusion detection model is trained with the second feature data set to obtain a trained model used for intrusion detection of data to be detected. Reducing the dimensionality of the first feature data set avoids the curse of dimensionality, improving the accuracy of data detection. Moreover, training the intrusion detection model in real time on the acquired feature data sets allows unknown data attacks to be detected in time instead of going unreported, further improving accuracy and solving the technical problem of low data detection accuracy in the related art.
Optionally, the performing dimension reduction processing on the first feature data set to obtain a second feature data set includes: dividing the first characteristic data set by using a cross verification method to generate a plurality of groups of data sets, wherein any two groups of data sets have a mutual exclusion relationship; and carrying out feature screening on the multiple groups of data sets by utilizing a random forest model to obtain multiple groups of target feature sets, wherein each group of target feature sets comprises: a plurality of target features; and performing dimension reduction processing on the multiple groups of target feature sets to obtain a second feature data set.
The cross-validation method in the above steps, also called loop estimation, is a practical method to cut the data samples into smaller subsets.
The random forest in the above steps is a classifier comprising a plurality of decision trees, and the output class of the classifier is determined by the mode of the class output by the individual trees.
In an alternative embodiment, the cross-validation method is used to divide the processed data set into mutually exclusive training subsets and to generate multiple training-set/test-set pairs, avoiding the error introduced by a single test. The divided data sets are tested with a random forest: for each training-set/test-set pair the contribution rates of the features are obtained and then averaged. From the features with smaller average contribution rates, a few features with a certain correlation are selected, and principal component analysis (PCA) is then used to re-form them into a smaller set of new, mutually uncorrelated feature data that replaces the original feature data set, so that the new features reflect the information represented by the original features to the maximum extent while the information across indexes does not overlap. The newly obtained features replace the original multi-dimensional features to give a new data set, which not only has lower feature dimensionality but also ensures that each dimension carries more information.
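The pipeline just described (mutually exclusive folds, averaged per-fold feature contributions, then PCA over the low-contribution features) can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: `importance_fn` stands in for the random-forest scorer, and all names and parameters are assumptions:

```python
import numpy as np

def k_mutually_exclusive_folds(X, k, seed=0):
    """Split row indices into k mutually exclusive subsets
    (the cross-validation partition described above)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    return np.array_split(idx, k)

def reduce_features(X, importance_fn, k=5, m=4, u=2):
    """Average per-fold feature contributions, take the m least important
    features, replace them with their first u principal components, and
    keep the remaining features unchanged."""
    folds = k_mutually_exclusive_folds(X, k)
    scores = np.mean([importance_fn(X[f]) for f in folds], axis=0)
    low = np.argsort(scores)[:m]                  # m least important features
    keep = np.setdiff1d(np.arange(X.shape[1]), low)
    sub = X[:, low]
    sub = sub - sub.mean(axis=0)                  # centre before PCA
    cov = np.cov(sub, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)              # eigenvalues ascending
    p = vecs[:, ::-1][:, :u]                      # top-u principal directions
    return np.hstack([X[:, keep], sub @ p])
```

In practice `importance_fn` would be a random-forest importance scorer; any function mapping a data subset to one score per column fits the sketch.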
Optionally, the performing feature screening on the multiple sets of data sets by using a random forest model to obtain multiple sets of target feature sets includes: predicting the multiple groups of data sets by using a random forest model to obtain a score value of each original characteristic contained in the multiple groups of original characteristic sets, wherein the score value is used for representing the importance degree of each original characteristic; obtaining a score mean value of each original feature based on the score value of each original feature contained in the multiple groups of original feature sets; and determining a plurality of groups of target feature sets based on the score mean of each original feature.
In an alternative embodiment, a random forest model may be used to predict on the multiple data sets, so as to obtain how much each feature contributes to each tree in the random forest; an average value is then taken, and finally the contributions of the features are compared. The contribution may be measured using the Gini index or the out-of-bag (OOB) error rate as the evaluation index. By comparing the contributions of the features, a feature with a larger contribution value may be used as a feature in a target feature set, and a feature with a smaller contribution value may be removed.
For example, the score mean of each original feature can be obtained using the Gini index. Let VIM denote the variable importance measure, and suppose there are c features x_1, x_2, x_3, ..., x_c. The Gini index score of each feature x_j is

$$VIM_j^{(Gini)} = \frac{1}{n}\sum_{i=1}^{n} VIM_{ij}^{(Gini)},$$

i.e. the average change in node splitting purity contributed by feature j across all decision trees.

The Gini index of a node m is calculated as:

$$GI_m = \sum_{k=1}^{K} p_{mk}\,(1 - p_{mk}) = 1 - \sum_{k=1}^{K} p_{mk}^2,$$

where K denotes the number of classes and p_{mk} the sample proportion of class k at node m. The importance of feature x_j at node m, i.e. the change in the Gini index before and after node m branches, is:

$$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r,$$

where GI_l and GI_r denote the Gini indexes of the two new nodes after branching. If the nodes at which feature x_j appears in decision tree i form the set M, then the importance of x_j in the i-th tree is:

$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)}.$$

Averaging over the n trees in total then gives VIM_j^{(Gini)} as defined above.

Finally, all the calculated importance scores are normalized, as shown in Table 1:

$$VIM_j = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{c} VIM_{j'}^{(Gini)}},$$

where the denominator is the sum of the importance gains of all features and the numerator is the Gini importance of feature j.
TABLE 1
Feature | Contribution-rate mean | Feature | Contribution-rate mean | Feature | Contribution-rate mean
dur | 0.06789 | dloss | 0.0095 | trans_depth | 0.00208
proto | 0.0168 | sinpkt | 0.01326 | response_body_len | 0.00425
service | 0.02767 | dinpkt | 0.02406 | ct_srv_src | 0.02814
state | 0.01578 | sjit | 0.00723 | ct_state_ttl | 0.05433
spkts | 0.00879 | djit | 0.00784 | ct_dst_ltm | 0.01183
dpkts | 0.04487 | swin | 0.01366 | ct_src_dport_ltm | 0.01409
sbytes | 0.07697 | stcpb | 0.00479 | ct_dst_sport_ltm | 0.04049
dbytes | 0.01584 | dtcpb | 0.00481 | ct_dst_src_ltm | 0.09199
rate | 0.01298 | dwin | 0.00075 | is_ftp_login | 0.00014
sttl | 0.01925 | tcprtt | 0.04222 | ct_ftp_cmd | 0.0001
dttl | 0.0884 | synack | 0.02442 | ct_flw_http_mthd | 0.00231
sload | 0.01486 | ackdat | 0.01299 | ct_src_ltm | 0.00852
dload | 0.01313 | smean | 0.02917 | ct_srv_dst | 0.05406
sloss | 0.02778 | dmean | 0.04264 | is_sm_ips_ports | 0.00416
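The quantities entering the formulas above (the Gini index of a node, the per-node importance change, and the final normalization that yields the contribution-rate means of Table 1) can be sketched with NumPy; the helper names are illustrative, not from the patent:

```python
import numpy as np

def gini_index(labels):
    """GI = 1 - sum_k p_k^2 for the label distribution at one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def node_importance(parent, left, right):
    """Change in the Gini index before and after one split:
    VIM = GI_parent - GI_left - GI_right, as in the formulas above."""
    return gini_index(parent) - gini_index(left) - gini_index(right)

def normalise(scores):
    """Normalise raw per-feature Gini gains so they sum to 1, matching
    the contribution-rate means listed in Table 1."""
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum()
```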
Optionally, determining the plurality of sets of target feature sets based on the score mean of each raw feature comprises: sorting the original features in ascending order according to the score mean of each original feature; and taking the first preset number of features from the front of the sorted list to obtain the plurality of target features.
The first preset number in the above steps may be set by a user, and the plurality of target features are features that need to be subjected to dimension reduction processing.
In an alternative embodiment, the feature importance scores VIM_j derived from each of the small data sets described above may be averaged, and the m features with smaller feature importance scores are selected for dimensionality reduction.
Optionally, the performing dimension reduction processing on the multiple sets of target feature sets to obtain a second feature data set includes: constructing a first matrix based on the multiple groups of target feature sets; acquiring a covariance matrix of the first matrix; determining a second matrix based on the covariance matrix; and acquiring the product of the first matrix and the second matrix to obtain a second characteristic data set.
In an alternative embodiment, the first matrix may be a matrix X and the second matrix a matrix P. The dimensionality reduction proceeds by reducing a samples of m-dimensional data: the raw data are arranged by columns into an m-row, a-column matrix X; each row of X is zero-averaged, i.e. the mean of that row is subtracted; the covariance matrix is obtained, together with its eigenvalues and corresponding eigenvectors r; the eigenvectors r are arranged into a matrix from top to bottom as rows, ordered by the size of the corresponding eigenvalues, and the first u rows form the matrix P. Multiplying the matrix formed by the u eigenvectors with the centralized data matrix then gives the data reduced to u dimensions. The error after compression can be expressed as

$$error = \frac{\sum_{i=u+1}^{m} \lambda_i}{\sum_{i=1}^{m} \lambda_i},$$

where the λ_i are the eigenvalues sorted in descending order and u is the number of features after dimensionality reduction. A threshold, such as 0.01, is then determined, and reducing to u dimensions is considered acceptable when error is below that threshold. Replacing the m selected features with the new u-dimensional features finally yields a new data set whose feature count equals the original count minus m plus u, namely the second feature data set, which is used for intrusion detection.
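A minimal NumPy sketch of these PCA steps (zero-averaging, covariance, eigendecomposition, and choosing the smallest u whose compression error stays below the threshold) might read, under the eigenvalue-ratio form of the error measure given above:

```python
import numpy as np

def pca_reduce(X, threshold=0.01):
    """Reduce the columns of X to the smallest number of principal
    components u such that the compression error
    (tail eigenvalue mass / total eigenvalue mass) is below `threshold`."""
    Xc = X - X.mean(axis=0)                  # zero-average each column
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]   # largest eigenvalues first
    total = vals.sum()
    for u in range(1, len(vals) + 1):
        error = vals[u:].sum() / total       # variance lost by truncating at u
        if error < threshold:
            break
    return Xc @ vecs[:, :u], error, u        # centred data times top-u vectors
```

Here the projection `Xc @ vecs[:, :u]` is the product of the centralized data matrix and the matrix formed from the top-u eigenvectors described in the text.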
Optionally, determining the second matrix based on the covariance matrix comprises: obtaining an eigenvalue and an eigenvector of a covariance matrix; sorting the eigenvectors according to the magnitude of the eigenvalues to generate a third matrix; and acquiring a second preset number of row matrixes at the forefront in the third matrix to generate a second matrix.
In an alternative embodiment, the eigenvectors of the covariance matrix may be denoted r, the third matrix may be the matrix q, and the second matrix the matrix p. The eigenvalues of the covariance matrix and the corresponding eigenvectors r are obtained; the eigenvectors r are arranged into a matrix q from top to bottom according to the magnitudes of the corresponding eigenvalues, and the first k rows of q are taken to form the matrix p.
Optionally, before obtaining the covariance matrix of the first matrix, the method further includes: carrying out zero equalization processing on the first matrix to obtain a fourth matrix; and acquiring a covariance matrix of the fourth matrix.
The zero-averaging processing in the above step subtracts the mean of each variable; it is in effect a translation, after which the centre of all the data lies at (0, 0). Zero-averaging cancels the errors caused by differing dimensions, intrinsic variation, or large differences in magnitude.
In an alternative embodiment, zero-averaging may be performed on each row of data in the first matrix, that is, the mean value of each row is subtracted from the data of each row to obtain a fourth matrix, and a covariance matrix of the fourth matrix may be obtained.
Optionally, before obtaining the product of the first matrix and the second matrix to obtain the second feature data set, the method further includes: performing centralization processing on the first matrix to obtain a fifth matrix; and acquiring the product of the fifth matrix and the second matrix to obtain a second characteristic data set.
The centralization treatment in the steps has the same effect as zero averaging treatment, and errors caused by different dimensions, self variation or larger numerical value difference can be eliminated.
In an alternative embodiment, each data in the first matrix may be zero-averaged, that is, the average value of all data is subtracted from each data to obtain a fifth matrix, and a product of the fifth matrix and the second matrix may be obtained to obtain the second feature data set.
Optionally, before feature screening is performed on multiple sets of data sets by using a random forest model to obtain multiple sets of target feature sets, the method further includes: dividing a plurality of groups of data sets randomly for a plurality of times to obtain a plurality of groups of training sets and test sets; training the random forest model by using a plurality of groups of training sets; testing the trained random forest model by using the test set to obtain the total score of the trained random forest model; determining whether training of the random forest model is completed based on the total score.
In an alternative embodiment, k-fold cross-validation may be used to split the multiple sets of data into smaller data sets. The k-fold cross-validation method randomly divides the sample data into k parts, selects k-1 parts as the training set each time, and uses the remaining 1 part as the test set; after the first division is completed, k-1 parts can be randomly selected again to train on, so that k training data sets and k test data sets are obtained.
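The k-fold division described above can be sketched as follows (a generic index-based split written for illustration, not the patent's exact procedure):

```python
import random

def k_fold_split(n_samples, k, seed=0):
    """Randomly partition sample indices into k disjoint folds; each round
    uses k-1 folds as the training set and the remaining fold as the test set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    rounds = []
    for i in range(k):
        train = [j for fi, fold in enumerate(folds) if fi != i for j in fold]
        rounds.append((train, folds[i]))
    return rounds

# 10 samples split into k = 5 rounds of (train, test) index lists.
rounds = k_fold_split(10, 5)
```

Each of the k rounds yields a mutually exclusive train/test pair, matching the "any two groups of data sets have a mutual exclusion relationship" requirement of claim 2.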
The process of training the random forest model with the multiple sets of training sets can be as follows: select n samples from the sample set by sampling with replacement, and generate a decision tree from the sampled set; at each node of the decision tree, randomly select d features, divide the sample set with each of the d features, and keep the optimal division feature; repeat this process m times to generate m decision trees, where m is the number of decision trees in the random forest. The random forest obtained by training then predicts the test samples, and the predicted result is determined by voting.
Wherein, on {A_2, A_3, A_4, …, A_k}, a random forest model M_1 is constructed and tested on data set A_1; the predicted values are compared with the true values, and a score a_1 is calculated under a chosen evaluation criterion.
On {A_1, A_3, A_4, …, A_k}, model M_1 is constructed and verified on data set A_2; the predicted values are compared with the true values, and a score a_2 is calculated under the same evaluation criterion.
…
On {A_1, A_2, A_3, …, A_(k-1)}, a model is constructed and verified on data set A_k; the predicted values are compared with the true values, and a score a_k is calculated under the same evaluation criterion.
a = (a_1 + a_2 + … + a_k) / k is taken as the composite score of model M_1.
Here A_1, A_2, A_3, …, A_k respectively denote the k data sets obtained by the k-fold cross-validation method, and M_1 denotes the trained random forest model. In each of the obtained scores a_1, a_2, …, a_k, each feature has a different importance.
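The per-fold scoring and averaging above can be sketched with hypothetical per-fold feature importances and scores (the numbers below are illustrative placeholders, not experimental results from the patent):

```python
import numpy as np

# Hypothetical importance scores for 4 features over k = 3 folds
# (in practice these would come from a trained random forest per fold).
fold_importances = np.array([
    [0.10, 0.40, 0.30, 0.20],
    [0.15, 0.35, 0.25, 0.25],
    [0.05, 0.45, 0.35, 0.15],
])

# Score mean of each original feature across the k folds.
mean_importance = fold_importances.mean(axis=0)

# Composite model score: average of the k per-fold scores a_1..a_k.
fold_scores = np.array([0.91, 0.88, 0.93])
composite = fold_scores.mean()
```

The feature score means are what the later ascending sort of claim 4 operates on; the composite score decides whether random forest training is complete.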
Optionally, the method for training the intrusion detection model by using the second feature data set includes: carrying out misuse detection on the second characteristic data set to obtain a third characteristic data set, wherein the characteristic data contained in the third characteristic data set is used for representing non-attack data or normal data; and performing iterative training on the plurality of base classifiers by using an ensemble learning algorithm based on the third feature data set to obtain a trained intrusion detection model.
Misuse detection in the above steps is a method for detecting computer attacks; known attacks can simply be added to the model, so the detection has a low false-alarm rate and high efficiency.
In an alternative embodiment, misuse detection may be performed on the second feature data set; after the misuse detection there will generally be some non-attack data not in the second feature data set, and these data can be extracted and combined with the second feature data set to generate a new feature data set, i.e., the third feature data set.
In an alternative embodiment, the process of iteratively training the plurality of base classifiers with the ensemble learning algorithm may be as follows. First, the weight distribution of the training data is initialized: w_1i = 1/N, i = 1, 2, …, N. Then, for m = 1, 2, …, M, the training data set is learned with weight distribution D_m to obtain a base classifier G_m(x), and the classification error rate of G_m(x) on the training set is calculated:

e_m = Σ_(i=1..N) w_mi · I(G_m(x_i) ≠ y_i)

The coefficient of G_m(x) is calculated as a_m = (1/2) · log((1 − e_m)/e_m), and the weight distribution of the training data set is then updated (z_m is a normalization factor that makes D_(m+1) a probability distribution):

w_(m+1,i) = (w_mi / z_m) · exp(−a_m · y_i · G_m(x_i)),  z_m = Σ_(i=1..N) w_mi · exp(−a_m · y_i · G_m(x_i))

A linear combination of the base classifiers is constructed to obtain the final classifier:

G(x) = sign(f(x)) = sign(Σ_(m=1..M) a_m · G_m(x))

Finally, the result is predicted with the final classifier, which is the trained intrusion detection model.
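The AdaBoost iteration described above can be sketched from scratch as follows. One-feature threshold "stumps" stand in for the base classifiers (an assumption for illustration; the patent does not prescribe a specific base classifier):

```python
import numpy as np

def adaboost_train(X, y, M=10):
    """AdaBoost sketch with threshold stumps as base classifiers.
    Labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # initial weights w_1i = 1/N
    stumps, alphas = [], []
    for _ in range(M):
        best = None
        # pick the stump minimizing the weighted error e_m
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] <= thr, 1, -1)
                    err = float(np.sum(w[pred != y]))
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, pred)
        err, j, thr, sign, pred = best
        err = max(err, 1e-10)                  # avoid log(0) on a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)  # a_m = 1/2 log((1-e_m)/e_m)
        w = w * np.exp(-alpha * y * pred)      # reweight: errors gain weight
        w /= w.sum()                           # z_m normalization
        stumps.append((j, thr, sign))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # G(x) = sign(sum_m a_m G_m(x))
    f = np.zeros(len(X))
    for (j, thr, sign), a in zip(stumps, alphas):
        f += a * sign * np.where(X[:, j] <= thr, 1, -1)
    return np.sign(f)
```

Misclassified samples gain weight at each round via the exp(−a_m y_i G_m(x_i)) factor, which is exactly the "higher weight to the samples with classification errors" behavior described for the anomaly-detection stage.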
Optionally, performing misuse detection on the second characteristic data set to obtain a third characteristic data set includes: making predictions with a plurality of different types of preset models using the second characteristic data set, and determining the detection rates of the plurality of different types of preset models; determining the preset model corresponding to the maximum detection rate as the target model; performing misuse detection on the second characteristic data set with the target model to obtain a detection result of the second characteristic data set; and obtaining the third characteristic data set based on the detection result of the second characteristic data set.
Optionally, the plurality of different types of preset models includes: decision tree models, support vector machine models and naive Bayes models.
When a naive Bayes model is employed, assume the classification model samples are:

(x_1^(1), x_2^(1), …, x_n^(1), y_1), (x_1^(2), x_2^(2), …, x_n^(2), y_2), …, (x_1^(m), x_2^(m), …, x_n^(m), y_m)

That is, there are m samples, each sample has n features, and the feature output has K categories, defined as C_1, C_2, …, C_K. The prior distribution of naive Bayes, P(Y = C_k) (k = 1, 2, …, K), is obtained from sample learning; then the conditional probability distribution P(X = x | Y = C_k) = P(X_1 = x_1, X_2 = x_2, …, X_n = x_n | Y = C_k) is learned; then the joint distribution P(X, Y) of X and Y is obtained using the Bayes formula:

P(X, Y = C_k) = P(Y = C_k) · P(X = x | Y = C_k) = P(Y = C_k) · P(X_1 = x_1 | Y = C_k) · P(X_2 = x_2 | Y = C_k) … P(X_n = x_n | Y = C_k)

By maximum likelihood, P(Y = C_k) is found as the frequency of category C_k in the training set, and the category corresponding to the maximum joint probability is the prediction of naive Bayes.
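A minimal sketch of this prior/conditional estimation and maximum-probability prediction follows. Gaussian per-feature likelihoods are an assumption for continuous data; the patent does not fix the form of P(X_j = x_j | Y = C_k):

```python
import numpy as np

class GaussianNB:
    """Naive Bayes sketch: class priors from training frequencies,
    per-feature Gaussian likelihoods, prediction by maximum joint probability."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior, self.mu, self.var = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.prior[c] = len(Xc) / len(X)      # P(Y = C_k) by frequency
            self.mu[c] = Xc.mean(axis=0)
            self.var[c] = Xc.var(axis=0) + 1e-9   # variance smoothing
        return self

    def predict(self, X):
        preds = []
        for x in X:
            scores = {}
            for c in self.classes:
                # log P(Y=C_k) + sum_j log P(X_j = x_j | Y = C_k)
                loglik = -0.5 * np.sum(np.log(2 * np.pi * self.var[c])
                                       + (x - self.mu[c]) ** 2 / self.var[c])
                scores[c] = np.log(self.prior[c]) + loglik
            preds.append(max(scores, key=scores.get))
        return np.array(preds)
```

Working in log space avoids underflow when multiplying many small conditional probabilities, but the argmax is the same as maximizing the joint probability above.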
When a support vector machine model is used, the classification function is

f(x) = sign( Σ_(i=1..l) a_i · y_i · K(x_i, x) + b )

where l denotes the number of training samples, x denotes the vector of the instance to be classified, x_i and y_i denote the attribute vector and class label of the i-th sample, K(x_i, x) denotes the kernel function, and a_i and b denote model parameters. The a_i are obtained by quadratic programming, from which w and b follow, giving the classification model g(x) = w·x + b; x belongs to different categories when g(x) > 0 and when g(x) < 0, and the separating plane with the largest margin between the two categories of objects is selected.
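The classification function can be evaluated directly once the support vectors and multipliers are known. The values below are hypothetical stand-ins: in practice the a_i and b come from solving the quadratic programme, and the RBF kernel is one common choice among many:

```python
import numpy as np

def rbf_kernel(xi, x, gamma=0.5):
    # Gaussian (RBF) kernel K(x_i, x)
    return np.exp(-gamma * np.sum((np.asarray(xi) - np.asarray(x)) ** 2))

def svm_decision(x, support_X, support_y, a, b, gamma=0.5):
    """g(x) = sum_i a_i * y_i * K(x_i, x) + b; the sign of g gives the class."""
    g = sum(a_i * y_i * rbf_kernel(x_i, x, gamma)
            for x_i, y_i, a_i in zip(support_X, support_y, a))
    return g + b

# Hypothetical support vectors and multipliers (illustration only).
support_X = np.array([[0.0, 0.0], [4.0, 4.0]])
support_y = np.array([1, -1])
a = np.array([1.0, 1.0])
b = 0.0
```

Points near the positive support vector yield g(x) > 0 and points near the negative one yield g(x) < 0, matching the sign rule in the classification function.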
When the decision tree model is adopted, attributes are selected according to the Gini index, the information gain, or the information gain ratio, and branches are then built from top to bottom according to these attributes until all samples at a node belong to the same class or the number of samples at a node falls below a given value. Overfitting is prevented by pre-pruning, post-pruning, or a combination of the two, yielding the final model.
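As a reference for the attribute-selection criterion, the Gini index of a label set and of a candidate binary split can be computed as follows (a generic CART-style sketch, not code from the patent):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(left, right):
    """Weighted Gini index of a binary split; the attribute/threshold
    giving the smallest value is preferred when growing the tree."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A pure node has impurity 0, so splitting stops once every sample at a node belongs to the same class, as described above.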
In an optional embodiment, the three models obtained are used to predict the second characteristic data set respectively, and the model with the highest test detection rate is selected as the target model, reducing the missed-report rate of the detection process; the non-attack and normal data from this process are then extracted as a new data set, i.e., the third characteristic data set.
Optionally, after acquiring the first feature data set, the method further comprises: and formatting the first characteristic data set to obtain a processed first characteristic data set, wherein the types of variables contained in the processed first characteristic data set are the same.
In an alternative embodiment, the formatting may be numerical processing: since the first feature data set contains both numeric variables and character variables, the first feature data set needs to be numericalized in a unified manner to facilitate its subsequent processing.
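A minimal sketch of this unified numerical formatting maps each character-type variable to integer codes. The field names below are hypothetical (intrusion records of the KDD style mix protocol strings with numeric counters):

```python
def numericalize(records, char_fields):
    """Replace character-type values with integer codes so that every
    variable in the processed feature set has the same (numeric) type."""
    codebooks = {f: {} for f in char_fields}
    out = []
    for rec in records:
        rec = dict(rec)  # copy; leave the input untouched
        for f in char_fields:
            book = codebooks[f]
            rec[f] = book.setdefault(rec[f], len(book))  # first-seen coding
        out.append(rec)
    return out, codebooks

# Hypothetical records: one character field, one numeric field.
records = [{"protocol": "tcp", "bytes": 181},
           {"protocol": "udp", "bytes": 105},
           {"protocol": "tcp", "bytes": 239}]
processed, books = numericalize(records, ["protocol"])
```

After this step every variable is numeric, so the matrix construction and PCA of the preceding steps can operate on the whole feature set uniformly.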
The left side of Table 2 shows the accuracy of several machine learning algorithms when misuse detection is performed after feature processing; by comparison, the decision tree can be selected for misuse detection. The right side of Table 2 shows the accuracy without feature processing. Combining the experimental results, performing misuse detection with the decision tree after feature processing improves the accuracy of the test set when detection is subsequently performed with the ensemble algorithm.
TABLE 2
[Table 2 image not reproduced: detection accuracies of the algorithms with feature processing (left) and without feature processing (right)]
A preferred embodiment of the present invention will be described in detail with reference to fig. 2 to 3. As shown in fig. 2, the method may include the following steps:
step S201, dividing a data set to obtain a plurality of mutually exclusive training test sets;
optionally, the data set may be first unified and digitized for subsequent use.
Optionally, a cross-validation method may be adopted to divide the processed data set into mutually exclusive training subsets, and generate multiple sets of training sets and test sets, so that errors caused by one-time testing can be avoided.
As shown in fig. 3, the data set is divided into training test set 1, training test set 2, …, and training test set N.
Step S202, testing by using a random forest based on the divided data sets, and obtaining the contribution rate of a group of features for each group of training set testing sets;
as shown in fig. 3, a random forest 1 is used to test a training test set 1, so as to obtain a contribution rate sequence from a feature 1 to a feature x in the training test set 1; testing the training test set 2 by using the random forest 2 to obtain the contribution rate sequence from the feature 1 to the feature x in the training test set 2; and testing the training test set N by using the random forest N to obtain the contribution rate sequence from the characteristic 1 to the characteristic X in the training test set N.
step S203, averaging the feature contribution rates, and selecting, as target features, several features that have a certain correlation and a smaller average contribution rate;
as shown in fig. 3, the feature contribution rates of each training test set are averaged, and several correlated features with a smaller average contribution rate are selected; the features in the first m rows of each training test set's feature ordering may be selected as the target features.
step S204, recombining the original indexes with PCA into a smaller set of mutually uncorrelated new composite indexes that replace them, the newly obtained lower-dimensional features replacing the original multi-dimensional features to obtain a new data set;
as shown in fig. 3, feature 1 through feature Y in each training test set may be combined using PCA to obtain a new data set.
The new data set reduces the feature dimensionality and ensures that each dimensionality feature contains more information.
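The PCA recombination of step S204 can be sketched as an eigen-decomposition of the covariance matrix (the toy matrix below is illustrative, not data from the patent):

```python
import numpy as np

def pca_reduce(X, n_components):
    """PCA sketch: center the data, take the covariance matrix's top
    eigenvectors (largest eigenvalues), and project onto them."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]               # the "second matrix"
    return Xc @ components                       # the reduced feature set

# Toy 6-sample, 2-feature data reduced to 1 principal component.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
X_reduced = pca_reduce(X, 1)
```

Sorting the eigenvectors by eigenvalue and keeping the top rows corresponds to the third/second matrix construction in claims 5 and 6; the projection keeps the directions carrying the most variance.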
Step S205, based on the new data set, carrying out misuse detection;
when selecting the algorithm, several algorithms such as a decision tree, a support vector machine, and naive Bayes can each be tested, and the algorithm with the highest test accuracy is finally selected, ensuring a low missed-report rate for the first-layer detection.
As shown in fig. 3, a decision tree, a support vector machine, and naive bayes may be used to test new data sets, respectively, add the new data set with the highest accuracy to the intrusion rule base, and perform misuse detection based on the data set.
Step S206, after the misuse test is carried out, extracting non-attack data and normal data in the characteristic rule base to be used as a new data set;
step S207, based on the data set after the misuse test, carrying out anomaly detection to obtain a strong classifier;
optionally, the anomaly detection trains a plurality of base classifiers based on the data set obtained after misuse detection, iteratively trains them with the Adaptive Boosting (AdaBoost) algorithm from ensemble learning, giving higher weight to misclassified samples in each round, and finally combines them into a strong classifier for anomaly detection.
And S208, carrying out intrusion detection by using the strong classifier, and extracting non-attack data and normal data in the characteristic rule base to be used as a final data set.
As shown in fig. 3, after the anomaly detection, there are some attack data that are not in the intrusion rule base, and a new data set can be formed by extracting and combining the attack data with the normal data.
Example 2
According to the embodiment of the present invention, an intrusion detection apparatus is further provided, which can execute the intrusion detection method in the above embodiment, and the specific implementation manner and the preferred application scenario are the same as those in the above embodiment, and are not described herein again.
Fig. 4 is a schematic diagram of an intrusion detection apparatus according to an embodiment of the present invention, as shown in fig. 4, the apparatus including:
an obtaining module 42 for obtaining a first feature data set;
a processing module 44, configured to perform dimension reduction processing on the first feature data set to obtain a second feature data set, where a dimension of the second feature data set is smaller than that of the first feature data set;
and a training module 46, configured to train an intrusion detection model by using the second feature data set, so as to obtain a trained intrusion detection model, where the trained intrusion detection model is used to perform intrusion detection on data to be detected.
Optionally, the processing module comprises: the dividing unit is used for dividing the first characteristic data set by using a cross verification method to generate a plurality of groups of data sets, wherein any two groups of data sets have a mutual exclusion relationship; the screening unit is used for carrying out feature screening on the multiple groups of data sets by utilizing the random forest model to obtain multiple groups of target feature sets, wherein each group of target feature sets comprises: a plurality of target features; and the processing unit is used for performing dimension reduction processing on the multiple groups of target feature sets to obtain a second feature data set.
Optionally, the screening unit comprises: the prediction subunit is used for predicting the multiple groups of data sets by using the random forest model to obtain a score value of each original characteristic contained in the multiple groups of original characteristic sets, wherein the score value is used for representing the importance degree of each original characteristic; the first obtaining subunit is used for obtaining a score mean value of each original feature based on the score value of each original feature contained in the multiple groups of original feature sets; and the first determining subunit is used for determining a plurality of groups of target feature sets based on the score mean value of each original feature.
Optionally, the determining subunit further performs ascending sorting on the plurality of original features according to the score mean of each original feature, and obtains a first preset number of the top original features in the sorted plurality of original features to obtain a plurality of target features.
Optionally, the processing unit comprises: the construction subunit is used for constructing a first matrix based on the multiple groups of target feature sets; a second obtaining subunit, configured to obtain a covariance matrix of the first matrix; a second determining subunit, configured to determine a second matrix based on the covariance matrix; the second obtaining subunit is further configured to obtain a product of the first matrix and the second matrix, and obtain a second feature data set.
Optionally, the second determining subunit is further configured to obtain eigenvalues and eigenvectors of the covariance matrix, sort the eigenvectors according to sizes of the eigenvalues, generate a third matrix, obtain a second preset number of row matrices in the third matrix, and generate the second matrix.
Optionally, the processing unit comprises: the first processing subunit is used for carrying out zero equalization processing on the first matrix to obtain a fourth matrix; and the third acquisition subunit is used for acquiring the covariance matrix of the fourth matrix.
Optionally, the processing unit comprises: the second processing subunit is used for performing centralized processing on the first matrix to obtain a fifth matrix; and the fourth acquiring subunit is used for acquiring the product of the fifth matrix and the second matrix to obtain a second characteristic data set.
Optionally, the processing module comprises: the dividing unit is also used for randomly dividing the plurality of groups of data sets for a plurality of times to obtain a plurality of groups of training sets and test sets; the first training unit is used for training the random forest model by utilizing a plurality of groups of training sets; the testing unit is used for testing the trained random forest model by using the testing set to obtain the total score of the trained random forest model; and the determining unit is used for determining whether the training of the random forest model is finished or not based on the total score.
Optionally, the training module comprises: the detection unit is used for carrying out misuse detection on the second characteristic data set to obtain a third characteristic data set, wherein the characteristic data contained in the third characteristic data set is used for representing non-attack data or normal data; and the second training unit is used for carrying out iterative training on the plurality of base classifiers by utilizing an integrated learning algorithm based on the third characteristic data set to obtain a trained intrusion detection model.
Optionally, the detection unit comprises: the prediction subunit is used for predicting the preset models of the different types by utilizing the second characteristic data set and determining the detection rates of the preset models of the different types; the third determining subunit is used for determining the preset model corresponding to the maximum detection rate as the target model; the detection subunit is used for carrying out misuse detection on the second characteristic data set by using the target model to obtain a detection result of the second characteristic data set; and the fifth acquiring subunit is used for acquiring a third characteristic data set based on the detection result of the second characteristic data set.
Optionally, the plurality of different types of preset models in the detection unit include: decision tree models, support vector machine models and naive Bayes models.
Optionally, the processing module is further configured to format the first feature data set to obtain a processed first feature data set, where types of variables included in the processed first feature data set are the same.
Example 3
According to an embodiment of the present invention, there is further provided a computer-readable storage medium, where the computer-readable storage medium includes a stored program, and when the program runs, the apparatus where the computer-readable storage medium is located is controlled to execute the intrusion detection method in embodiment 1.
Example 4
According to an embodiment of the present invention, there is further provided a processor, where the processor is configured to execute a program, where the program executes the intrusion detection method in embodiment 1.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (16)

1. An intrusion detection method, comprising:
acquiring a first characteristic data set;
performing dimensionality reduction processing on the first characteristic data set to obtain a second characteristic data set, wherein the dimensionality of the second characteristic data set is smaller than that of the first characteristic data set;
and training an intrusion detection model by utilizing the second characteristic data set to obtain a trained intrusion detection model, wherein the trained intrusion detection model is used for carrying out intrusion detection on data to be detected.
2. The method of claim 1, wherein performing dimension reduction on the first feature data set to obtain a second feature data set comprises:
dividing the first characteristic data set by using a cross verification method to generate a plurality of groups of data sets, wherein any two groups of data sets have a mutual exclusion relationship;
and performing feature screening on the multiple groups of data sets by using a random forest model to obtain multiple groups of target feature sets, wherein each group of target feature sets comprises: a plurality of target features;
and performing dimension reduction processing on the multiple groups of target feature sets to obtain the second feature data set.
3. The method of claim 2, wherein performing feature screening on the plurality of sets of data sets using the random forest model to obtain a plurality of sets of target feature sets comprises:
predicting the multiple groups of data sets by using the random forest model to obtain a score value of each original feature contained in the multiple groups of original feature sets, wherein the score value is used for representing the importance degree of each original feature;
obtaining a score mean value of each original feature based on the score value of each original feature contained in the multiple groups of original feature sets;
and determining the multiple groups of target feature sets based on the score mean of each original feature.
4. The method of claim 3, wherein determining the plurality of sets of target feature sets based on the mean score of each of the raw features comprises:
according to the score average value of each original feature, sequencing the original features in an ascending order;
and acquiring the first preset number of original features at the forefront in the sequenced plurality of original features to obtain the plurality of target features.
5. The method of claim 2, wherein performing dimension reduction on the plurality of sets of target feature sets to obtain the second feature data set comprises:
constructing a first matrix based on the plurality of sets of target feature sets;
acquiring a covariance matrix of the first matrix;
determining a second matrix based on the covariance matrix;
and acquiring the product of the first matrix and the second matrix to obtain the second characteristic data set.
6. The method of claim 5, wherein determining the second matrix based on the covariance matrix comprises:
obtaining an eigenvalue and an eigenvector of the covariance matrix;
sorting the eigenvectors according to the magnitude of the eigenvalues to generate a third matrix;
and acquiring a second preset number of row matrixes at the forefront in the third matrix to generate the second matrix.
7. The method of claim 5, wherein prior to obtaining the covariance matrix of the first matrix, the method further comprises:
carrying out zero equalization processing on the first matrix to obtain a fourth matrix;
obtaining the covariance matrix of the fourth matrix.
8. The method of claim 5, wherein prior to obtaining the product of the first matrix and the second matrix to obtain the second feature data set, the method further comprises:
performing centralization processing on the first matrix to obtain a fifth matrix;
and acquiring the product of the fifth matrix and the second matrix to obtain the second characteristic data set.
9. The method of claim 2, wherein before performing feature screening on the plurality of sets of data sets using the random forest model to obtain a plurality of sets of target feature sets, the method further comprises:
dividing the multiple groups of data sets for multiple times randomly to obtain multiple groups of training sets and test sets;
training the random forest model by using the multiple groups of training sets;
testing the trained random forest model by using the test set to obtain the total score of the trained random forest model;
determining whether training of the random forest model is complete based on the total score.
10. The method of claim 1, wherein training an intrusion detection model using the second feature data set to obtain the trained intrusion detection model comprises:
carrying out misuse detection on the second characteristic data set to obtain a third characteristic data set, wherein the characteristic data contained in the third characteristic data set is used for representing non-attack data or normal data;
and performing iterative training on a plurality of base classifiers by using an ensemble learning algorithm based on the third feature data set to obtain the trained intrusion detection model.
11. The method of claim 10, wherein performing misuse detection on the second feature data set to obtain a third feature data set comprises:
predicting a plurality of different types of preset models by utilizing the second characteristic data set, and determining the detection rate of the plurality of different types of preset models;
determining a preset model corresponding to the maximum detection rate as a target model;
carrying out misuse detection on the second characteristic data set by using the target model to obtain a detection result of the second characteristic data set;
and obtaining the third characteristic data set based on the detection result of the second characteristic data set.
12. The method of claim 11, wherein the plurality of different types of preset models comprises: decision tree models, support vector machine models and naive Bayes models.
13. The method of claim 1, wherein after acquiring the first feature data set, the method further comprises:
and formatting the first characteristic data set to obtain a processed first characteristic data set, wherein the types of variables contained in the processed first characteristic data set are the same.
14. An intrusion detection device, comprising:
an obtaining module, configured to obtain a first feature data set;
the processing module is used for performing dimensionality reduction processing on the first characteristic data set to obtain a second characteristic data set, wherein the dimensionality of the second characteristic data set is smaller than that of the first characteristic data set;
and the training module is used for utilizing the second characteristic data set to train the intrusion detection model to obtain a trained intrusion detection model, wherein the trained intrusion detection model is used for carrying out intrusion detection on data to be detected.
15. A computer-readable storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the intrusion detection method according to any one of claims 1 to 13.
16. A processor configured to execute a program, wherein the program executes to perform the intrusion detection method according to any one of claims 1 to 13.
CN202011248506.6A 2020-11-10 2020-11-10 Intrusion detection method and device Active CN112437053B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011248506.6A CN112437053B (en) 2020-11-10 2020-11-10 Intrusion detection method and device


Publications (2)

Publication Number Publication Date
CN112437053A true CN112437053A (en) 2021-03-02
CN112437053B CN112437053B (en) 2023-06-30


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101399672A (en) * 2008-10-17 2009-04-01 章毅 Intrusion detection method for fusion of multiple neutral networks
CN106878995A (en) * 2017-04-27 2017-06-20 重庆邮电大学 A kind of wireless sensor network Exception Type discrimination method based on perception data
CN106951778A (en) * 2017-03-13 2017-07-14 步步高电子商务有限责任公司 A kind of intrusion detection method towards complicated flow data event analysis
CN107395590A (en) * 2017-07-19 2017-11-24 福州大学 A kind of intrusion detection method classified based on PCA and random forest
US20180176243A1 (en) * 2016-12-16 2018-06-21 Patternex, Inc. Method and system for learning representations for log data in cybersecurity
CN108712404A (en) * 2018-05-04 2018-10-26 重庆邮电大学 A kind of Internet of Things intrusion detection method based on machine learning
CN109818798A (en) * 2019-02-19 2019-05-28 上海海事大学 A kind of wireless sensor network intruding detection system and method merging KPCA and ELM
CN110809009A (en) * 2019-12-12 2020-02-18 江苏亨通工控安全研究院有限公司 Two-stage intrusion detection system applied to industrial control network
CN110825068A (en) * 2019-09-29 2020-02-21 惠州蓄能发电有限公司 Industrial control system anomaly detection method based on PCA-CNN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHANG BAOHUA et al.: "Application of PCA-LSTM in Network Intrusion Detection", Value Engineering *
LIN WEINING et al.: "Research on an Intrusion Detection Algorithm Based on PCA and Random Forest Classification", Netinfo Security *
CHEN ZHUO et al.: "Network Intrusion Detection Model Based on Random Forest and XGBoost", Journal of Signal Processing *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113645182A (en) * 2021-06-21 2021-11-12 上海电力大学 Random forest detection method for denial of service attack based on secondary feature screening
CN113542276A (en) * 2021-07-16 2021-10-22 江苏商贸职业学院 Method and system for detecting intrusion target of hybrid network
CN113542276B (en) * 2021-07-16 2023-01-24 江苏商贸职业学院 Method and system for detecting intrusion target of hybrid network
CN113836527A (en) * 2021-11-23 2021-12-24 北京微步在线科技有限公司 Intrusion event detection model construction method and device and intrusion event detection method

Also Published As

Publication number Publication date
CN112437053B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN112437053B (en) Intrusion detection method and device
CN108960833B (en) Abnormal transaction identification method, equipment and storage medium based on heterogeneous financial characteristics
CN108875776B (en) Model training method and device, service recommendation method and device, and electronic device
CN106899440B (en) Network intrusion detection method and system for cloud computing
CN111027069B (en) Malicious software family detection method, storage medium and computing device
CN111914253B (en) Method, system, equipment and readable storage medium for intrusion detection
CN111612041A (en) Abnormal user identification method and device, storage medium and electronic equipment
CN110991474A (en) Machine learning modeling platform
CN109840413B (en) Phishing website detection method and device
WO2020114108A1 (en) Clustering result interpretation method and device
CN111695597B (en) Credit fraud group identification method and system based on improved isolated forest algorithm
CN112348080A RBF improvement method, device and equipment based on industrial control anomaly detection
CN113568368B (en) Self-adaptive determination method for industrial control data characteristic reordering algorithm
CN107609589A Feature learning method for complex behavior sequence data
Sasank et al. Credit card fraud detection using various classification and sampling techniques: a comparative study
Gao Stability analysis of rock slope based on an abstraction ant colony clustering algorithm
CN110929525A (en) Network loan risk behavior analysis and detection method, device, equipment and storage medium
CN110472659B (en) Data processing method, device, computer readable storage medium and computer equipment
Mhawi et al. Proposed Hybrid CorrelationFeatureSelectionForestPanalizedAttribute Approach to advance IDSs
CN116304518A (en) Heterogeneous graph convolution neural network model construction method and system for information recommendation
CN114513374B (en) Network security threat identification method and system based on artificial intelligence
CN111783088B (en) Malicious code family clustering method and device and computer equipment
CN114331731A Blockchain anomaly detection method based on PCA and RF, and related device
CN111428741B (en) Network community discovery method and device, electronic equipment and readable storage medium
CN110033031B (en) Group detection method, device, computing equipment and machine-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant