CN112507332A - Artificial intelligence network security attack flow retrieval method - Google Patents

Artificial intelligence network security attack flow retrieval method

Info

Publication number
CN112507332A
CN112507332A (application CN202011361014.8A)
Authority
CN
China
Prior art keywords
classifier
attack
class
classifiers
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011361014.8A
Other languages
Chinese (zh)
Inventor
张秋余
董瑞洪
袁晖
胡颖杰
王春霞
赵金雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of State Grid Gansu Electric Power Co Ltd
Lanzhou University of Technology
Original Assignee
Electric Power Research Institute of State Grid Gansu Electric Power Co Ltd
Lanzhou University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of State Grid Gansu Electric Power Co Ltd and Lanzhou University of Technology
Priority to CN202011361014.8A
Publication of CN112507332A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides an artificial intelligence network security attack flow retrieval method comprising the following steps. S1, selecting different groups of feature values, wherein different groups yield different detection accuracies for the attack types. S2, for each attack type, selecting the group of features with the highest detection accuracy for that type (close to 100%) and training a classifier with that group; repeating this for every type so that each type has its own dedicated classifier, and combining the classifiers for detection, so that if the data set contains i types, feature value selection and model training produce i binary classifiers. Each classifier has a very high recognition rate for its own class, and a classifier is said to "hit" when it judges a sample to belong to the class it targets. For each group of data, the corresponding feature values are extracted and fed to the i classifiers respectively, and the classification results are observed. The method identifies new attacks while detecting fixed attacks and automatically adjusts the model or strategy as attacks or the environment change.

Description

Artificial intelligence network security attack flow retrieval method
Technical Field
The invention relates to the technical field of network security, and in particular to an artificial intelligence network security attack flow retrieval method.
Background
Most existing research focuses on improving the accuracy of intrusion detection systems and reducing the false alarm rate. Adaptability remains an open problem and a shortcoming: most intrusion detection systems can only detect fixed attacks, cannot identify new attacks, and cannot adjust their models or strategies automatically when attacks or the environment change. Adaptability means that an IDS should be able to adapt to the needs of a new environment without administrator feedback. This implies that the IDS adaptation process cannot rely on labeled data and must use methods that work with unlabeled data.
Disclosure of Invention
The invention provides an artificial intelligence network security attack flow retrieval method to solve the technical problems described in the background.
The technical solution adopted by the invention to solve these technical problems is as follows:
The artificial intelligence network security attack flow retrieval method comprises the following steps:
S1, selecting different groups of feature values, wherein different groups yield different detection accuracies for the attack types.
S2, for each attack type, selecting the group of features with the highest detection accuracy for that type (close to 100%) and training a classifier with that group; repeating this for every type so that each type has its own dedicated classifier, and combining the classifiers for detection. If the data set contains i types, feature value selection and model training produce i binary classifiers. Each classifier has a very high recognition rate for its own class, and a classifier is said to "hit" when it judges a sample to belong to the class it targets. For each group of data, the corresponding feature values are extracted and fed to the i classifiers respectively, and the classification results fall into the following cases:
Case 1: exactly one classifier hits and all the others miss. The sample is assigned to the class targeted by the hitting classifier. This is the simplest case: the sample is a known attack or normal traffic;
Case 2: all classifiers miss. The sample may belong to a new class: it is either an unknown attack or a new form of a known attack (i.e., its attack signature differs from the known signatures), and a new form of a known attack is treated as a new attack. All such samples are therefore grouped into a single class regarded as a new attack type. When enough of these samples have accumulated for training, a new classifier is trained to recognize the class;
Case 3: more than one classifier hits. The sample may belong to a new class and is treated as an unknown attack; samples with the same hit pattern are treated as the same class, so in theory 2^i - 1 - i unknown-attack classes can be distinguished (C(i,2) patterns with two hits, C(i,3) with three hits, ..., and C(i,i) = 1 with i hits). When enough such samples have accumulated for training, the model is updated and a new binary classifier is added. Thus, once a sufficient number of "unknown" samples has triggered a model update, those samples become "known" and can be hit by the newly added classifier. A minimal sketch of this decision logic is given below.
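The sketch assumes each binary classifier exposes a scikit-learn style predict() method returning 1 for "this is my class"; the pooling scheme and the retraining threshold are illustrative assumptions rather than part of the claimed method.

```python
# Sketch of the per-sample decision logic for i binary classifiers (cases 1-3).
from collections import defaultdict

def classify_sample(features, classifiers):
    """classifiers: mapping of class name -> trained binary classifier.
    Returns a (verdict, detail) pair implementing cases 1-3."""
    hits = [name for name, clf in classifiers.items()
            if clf.predict([features])[0] == 1]   # 1 means "this is my class"
    if len(hits) == 1:
        return "known", hits[0]                   # case 1: known attack or normal traffic
    if not hits:
        return "new_attack", "all-miss"           # case 2: pooled into a single new class
    # case 3: the hit pattern identifies the class; with i classifiers there are
    # 2**i - 1 - i possible multi-hit patterns.
    return "unknown_attack", tuple(sorted(hits))

# Samples that are not "known" are pooled; once a pool is large enough, the model
# is updated by training a new binary classifier for that pool.
RETRAIN_THRESHOLD = 1000                          # assumed value
pools = defaultdict(list)

def process(features, classifiers):
    verdict, detail = classify_sample(features, classifiers)
    if verdict != "known":
        pools[(verdict, detail)].append(features)
        if len(pools[(verdict, detail)]) >= RETRAIN_THRESHOLD:
            # train a new binary classifier on the pooled samples here and
            # register it in `classifiers`, so these samples become "known"
            pass
    return verdict, detail
```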
Further, the selected classical machine learning algorithm may be one of SVM, random forest (RF), and AdaBoost.
Further, the Libsvm algorithm is used to train the multi-class classifier, and the classification accuracy is compared with and without data standardization.
Compared with the prior art, the invention has the following beneficial effects:
The artificial intelligence network security attack flow retrieval method can identify new attacks while detecting fixed attacks, and automatically adjusts the model or strategy when attacks or the environment change.
The present invention is explained in detail below with reference to the drawings and specific embodiments.
Drawings
FIG. 1 is a schematic diagram of the overall process architecture of the present invention;
FIG. 2 is a schematic diagram showing that adding data normalization greatly improves model accuracy;
FIG. 3 shows the precision and recall of the various classification algorithms on the validation set;
FIG. 4 shows the 40 most important features of the BRF classifier;
FIG. 5 shows the classifier training data;
FIG. 6 shows the precision of the binary classifiers for a specific class under different feature counts;
FIG. 7 shows the recall of the binary classifiers for a specific class under different feature counts;
FIG. 8 shows the minimum effective feature set for each class.
Detailed Description
To facilitate an understanding of the invention, it is described more fully below with reference to the accompanying drawings, in which several embodiments are shown. The invention may, however, be embodied in different forms and is not limited to the embodiments described herein; these embodiments are provided so that the disclosure is thorough and complete.
Example 1:
The artificial intelligence network security attack flow retrieval method comprises the following steps:
S1, selecting different groups of feature values, wherein different groups yield different detection accuracies for the attack types.
S2, for each attack type, selecting the group of features with the highest detection accuracy for that type (close to 100%) and training a classifier with that group; repeating this for every type so that each type has its own dedicated classifier, and combining the classifiers for detection. If the data set contains i types, feature value selection and model training produce i binary classifiers. Each classifier has a very high recognition rate for its own class, and a classifier is said to "hit" when it judges a sample to belong to the class it targets. For each group of data, the corresponding feature values are extracted and fed to the i classifiers respectively, and the classification results fall into the following cases:
Case 1: exactly one classifier hits and all the others miss. The sample is assigned to the class targeted by the hitting classifier. This is the simplest case: the sample is a known attack or normal traffic;
Case 2: all classifiers miss. The sample may belong to a new class: it is either an unknown attack or a new form of a known attack (i.e., its attack signature differs from the known signatures), and a new form of a known attack is treated as a new attack. All such samples are therefore grouped into a single class regarded as a new attack type. When enough of these samples have accumulated for training, a new classifier is trained to recognize the class;
Case 3: more than one classifier hits. The sample may belong to a new class and is treated as an unknown attack; samples with the same hit pattern are treated as the same class, so in theory 2^i - 1 - i unknown-attack classes can be distinguished (C(i,2) patterns with two hits, C(i,3) with three hits, ..., and C(i,i) = 1 with i hits). When enough such samples have accumulated for training, the model is updated and a new binary classifier is added. Thus, once a sufficient number of "unknown" samples has triggered a model update, those samples become "known" and can be hit by the newly added classifier.
The Libsvm algorithm is used to train the multi-class classifier. Comparing the classification accuracy with and without data standardization (FIG. 2) shows that standardizing the data greatly improves model performance; the data in all experiments in this report were standardized before training.
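As an illustration of this comparison, the following sketch trains an SVM (scikit-learn's SVC, which wraps libsvm) with and without standardization; the synthetic data set and parameters are assumptions for demonstration only, not the traffic data used in the experiments.

```python
# Sketch: compare SVM accuracy with and without data standardization.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed traffic feature vectors.
X, y = make_classification(n_samples=5000, n_features=40, n_informative=15,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = SVC().fit(X_tr, y_tr)                                      # no standardization
scaled = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)  # standardized

print("without standardization:", raw.score(X_te, y_te))
print("with standardization:   ", scaled.score(X_te, y_te))
```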
After partitioning and preprocessing, the data can be used to train a multi-class classifier. The classical machine learning algorithms that can be selected include SVM, random forest (RF), and AdaBoost. The precision (P) and recall (R) of the various classification models on the validation set are shown in FIG. 3.
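A minimal sketch of such a comparison, reporting per-class precision and recall with scikit-learn, is shown below; the synthetic data set, class weights, and hyperparameters are illustrative assumptions.

```python
# Sketch: per-class precision/recall of SVM, RF and AdaBoost on a validation set.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced synthetic data standing in for the traffic classes.
X, y = make_classification(n_samples=8000, n_features=40, n_informative=15,
                           n_classes=4, n_clusters_per_class=1,
                           weights=[0.7, 0.2, 0.07, 0.03], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RF": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name)
    print(classification_report(y_val, model.predict(X_val)))  # precision/recall per class
```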
Although the average accuracy of each classification algorithm on the validation set is above 95%, the sample distribution is uneven (Table 2) and the three categories "Web Attack", "Botnet ARES", and "Infiltration" contain few samples. The precision and recall of the SVM algorithm on these three categories are clearly worse than on the other categories, and for the other algorithms the precision and recall on these three small-sample categories are also somewhat lower than on the categories with more samples.
The BRF algorithm is the balanced random forest classification algorithm (BalancedRandomForestClassifier) provided by the imblearn library and is suited to class-imbalanced data. It is combined with SMOTETomek (provided by imblearn.combine), a variant of the synthetic minority oversampling technique (SMOTE), to resample the training samples so that the classes are balanced. All subsequent experiments in this report use the BRF + SMOTETomek algorithm combination.
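A minimal sketch of the BRF + SMOTETomek combination using the imblearn library follows; the synthetic imbalanced data stands in for the traffic set and the parameters are assumptions.

```python
# Sketch: BalancedRandomForestClassifier combined with SMOTETomek resampling.
from imblearn.combine import SMOTETomek
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=40, n_informative=15,
                           n_classes=4, n_clusters_per_class=1,
                           weights=[0.85, 0.10, 0.04, 0.01], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = Pipeline([
    ("resample", SMOTETomek(random_state=0)),                 # balance the classes
    ("brf", BalancedRandomForestClassifier(random_state=0)),
])
clf.fit(X_tr, y_tr)
print(classification_report(y_val, clf.predict(X_val)))
```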
The SMOTE algorithm brings no obvious improvement for the RF and AdaBoost algorithms, and no improvement for the SVM algorithm either, although it makes the model accuracy more stable. The essential reason is that the "Infiltration" category has too few samples and limited room for synthetic sampling, so the resampled samples of this category are highly homogeneous and insufficiently diverse.
With the BRF + SMOTETomek algorithm combination, the classification performance on all classes is at a high level. The purpose of feature selection is therefore to explore how few features can be selected while keeping the drop in recognition performance for each class small, i.e., to find a minimum effective feature subset.
Referring to FIG. 4, for the current classification task the BRF classifier can output the percentage contribution of each of the 79 features to the classification result; this contribution reflects the importance of the feature for classifying the samples (only the 40 most important features are shown in the figure).
To obtain the minimum effective feature subset of each class, 7 classifiers must be trained, each distinguishing one class from all the others. When training each classifier, the labels of all samples outside the target class are set to 0; the data statistics are shown in FIG. 5.
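A sketch of this one-vs-rest label construction is shown below; the class names are placeholders, since the exact class list is not reproduced here.

```python
# Sketch: train one binary BRF classifier per class by setting the labels of
# all samples outside the target class to 0 (class names are placeholders).
import numpy as np
from imblearn.ensemble import BalancedRandomForestClassifier

CLASSES = ["Benign", "DoS", "PortScan", "Brute Force",
           "Web Attack", "Botnet ARES", "Infiltration"]   # assumed 7 classes

def train_binary_classifiers(X, y):
    """y holds the original class names; returns {class name: trained BRF}."""
    classifiers = {}
    for cls in CLASSES:
        y_bin = np.where(y == cls, 1, 0)                  # non-class samples -> 0
        clf = BalancedRandomForestClassifier(random_state=0)
        clf.fit(X, y_bin)
        classifiers[cls] = clf
    return classifiers
```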
After the feature-contribution percentages of the classifiers are obtained, the k most important features are selected and the 7 classifiers are retrained. The value of k (the number of selected features) is gradually reduced, and the resulting drop in precision and recall for each class is observed, as shown in FIGS. 6 and 7.
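This selection-and-retraining loop could look like the following sketch, where the feature ranking comes from the trained classifier's feature_importances_ attribute; the choice of k values is an assumption.

```python
# Sketch: retrain a per-class binary BRF on its k most important features and
# record precision/recall for each k.
import numpy as np
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import precision_score, recall_score

def top_k_sweep(full_clf, X_tr, y_tr, X_val, y_val, ks=(20, 15, 10, 5)):
    """full_clf: BRF already trained on all features; y_* are binary labels."""
    order = np.argsort(full_clf.feature_importances_)[::-1]   # most important first
    results = {}
    for k in ks:
        cols = order[:k]
        small = BalancedRandomForestClassifier(random_state=0)
        small.fit(X_tr[:, cols], y_tr)
        pred = small.predict(X_val[:, cols])
        results[k] = (precision_score(y_val, pred), recall_score(y_val, pred))
    return results   # {k: (precision, recall)} for the targeted class
```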
Across the 7 classifiers, recall remains high for all classes when k is between 5 and 20. At k = 20 the performance of all binary classifiers is closest to that of the model using all features, and at k = 10 the precision of the "Infiltration" class drops sharply.
In summary, the binary classifiers trained on the 15 most important features of each class perform best; the minimum effective feature combination for each class is shown in FIG. 8.
The invention has been described above with reference to the accompanying drawings. The invention is obviously not limited to the above embodiments; adopting insubstantial modifications of the inventive method concept and technical solution, or applying the inventive concept and solution directly to other applications without modification, both fall within the scope of the invention.

Claims (3)

1. An artificial intelligence network security attack flow retrieval method, characterized by comprising the following steps:
S1, selecting different groups of feature values, wherein different groups yield different detection accuracies for the attack types.
S2, for each attack type, selecting the group of features with the highest detection accuracy for that type (close to 100%) and training a classifier with that group; repeating this for every type so that each type has its own dedicated classifier, and combining the classifiers for detection. If the data set contains i types, feature value selection and model training produce i binary classifiers. Each classifier has a very high recognition rate for its own class, and a classifier is said to "hit" when it judges a sample to belong to the class it targets. For each group of data, the corresponding feature values are extracted and fed to the i classifiers respectively, and the classification results fall into the following cases:
Case 1: exactly one classifier hits and all the others miss. The sample is assigned to the class targeted by the hitting classifier. This is the simplest case: the sample is a known attack or normal traffic;
Case 2: all classifiers miss. The sample may belong to a new class: it is either an unknown attack or a new form of a known attack (i.e., its attack signature differs from the known signatures), and a new form of a known attack is treated as a new attack. All such samples are therefore grouped into a single class regarded as a new attack type. When enough of these samples have accumulated for training, a new classifier is trained to recognize the class;
Case 3: more than one classifier hits. The sample may belong to a new class and is treated as an unknown attack; samples with the same hit pattern are treated as the same class, so in theory 2^i - 1 - i unknown-attack classes can be distinguished (C(i,2) patterns with two hits, C(i,3) with three hits, ..., and C(i,i) = 1 with i hits). When enough such samples have accumulated for training, the model is updated and a new binary classifier is added. Thus, once a sufficient number of "unknown" samples has triggered a model update, those samples become "known" and can be hit by the newly added classifier.
2. The artificial intelligence network security attack flow retrieval method according to claim 1, characterized in that the selected classical machine learning algorithm is one of SVM, random forest (RF), and AdaBoost.
3. The artificial intelligence network security attack flow retrieval method according to claim 1, characterized in that a Libsvm algorithm is used to train the multi-class classifier, and the classification accuracy is compared with and without data standardization.
CN202011361014.8A 2020-11-27 2020-11-27 Artificial intelligence network security attack flow retrieval method Pending CN112507332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011361014.8A CN112507332A (en) 2020-11-27 2020-11-27 Artificial intelligence network security attack flow retrieval method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011361014.8A CN112507332A (en) 2020-11-27 2020-11-27 Artificial intelligence network security attack flow retrieval method

Publications (1)

Publication Number Publication Date
CN112507332A true CN112507332A (en) 2021-03-16

Family

ID=74967041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011361014.8A Pending CN112507332A (en) 2020-11-27 2020-11-27 Artificial intelligence network security attack flow retrieval method

Country Status (1)

Country Link
CN (1) CN112507332A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10063582B1 (en) * 2017-05-31 2018-08-28 Symantec Corporation Securing compromised network devices in a network
CN108023876A (en) * 2017-11-20 2018-05-11 西安电子科技大学 Intrusion detection method and intruding detection system based on sustainability integrated study
CN108650194A (en) * 2018-05-14 2018-10-12 南开大学 Net flow assorted method based on K_means and KNN blending algorithms

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, Jie; YANG, Lili; YANG, Min: "Malicious network traffic detection based on ensemble classifiers", Journal on Communications, no. 10, 25 October 2018 (2018-10-25), pages 159 - 169 *

Similar Documents

Publication Publication Date Title
CN108632279B (en) Multilayer anomaly detection method based on network traffic
Gao et al. An adaptive ensemble machine learning model for intrusion detection
CN109768985B (en) Intrusion detection method based on flow visualization and machine learning algorithm
Eltanbouly et al. Machine learning techniques for network anomaly detection: A survey
CN110351301B (en) HTTP request double-layer progressive anomaly detection method
WO2020159439A1 (en) System and method for network anomaly detection and analysis
CN111740971A (en) Network intrusion detection model SGM-CNN based on class imbalance processing
CN106817248A (en) A kind of APT attack detection methods
CN112100614A (en) CNN _ LSTM-based network flow anomaly detection method
CN113489685B (en) Secondary feature extraction and malicious attack identification method based on kernel principal component analysis
Peng et al. Evaluating deep learning based network intrusion detection system in adversarial environment
CN112560596B (en) Radar interference category identification method and system
CN115622806B (en) Network intrusion detection method based on BERT-CGAN
Zheng Intrusion detection based on convolutional neural network
CN114254691A (en) Multi-channel operation wind control method based on active identification and intelligent monitoring
Hagar et al. Deep Learning for Improving Attack Detection System Using CSE-CICIDS2018
CN112507332A (en) Artificial intelligence network security attack flow retrieval method
CN116707992A (en) Malicious traffic avoidance detection method based on generation countermeasure network
Yerong et al. Intrusion detection based on support vector machine using heuristic genetic algorithm
Tun et al. Network anomaly detection using threshold-based sparse
Thanh et al. An approach to reduce data dimension in building effective network intrusion detection systems
CN113887633B (en) Malicious behavior identification method and system for closed source power industrial control system based on IL
CN113468555A (en) Method, system and device for identifying client access behavior
Amir et al. Efficient & Sustainable Intrusion Detection System Using Machine Learning & Deep Learning for IoT
Komadina et al. Detecting anomalies in firewall logs using artificially generated attacks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination