CN110689074A - Feature selection method based on fuzzy set feature entropy value calculation


Info

Publication number
CN110689074A
Authority
CN
China
Prior art keywords
feature
entropy
calculating
calculation
value
Prior art date
Legal status
Pending
Application number
CN201910914920.7A
Other languages
Chinese (zh)
Inventor
郭方方
孙思佳
赵天宇
吕宏武
冯光升
王瑞妮
王欣悦
何迪
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN201910914920.7A
Publication of CN110689074A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of security data analysis, and particularly relates to a feature selection method based on fuzzy set feature entropy value calculation. The method mainly comprises the steps of calculating an ideal vector matrix, calculating a similarity matrix, calculating the entropy of each feature, calculating a scaling factor SF_{i,d}, and calculating a feature entropy ranking SE_d. The invention uses the scaling factor of a feature and the entropy value between specific categories to calculate the distance between the ideal vectors of each category, thereby optimizing feature selection and reducing computational complexity. The method adopts a fuzzy-set information entropy calculation method, FIEE, to solve the problem that traditional information gain and information gain ratio calculations become infeasible when the feature value space is huge. The method can greatly reduce computational complexity.

Description

Feature selection method based on fuzzy set feature entropy value calculation
Technical Field
The invention belongs to the field of security data analysis, and particularly relates to a feature selection method based on fuzzy set feature entropy value calculation.
Background
In recent years, with the growing number of Internet users, network security has become an issue of great concern. An intrusion detection system (IDS) is an effective means of resisting the malicious behavior of network attackers, and a decisive factor in IDS performance is whether an efficient and accurate classification model is available; how to build a suitable classification model for the characteristics of security analysis data sets has therefore become a research hotspot in this field. A large number of redundant features is the most prominent characteristic of security analysis data; irrelevant features confuse classifiers during data analysis, increasing model complexity, reducing accuracy, and even causing overfitting. Feature selection can reduce the dimensionality, computational complexity, and computational cost introduced by irrelevant features, and to some extent mitigates the influence of redundant features on the classification model. Therefore, an optimized feature selection method that reduces the redundant features of security analysis data is of great significance for identifying and detecting malicious network attacks.
The problem currently encountered is that the real-time requirements of security monitoring in network security analysis are high, while a security analysis model built with a wrapper method depends heavily on training data and increases the complexity of model training, making it unsuitable for preprocessing security analysis data. Information entropy theory, by contrast, is widely applied in the field of network security analysis and can measure the degree of uncertainty of variables well, so it is used here to complete the feature selection process for security data.
Information entropy is an important concept in information theory and is generally used to measure the unpredictability of events in network security analysis and many related fields. Its main applications in network security analysis include intrusion detection, authentication and access control, privacy protection and trust computation, information hiding and digital forensics, cloud computing, and security problems in mobile social networks. Fuzzy set theory is also an effective feature selection tool: it rearranges feature sequences through a sorting mechanism and can measure the uncertainty of features using conditional information entropy. In these applications, information entropy theory evaluates well the contribution of data features to the whole model, and when applied to classification problems it markedly improves accuracy and running speed, which has made it a popular research direction in network security analysis.
In summary, aiming at the huge number of features and the high computational complexity in the field of network security analysis, the invention provides a feature selection method based on fuzzy set feature entropy calculation, which optimizes feature selection and reduces computational complexity.
Disclosure of Invention
The invention aims to solve the problems of the huge number of features and the high computational complexity in the field of network security analysis, and in particular the problem of feature importance evaluation for network security analysis data. Although classical entropy theory can directly evaluate the contribution of individual features to classification results, traditional entropy calculation is tied to the number of values a feature takes and can only handle discrete features. Fuzzy sets are an efficient tool for security analysis in big-data environments, so the invention provides a feature selection method based on fuzzy set feature entropy value calculation, which computes feature entropy values for security analysis data. The security data feature importance evaluation method based on information entropy theory (Feature Information Evaluating method based on information Entropy, FIEE) uses the scaling factor of a feature and the entropy value between specific categories to calculate the distance between the ideal vectors of each category, thereby optimizing feature selection and reducing computational complexity.
The purpose of the invention is realized as follows:
a feature selection method based on fuzzy set feature entropy calculation comprises the following steps:
step 1, calculating an ideal vector matrix: calculating a generalized mean value of all samples on a training data set X to obtain an ideal vector matrix V;
step 2, calculating a similarity matrix: extracting a feature set A from the training data set X according to the vector matrix V solved in step 1, and calculating, for each sample in the feature set A, the similarity S between feature d and the ideal vector v_{i,d} on feature d and class i;
step 3, calculating the entropy of the features: calculating the entropy value of each feature according to the similarity matrix obtained in step 2, and summing the entropy values of feature d over each sample and class;
step 4, calculating a scaling factor SF_{i,d}: calculating the difference between the mean value of each feature d in class i and its mean values in the other classes, and solving the scaling factor between each feature and class by calculating the relative distance;
step 5, calculating the feature entropy ranking SE_d: scaling the entropy calculated in step 3 by the scaling factor SF_{i,d} obtained in step 4 to obtain the final feature entropy ranking SE_d;
and step 6, deleting the feature with the highest entropy value, using an iterative SVM (support vector machine) algorithm to classify input network data into two types, judging whether it is intrusion data or normal data, verifying the performance of the algorithm on a test set, and repeating the above steps until the performance-drop criterion of the classification model is met.
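The steps above can be sketched in code. This is a minimal sketch, not the patented implementation: the patent renders its formulas as images, so the generalized-mean exponent m, the similarity expression (1 − |x^p − v^p|)^{1/p}, and the De Luca–Termini entropy form are assumptions borrowed from the Luukka-style fuzzy-entropy construction that the invention is later compared against.

```python
import numpy as np

def fiee_rank(X, y, m=1.0, p=1.0, eps=1e-12):
    """Sketch of steps 1-5: score every feature d by a scaled entropy SE_d.

    X: (n, D) sample matrix scaled into [0, 1]; y: (n,) integer class labels.
    m and p are hypothetical parameters (the patent's formulas are images).
    Smaller SE_d is taken to mean a more informative feature.
    """
    classes = np.unique(y)
    N, D = len(classes), X.shape[1]
    # Step 1: ideal vector matrix V -- generalized mean of each class.
    V = np.array([np.mean(X[y == c] ** m, axis=0) ** (1.0 / m) for c in classes])
    SE = np.zeros(D)
    for d in range(D):
        H = np.zeros(N)
        SF = np.zeros(N)
        for i in range(N):
            # Step 2: similarity of every sample to the ideal value v_{i,d}.
            S = (1.0 - np.abs(X[:, d] ** p - V[i, d] ** p)) ** (1.0 / p)
            S = np.clip(S, eps, 1.0 - eps)
            # Step 3: De Luca-Termini fuzzy entropy summed over the samples.
            H[i] = -np.sum(S * np.log(S) + (1.0 - S) * np.log(1.0 - S))
            # Step 4: scaling factor from mean differences to the other classes.
            SF[i] = 1.0 - np.sum(np.abs(V[i, d] - np.delete(V[:, d], i))) / (N - 1)
        # Step 5: entropy scaled by SF, summed over the classes.
        SE[d] = np.sum(SF * H)
    return SE
```

Step 6 would then repeatedly drop the feature with the largest SE_d and re-verify a classifier on the test set.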
The similarity S in step 2 between the ideal vector v_{i,d} on feature d and class i and the feature d is specifically:

[Formula provided as an image in the original publication; not reproduced here.]

wherein v_{i,d} represents the corresponding value of class i on feature d in the ideal vector of the training samples. Since the formula for calculating the similarity S is applied to every sample x_j in the training sample set X and every class i, the resulting S(x_{j,d}, v_{i,d}) is a matrix of dimension n × (D·N), where n is the size of the sample set, D the number of features, and N the number of classes in the sample classification space.
The calculation of the entropy value of a feature in step 3 is further:

[Formula provided as an image in the original publication; not reproduced here.]

where n is the size of the sample set, N is the number of classes in the sample classification space, and S(x_{j,d}, v_{i,d}) is a matrix of dimension n × (D·N).
The specific calculation method of the scaling factor SF_{i,d} in step 4 is as follows:
The aim is to calculate the difference between the mean value of a feature d in class i and the mean values of feature d in the other classes. SF_{i,d} is defined with range [0,1]; N represents the number of classes, I indexes a class, and the denominator is N-1 because the calculation involves all classes other than i itself. The scaling factor SF_{i,d} is specifically:

SF_{i,d} = 1 - (1/(N-1)) · Σ_{I=1, I≠i}^{N} |v_{i,d} - v_{I,d}|

To correspond to the concept of entropy, the calculated mean difference is subtracted from 1, so the result also lies in [0,1]; the value of SF_{i,d} expresses that when the values in a class differ greatly from the means of the other classes, the current feature receives a smaller entropy value in that class.
The specific calculation process of the feature entropy ranking in step 5 is:

SE_d = Σ_{i=1}^{N} SF_{i,d} · H_{i,d}

wherein SF_{i,d} represents the scaling factor, H_{i,d} the feature entropy value, and N the number of classes;
sorting the SE_d values calculated for the set of features from small to large yields an ordered sequence of feature importance from large to small.
The invention has the beneficial effects that:
the algorithm provided by the invention adopts a fuzzy centralized information entropy calculation method FIEE to solve the problem that the calculation cannot be carried out due to the huge characteristic value space in the traditional information gain and information gain ratio calculation method. The method can greatly reduce the computational complexity.
Drawings
FIG. 1 is a flow chart of an FIEE algorithm based on fuzzy set feature entropy calculation;
FIG. 2 is a diagram of the experimental environment configuration of UNSW-NB15.
Detailed Description
The invention provides a feature subset selection method based on fuzzy set feature entropy value calculation. The technical solution of the present invention will be described in further detail by the following specific embodiments.
To verify the effectiveness of the invention, experiments were conducted on the security analysis data set UNSW-NB15, observing the feature importance evaluation process and comparing the performance of the proposed method with classical feature importance evaluation methods such as ReliefF, the Laplacian score, and the Luukka method. To verify the performance of the invention on data sets other than security analysis data, tests were also performed on a chronic kidney disease data set.
(1) UNSW-NB15 dataset profiles
The UNSW-NB15 data set used here is a realistic hybrid of normal user behavior and attack behavior in contemporary networks, generated by the Cyber Range Lab (CRL) of the Australian Centre for Cyber Security (ACCS) with the IXIA PerfectStorm tool and collected in 2015. The raw data size is 100 GB, containing 2 million records. The data set contains nine types of attack: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. The experimental environment configuration during data collection is shown in FIG. 2. The IXIA system is configured with three virtual servers: servers 1 and 3 propagate normal traffic, while server 2 generates abnormal traffic or performs malicious attacks in the network. To establish interworking between the servers and obtain public and private network data, there are two virtual interfaces with IP addresses 10.40.85.30 and 10.40.184.30. The servers connect to the hosts through two routers: router 1 has the IP addresses 10.40.85.1 and 10.40.182.1, and router 2 is configured with the IP addresses 10.40.184.1 and 10.40.183.1. Both routers connect to a firewall configured to pass all traffic, normal as well as abnormal. The tcpdump toolkit is installed on router 1 to capture Pcap files while the network environment runs normally. The core intent of the overall test platform is to capture normal and abnormal traffic data originating from the IXIA tool and spread across the various network nodes, servers as well as clients. In total, 49 features related to the network environment were collected; the raw features, statistical features, and cross features are described in Tables 1, 2, and 3 respectively.
Table 1 raw feature description in UNSW-NB15 dataset
[Table 1 is provided as an image in the original publication and is not reproduced here.]
TABLE 2 statistical characterization of the UNSW-NB15 data set
[Table 2 is provided as an image in the original publication and is not reproduced here.]
TABLE 3 Cross-characterization in UNSW-NB15 dataset
[Table 3 is provided as an image in the original publication and is not reproduced here.]
TABLE 4 tag characterization in UNSW-NB15 dataset
[Table 4 is provided as an image in the original publication and is not reproduced here.]
Tables 1 to 4 describe in detail the meaning of each feature in the data source collected by the IDS system used in the experiment, together with its number; in the subsequent description of the experiments, only the feature number is used to identify a feature.
(2) Brief introduction to Chronic Kidney disease
This chronic kidney disease data set, collected in 2015, was the first relevant data set in its field. The data set contains 24 features, of which 11 are numerical and 13 are discrete. It corresponds to a two-class problem, i.e., the label is 1 or 0, corresponding respectively to patients with chronic kidney disease and to healthy persons. Table 5 gives a detailed description of the 24 features.
TABLE 5 Chronic kidney disease data set description
[Table 5 is provided as an image in the original publication and is not reproduced here.]
In addition, since the feature importance evaluation method is an unsupervised learning method, the label-type feature is not evaluated.
The specific experimental contents of the invention are as follows:
step 1: an ideal vector matrix is calculated. And calculating a generalized mean value of all samples on the training data set X to calculate an ideal vector matrix V.
Step 2: and calculating a similarity matrix. Extracting a feature set A from a training data set X according to the vector matrix V solved in the step 1, and calculating an ideal vector V of each sample in the feature set A on the feature d and the class ii,dSimilarity to feature d, S.
And step 3: the entropy of the features is calculated. And (3) calculating the entropy value of each characteristic according to the similarity matrix result calculated in the step 2, and summing the entropy values of the characteristic d on each sample and each category.
And 4, step 4: calculating a scaling factor SFi,d. And calculating the difference between the mean value of each characteristic d in the category i and the mean values of the characteristics d in other categories. The scaling factor between each feature and class is solved by calculating the relative distance.
And 5: computing a feature entropy ranking SEd. According to the scaling factor SF obtained in step 4i,dThe entropy calculated in step 3 is scaled as a result of (3), to obtain the final feature entropy ranking SEd
Step 6: the feature with the highest entropy value is deleted. The security analysis model is divided into two types by using an iterative SVM algorithm, and whether the input network data is intrusion data or normal data is judged. And verifying the performance of the algorithm on the test set, and repeating the steps until the performance reduction standard of the classification model is met to 1%.
The similarity S in step 2 between the ideal vector v_{i,d} on feature d and class i and the feature d is specifically:

[Formula provided as an image in the original publication; not reproduced here.]

wherein v_{i,d} represents the corresponding value of class i on feature d in the ideal vector of the training samples. Since this formula is applied to every sample x_j in the training sample set X and every class i, the resulting S(x_{j,d}, v_{i,d}) is a matrix of dimension n × (D·N), where n is the size of the sample set, D the number of features, and N the number of classes in the sample classification space.
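The n × (D·N) layout described above can be sketched directly. The concrete similarity expression used here, (1 − |x^p − v^p|)^{1/p}, is an assumption borrowed from the Luukka-style construction, since the patent prints the formula only as an image:

```python
import numpy as np

def similarity_matrix(X, V, p=1.0):
    """Builds the n x (D*N) matrix S(x_{j,d}, v_{i,d}) described in step 2.

    X: (n, D) samples scaled into [0, 1]; V: (N, D) ideal vectors.
    Column d*N + i holds the similarity of every sample to class i on feature d.
    """
    n, D = X.shape
    N = V.shape[0]
    S = np.empty((n, D * N))
    for d in range(D):
        for i in range(N):
            S[:, d * N + i] = (1.0 - np.abs(X[:, d] ** p - V[i, d] ** p)) ** (1.0 / p)
    return S
```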
The calculation of the entropy value of a feature in step 3 is further:

[Formula provided as an image in the original publication; not reproduced here.]

where n is the size of the sample set, N is the number of classes in the sample classification space, and S(x_{j,d}, v_{i,d}) is a matrix of dimension n × (D·N).
The specific calculation method of the scaling factor SF_{i,d} in step 4 is as follows:
The aim is to calculate the difference between the mean value of a feature d in class i and the mean values of feature d in the other classes. SF_{i,d} is defined with range [0,1]; N represents the number of classes, I indexes a class, and the denominator is N-1 because the calculation involves all classes other than i itself. The scaling factor SF_{i,d} is specifically:

SF_{i,d} = 1 - (1/(N-1)) · Σ_{I=1, I≠i}^{N} |v_{i,d} - v_{I,d}|

To correspond to the concept of entropy, the calculated mean difference is subtracted from 1, so the result also lies in [0,1]; the value of SF_{i,d} expresses that when the values in a class differ greatly from the means of the other classes, the current feature receives a smaller entropy value in that class.
The specific calculation process of the feature entropy ranking in step 5 is:

SE_d = Σ_{i=1}^{N} SF_{i,d} · H_{i,d}

wherein SF_{i,d} represents the scaling factor, H_{i,d} the feature entropy value, and N the number of classes. Sorting the SE_d values calculated for the set of features from small to large yields an ordered sequence of feature importance from large to small.
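The final combination and sorting step can be illustrated with hypothetical SF_{i,d} and H_{i,d} values for N = 2 classes and D = 5 features:

```python
import numpy as np

# Hypothetical scaling factors and entropy values (rows: classes, cols: features).
SF = np.array([[0.2, 0.9, 0.1, 0.8, 0.5],
               [0.3, 0.8, 0.2, 0.9, 0.4]])
H = np.array([[1.0, 2.0, 1.5, 0.5, 1.2],
              [1.1, 1.8, 1.4, 0.6, 1.0]])
# SE_d = sum over classes of SF_{i,d} * H_{i,d}; smaller = more informative.
SE = (SF * H).sum(axis=0)
# Sorting SE ascending lists the features from most to least important.
order = np.argsort(SE)
print(order.tolist())  # [2, 0, 3, 4, 1]
```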
(1) Performance testing of UNSW-NB15 data set
Table 6 FIEE algorithm execution process on UNSW-NB15
[Table 6 is provided as an image in the original publication and is not reproduced here.]
In order to verify the validity of the FIEE algorithm, comparative experiments were performed on the UNSW-NB15 data set against the ReliefF method, the Luukka method, and the method proposed by Luukka in 2011; the specific operation of each algorithm is shown in Table 6. The FIEE method improves model accuracy by 10.57 percent, much higher than the 6.61 and 8.72 percent of the other two methods. During feature deletion the FIEE method removes 25 features, the largest number among all the methods, which means the proposed algorithm achieves higher performance on fewer features.
(2) Performance testing of Chronic Kidney disease Collection
TABLE 7 FIEE implementation on Chronic renal disease datasets
[Table 7 is provided as an image in the original publication and is not reproduced here.]
From the feature selection process in Table 7 it can be seen that this data set is a typical data set laced with many redundant features: the classification model maintains its accuracy throughout the feature deletion process, indicating that the deleted features have no influence on the classification problem. Although all 4 tested methods achieve high accuracy, the FIEE algorithm reaches 99.86% accuracy with only 3 features, whereas the final feature subsets of the other 3 methods contain 10, 13, and 9 features respectively; the FIEE method thus greatly reduces the complexity of model building. The relationship between classification-model accuracy and the number of removed features is analyzed below.
Finally, it should be noted that although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that modifications may still be made to the embodiments, or equivalent replacements made for some of their features, without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in its protection scope.

Claims (5)

1. A feature selection method based on fuzzy set feature entropy calculation is characterized by comprising the following steps:
step 1, calculating an ideal vector matrix: calculating a generalized mean value of all samples on a training data set X to calculate an ideal vector matrix V;
step 2, calculating a similarity matrix: extracting a feature set A from the training data set X according to the vector matrix V solved in step 1, and calculating, for each sample in the feature set A, the similarity S between feature d and the ideal vector v_{i,d} on feature d and class i;
step 3, calculating the entropy of the features: calculating the entropy value of each feature according to the similarity matrix obtained in step 2, and summing the entropy values of feature d over each sample and class;
step 4, calculating a scaling factor SF_{i,d}: calculating the difference between the mean value of each feature d in class i and its mean values in the other classes, and solving the scaling factor between each feature and class by calculating the relative distance;
step 5, calculating the feature entropy ranking SE_d: scaling the entropy calculated in step 3 by the scaling factor SF_{i,d} obtained in step 4 to obtain the final feature entropy ranking SE_d;
and step 6, deleting the feature with the highest entropy value, using an iterative SVM (support vector machine) algorithm to classify input network data into two types, judging whether it is intrusion data or normal data, verifying the performance of the algorithm on a test set, and repeating the above steps until the performance-drop criterion of the classification model is met.
2. The feature selection method based on fuzzy set feature entropy calculation of claim 1, wherein the similarity S in step 2 between the ideal vector v_{i,d} on feature d and class i and the feature d is specifically:

[Formula provided as an image in the original publication; not reproduced here.]

wherein v_{i,d} represents the corresponding value of class i on feature d in the ideal vector of the training samples; since the formula for calculating the similarity S is applied to every sample x_j in the training sample set X and every class i, the resulting S(x_{j,d}, v_{i,d}) is a matrix of dimension n × (D·N), where n is the size of the sample set, D the number of features, and N the number of classes in the sample classification space.
3. The feature selection method based on fuzzy set feature entropy calculation of claim 1, wherein the calculation of the entropy value of a feature in step 3 is further:

[Formula provided as an image in the original publication; not reproduced here.]

where n is the size of the sample set, N is the number of classes in the sample classification space, and S(x_{j,d}, v_{i,d}) is a matrix of dimension n × (D·N).
4. The feature selection method based on fuzzy set feature entropy calculation of claim 1, wherein the specific calculation method of the scaling factor SF_{i,d} in step 4 is:
the aim is to calculate the difference between the mean value of a feature d in class i and the mean values of feature d in the other classes; SF_{i,d} is defined with range [0,1], N represents the number of classes, I indexes a class, and the denominator is N-1 because the calculation involves all classes other than i itself; the scaling factor SF_{i,d} is specifically:

SF_{i,d} = 1 - (1/(N-1)) · Σ_{I=1, I≠i}^{N} |v_{i,d} - v_{I,d}|

to correspond to the concept of entropy, the calculated mean difference is subtracted from 1, so the result also lies in [0,1], and the value of SF_{i,d} expresses that when the values in a class differ greatly from the means of the other classes, the current feature receives a smaller entropy value in that class.
5. The feature selection method based on fuzzy set feature entropy calculation of claim 1, wherein the specific calculation process of the feature entropy ranking in step 5 is:

SE_d = Σ_{i=1}^{N} SF_{i,d} · H_{i,d}

wherein SF_{i,d} represents the scaling factor, H_{i,d} the feature entropy value, and N the number of classes; sorting the SE_d values calculated for the set of features from small to large yields an ordered sequence of feature importance from large to small.
CN201910914920.7A 2019-09-26 2019-09-26 Feature selection method based on fuzzy set feature entropy value calculation Pending CN110689074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910914920.7A CN110689074A (en) 2019-09-26 2019-09-26 Feature selection method based on fuzzy set feature entropy value calculation


Publications (1)

Publication Number Publication Date
CN110689074A true CN110689074A (en) 2020-01-14

Family

ID=69110270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910914920.7A Pending CN110689074A (en) 2019-09-26 2019-09-26 Feature selection method based on fuzzy set feature entropy value calculation

Country Status (1)

Country Link
CN (1) CN110689074A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420291A (en) * 2021-07-19 2021-09-21 宜宾电子科技大学研究院 Intrusion detection feature selection method based on weight integration

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978050A (en) * 2019-03-25 2019-07-05 北京理工大学 Decision Rules Extraction and reduction method based on SVM-RF
CN110266672A (en) * 2019-06-06 2019-09-20 华东理工大学 Network inbreak detection method based on comentropy and confidence level down-sampling


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
潮洛蒙 (Chaoluomeng), "Research on feature selection method based on information entropy and iterative SVM", China Master's Theses Full-text Database, Information Science and Technology Series *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200114