CN112183598A - Feature selection method based on genetic algorithm - Google Patents

Feature selection method based on genetic algorithm

Info

Publication number
CN112183598A
Authority
CN
China
Prior art keywords: features, correlation, redundancy, genetic algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010996242.6A
Other languages
Chinese (zh)
Inventor
周红芳
郭晓杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202010996242.6A
Publication of CN112183598A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211: Selection of the most significant subset of features
    • G06F 18/2111: Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Physiology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a feature selection method based on a genetic algorithm, namely a feature selection algorithm that combines two-step filtering with a genetic algorithm. By analyzing the correlation between the features and the class and the redundancy among the features, the method divides the features into four parts, namely strongly correlated features, weakly correlated non-redundant features, weakly correlated redundant features and irrelevant features, and then uses these four parts to guide the initialization of the genetic algorithm for feature selection. Experiments show that, compared with the traditional random initialization strategy, the improved initialization strategy selects fewer features and obtains higher classification accuracy.

Description

Feature selection method based on genetic algorithm
Technical Field
The invention belongs to the technical field of data preprocessing, and relates to a feature selection method based on a genetic algorithm.
Background
With the advent of the big data age, ever-increasing data dimensionality has created the problem of the "curse of dimensionality", and feature selection is one of the effective methods to solve it. Feature selection is a dimension reduction method that selects m features (m < M) from the original M features to represent the data. Feature selection reduces dimensionality by removing irrelevant and redundant features while preserving the performance of the algorithms executed on the data. Its advantages are reducing the number of features, avoiding overfitting, saving storage space and improving the execution efficiency of algorithms. Feature selection is widely applied in image classification, book classification, the financial field, the medical field, and so on.
There are three classes of feature selection methods: the Filter method, the Wrapper method and the Embedded method. The Filter method is the most common: it scores and ranks the features using measures such as information entropy, distance and correlation coefficients, and then retains the features whose scores exceed a threshold. Its advantages are strong generality, simple and efficient computation, and suitability for preprocessing large-scale data; its shortcoming is that the filtering process is independent of any model, so the resulting performance is mediocre. The Wrapper method has received increasing attention because it overcomes this shortcoming of the Filter method: it selects a candidate feature subset according to a search strategy and then classifies the samples with a classifier, using the classification accuracy as the criterion for the quality of the feature subset. Wrapper improves performance, but because a classifier must be trained and its accuracy computed for every candidate feature subset, its time complexity is high. From the Wrapper perspective, feature selection can be regarded as an optimization process. The Embedded method embeds feature selection as a component of the model itself and selects the features that benefit model training while the model is being built; it is fast and tailored to the algorithm, but requires adjusting the model structure and parameter configuration.
The performance of the Wrapper method depends largely on how good the selected feature subset is. Assuming the original feature set contains n features, there are 2^n - 1 possible non-empty feature subsets, and the search strategy must find an optimal feature subset in a search space of 2^n - 1 candidate solutions; for example, with n = 20 there are already more than 10^6 candidate subsets. There are currently three search strategies for feature subsets: complete search, heuristic search and stochastic search, with heuristic and stochastic search being the most common. Heuristic search decides at each iteration whether the remaining features should be selected according to some heuristic rule; it includes forward selection, backward selection, instance-based selection, and so on. Classical algorithms include Sequential Forward Selection (SFS), Sequential Backward Selection (SBS) and Relief. SFS and SBS are simple and fast, but they can only add or only remove features and easily fall into local optima. Relief uses a distance measure as its evaluation index, assigns different weights to features according to their ability to discriminate between nearby samples, and removes the features whose weights fall below a threshold. Relief runs very efficiently, but it is limited to binary problems and cannot remove redundant features effectively.
When the number n of features in the original feature set is too large, complete search and heuristic search algorithms perform poorly. Stochastic search is widely used because of its wide search range and its suitability for optimization problems with complex structure. A stochastic search starts from a randomly generated feature subset and gradually approaches the globally optimal solution according to certain heuristic information and rules. Common stochastic search algorithms include the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Simulated Annealing (SA) and Ant Colony Optimization (ACO). Because feature selection is a complex high-dimensional problem, the genetic algorithm performs better than the other stochastic search algorithms: it imposes few mathematical requirements on the optimization problem being solved and can handle discrete or continuous problems, whether linear or nonlinear.
Disclosure of Invention
The invention aims to provide a feature selection method based on a genetic algorithm, which solves the problem that feature selection methods based on the traditional genetic algorithm in the prior art neglect the influence of the initial population on the final result.
The invention adopts the technical scheme that a feature selection method based on a genetic algorithm is implemented according to the following steps:
Step 1: data preprocessing, namely performing equal-width discretization on continuous data, filling missing values with the mean of the corresponding attribute, and handling outliers with boxplot analysis;
Step 2: feature classification, namely dividing the features, according to information entropy, into four feature subsets: strongly correlated, weakly correlated non-redundant, weakly correlated redundant and irrelevant;
Step 3: genetic algorithm, namely using the classification result obtained in step 2 to guide the initialization of the genetic algorithm population, and then iterating to achieve feature selection;
Step 4: evaluation of the results.
The invention is also characterized in that:
The strongly correlated features in step 2 are defined as follows: a feature is called a strongly correlated feature if and only if it belongs to the strongly correlated feature subset;
a weakly correlated non-redundant feature is defined as follows: a feature is called a weakly correlated non-redundant feature if and only if it belongs to the weakly correlated non-redundant feature subset;
a weakly correlated redundant feature is defined as follows: a feature is called a weakly correlated redundant feature if and only if it belongs to the weakly correlated redundant feature subset;
an irrelevant feature is defined as follows: a feature is called an irrelevant feature if and only if it belongs to the irrelevant feature subset.
The step 2 is implemented according to the following steps:
Step 2.1, calculating the correlation between the features and the class feature: the symmetric uncertainty is used to measure the correlation between each feature and the class feature, and the features are divided into a strongly correlated feature subset, a weakly correlated feature subset and an irrelevant feature subset according to this correlation;
Step 2.2, calculating the redundancy among features: interaction information is used to distinguish the redundant features among the weakly correlated features obtained in step 2.1, and the weakly correlated features are finally further divided into two feature subsets, weakly correlated non-redundant and weakly correlated redundant.
Step 3 is specifically implemented according to the following steps:
Step 3.1, encoding: binary encoding is adopted;
Step 3.2, initialization: the features are divided by the two-step filter method into four parts, namely strong correlation (SR), weak correlation non-redundancy (WRNR), weak correlation redundancy (WRR) and irrelevance (IR), so that the SR and WRNR parts have a larger probability of being set to 1 and the WRR and IR parts have a smaller probability of being set to 1;
Step 3.3, fitness calculation: the classification accuracy of the SVM classifier and the NB classifier is used as the individual fitness;
Step 3.4, selection: an improved layered selection operator divides the individuals into three parts, namely high, medium and low fitness, according to fitness, and each part is selected by tournament selection;
Step 3.5, crossover: an improved crossover operator is used; because each individual or chromosome in the population of the proposed algorithm consists of the four parts of strong correlation, weak correlation non-redundancy, weak correlation redundancy and irrelevance, the crossover operation randomly exchanges one of the four parts between two parent chromosomes;
Step 3.6, mutation: the mutation operator uses uniform mutation;
Step 3.7, elite retention strategy: the two best individuals of each generation are added to an elite population;
Step 3.8, termination condition: the algorithm terminates when the fitness of the best individual reaches a given threshold or the number of iterations reaches a preset number.
Step 4 is specifically implemented as follows: experiments are performed with 10-fold cross validation, with the SVM and NB respectively serving as the classifier, and the finally obtained feature subset is evaluated by its classification accuracy and the number of selected features.
Definition of the symmetric uncertainty of step 2.1, formula (4):
SU(X;Y) = 2I(X;Y)/(H(X) + H(Y))  (4)
wherein, I (X; Y) represents mutual information between random variables X and Y, and H (X) and H (Y) respectively represent information entropy.
Definition of the interaction information of step 2.2, formula (6):
I(X;Y;Z)=I(X;Y|Z)-I(X;Y) (6)
where I (X; Y; Z) is used to measure the redundancy of the random variables X and Z with respect to Y.
The invention has the beneficial effects that:
1. Compared with five classical feature selection algorithms (MRMR, IWFS, CFR, IGDGA and GGA), the feature selection algorithm based on the combination of two-step filtering and the genetic algorithm obtains the highest classification accuracy for the finally obtained feature subset while selecting the fewest features.
Drawings
FIG. 1 is a flow chart of a genetic algorithm based feature selection method of the present invention;
FIG. 2 is a flow chart of step 2 of a genetic algorithm based feature selection method of the present invention;
FIG. 3 is a flow chart of step 3 of a genetic algorithm based feature selection method of the present invention;
FIG. 4 shows the number of features in each of the four feature subsets obtained in step 2 of the feature selection method based on the genetic algorithm of the present invention;
FIG. 5 compares the uncertainty of the initial population of the feature selection method based on the genetic algorithm of the present invention with that of a conventional genetic algorithm;
FIG. 6 shows the number of features in the optimal feature subset when the SVM classifier is used as the fitness measure in the feature selection method based on the genetic algorithm of the present invention;
FIG. 7 shows the number of features in the optimal feature subset when the NB classifier is used as the fitness measure in the feature selection method based on the genetic algorithm of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to a feature selection method based on a genetic algorithm, which is specifically implemented according to the following steps as shown in figure 1:
Step 1, data preprocessing. A data set may contain continuous data, missing values and outliers, so the data must be preprocessed: continuous data are discretized into equal-width intervals; missing values are filled with the mean of the corresponding attribute; outliers are handled with boxplot analysis.
Step 2, feature classification. As shown in fig. 2, feature selection can be defined as the process of detecting relevant features and discarding irrelevant and redundant ones, its purpose being to obtain a feature subset that maintains or even improves the performance obtained on the original data set. The invention first classifies the features by correlation into strongly correlated features (SR), weakly correlated features (WR) and irrelevant features (IR); since the weakly correlated features account for the majority of all features, redundant features exist among them. The weakly correlated features are therefore further classified into weakly correlated non-redundant features (WRNR) and weakly correlated redundant features (WRR).
The four classes of features in step 2 are defined as follows:
definition 1: strongly correlated features are referred to as strongly correlated features if and only if the features belong to a subset of strongly correlated features.
Definition 2: weakly correlated non-redundant features are referred to as weakly correlated non-redundant features if and only if the features belong to a subset of weakly correlated non-redundant features.
Definition 3: a weakly correlated redundant feature is said to be a weakly correlated redundant feature if and only if the feature belongs to a subset of weakly correlated redundant features.
Definition 4: irrelevant features, a feature is said to be irrelevant if and only if it belongs to a subset of irrelevant features.
Step 2.1, calculating the correlation between the features and the class feature;
the commonly used measures of the correlation between features and the class feature are based on information entropy. Information entropy is defined by formula (1)
H(Y) = -∑_{y∈Y} p(y) log2 p(y)  (1)
Wherein, Y represents a random variable, y represents a possible value of Y, and p(y) represents the probability that Y takes the value y. The information entropy is the expectation of the information content over all possible events; its magnitude is related to the probabilities of the random variable: the smaller the probability of an event, the more information the event carries; the more probable an event, the less information it carries.
Conditional entropy is defined by formula (2)
H(Y|X) = -∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log2 p(y|x)  (2)
Where X and Y represent random variables, and H (Y | X) represents the information entropy of the random variable Y given the random variable X.
Mutual information is defined as formula (3)
I(X;Y)=H(Y)-H(Y|X) (3)
Wherein, the mutual information of the random variables X and Y equals the information entropy of Y minus the conditional entropy of Y given X. From another perspective, information entropy represents the uncertainty of a system, and conditional entropy represents the remaining uncertainty of the random variable Y once the random variable X is known; mutual information therefore represents the degree to which the uncertainty of the system is reduced. The greater the reduction in uncertainty, the more correlated the newly added variable is with the original variable.
However, mutual information is biased toward features with many distinct values. Overcoming this shortcoming of mutual information requires normalization to ensure comparability between features, which leads to the symmetric uncertainty.
The symmetric uncertainty is defined by formula (4).
SU(X;Y) = 2I(X;Y)/(H(X) + H(Y))  (4)
Wherein, I(X;Y) represents the mutual information between the random variables X and Y, and H(X) and H(Y) represent their respective information entropies. It can be seen that the symmetric uncertainty, like mutual information, represents the degree to which the uncertainty of the system is reduced.
Therefore, the symmetric uncertainty is used to measure the correlation between each feature and the class feature, and the features are then divided, according to the magnitude of this correlation, into three parts: a strongly correlated feature subset, a weakly correlated feature subset and an irrelevant feature subset.
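To make step 2.1 concrete, the following Python sketch implements formulas (1), (3) and (4) for discrete features and performs the three-way split. It is an illustration only: the function names and the thresholds t_strong and t_weak are assumptions introduced here, since the patent does not publish its threshold values.

```python
import math
from collections import Counter

def entropy(x):
    """Formula (1): H(X) = -sum p(x) log2 p(x) for a discrete sequence."""
    n = len(x)
    return -sum((c / n) * math.log2(c / n) for c in Counter(x).values())

def conditional_entropy(y, x):
    """H(Y|X): entropy of y averaged over the groups induced by x."""
    n = len(x)
    return sum((c / n) * entropy([yi for xi, yi in zip(x, y) if xi == xv])
               for xv, c in Counter(x).items())

def mutual_information(x, y):
    """Formula (3): I(X;Y) = H(Y) - H(Y|X)."""
    return entropy(y) - conditional_entropy(y, x)

def symmetric_uncertainty(x, y):
    """Formula (4): SU(X;Y) = 2 I(X;Y) / (H(X) + H(Y))."""
    hx, hy = entropy(x), entropy(y)
    return 0.0 if hx + hy == 0 else 2.0 * mutual_information(x, y) / (hx + hy)

def split_by_relevance(feature_cols, y, t_strong=0.5, t_weak=0.1):
    """Three-way split of step 2.1 into SR / WR / IR index lists by SU with
    the class; the threshold values are illustrative assumptions."""
    sr, wr, ir = [], [], []
    for j, col in enumerate(feature_cols):
        su = symmetric_uncertainty(col, y)
        (sr if su >= t_strong else wr if su >= t_weak else ir).append(j)
    return sr, wr, ir
```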
Step 2.2, calculating the redundancy among the characteristics;
interaction information is used to measure the redundancy between features. Conditional mutual information is defined by formula (5)
I(X;Y|Z)=H(X|Z)-H(X|Y,Z) (5)
Wherein the conditional mutual information indicates a degree of uncertainty reduction between the random variables X and Y when the random variable Z is introduced.
The interaction information is defined by formula (6)
I(X;Y;Z)=I(X;Y|Z)-I(X;Y) (6)
Where I(X;Y;Z) is used to measure the redundancy of the random variables X and Z with respect to Y. The interaction information may be positive, negative or zero. A positive value means that the addition of the random variable Z promotes the correlation between X and Y, i.e. Z provides more information than X and Y provide alone, so X and Z are not redundant; a negative value means that the addition of Z blocks the correlation between X and Y, so X and Z are redundant; a value of zero means that the addition of Z does not affect the correlation between X and Y.
Therefore, interaction information is used to distinguish the redundant features among the weakly correlated features obtained in step 2.1, and the weakly correlated features are finally further divided into two feature subsets, weakly correlated non-redundant and weakly correlated redundant, as sketched below.
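Continuing the sketch above (again illustrative: the helper joint and the choice to test each weakly correlated feature against the strongly correlated features are assumptions, as the patent does not spell out the reference set), formulas (5) and (6) and the sign test read:

```python
def joint(a, b):
    """Helper (not in the patent): encode two discrete sequences as one."""
    return list(zip(a, b))

def conditional_mutual_information(x, y, z):
    """Formula (5): I(X;Y|Z) = H(X|Z) - H(X|Y,Z)."""
    return conditional_entropy(x, z) - conditional_entropy(x, joint(y, z))

def interaction_information(x, y, z):
    """Formula (6): I(X;Y;Z) = I(X;Y|Z) - I(X;Y)."""
    return conditional_mutual_information(x, y, z) - mutual_information(x, y)

def is_redundant(candidate, reference, y):
    """Negative interaction information: the reference feature blocks the
    correlation between the candidate and the class, i.e. they are redundant."""
    return interaction_information(candidate, y, reference) < 0
```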
Step 3, genetic algorithm. As shown in fig. 3, with the parameters set as in Table 3, the feature classification results of step 2 are used to guide the initialization of the genetic algorithm population, and iteration is then performed to achieve feature selection.
Step 3.1, encoding. A genetic algorithm cannot directly process the parameters of the problem space; the specific problem must be converted into an individual, or chromosome, composed according to a certain structure. Commonly used encodings include binary encoding, floating-point encoding and character encoding. Binary encoding is currently the most common scheme in genetic algorithms: {0,1} represents the state of a feature, where 0 means the feature is not selected and 1 means it is selected. The present invention uses binary encoding.
Step 3.2, initialization. Initializing the population is the process of generating a set of possible solutions consisting of individuals (chromosomes), and the final result of a genetic algorithm depends to a large extent on the initial population. Unlike the completely random initialization strategy, the two-step filter method is used to divide the features into the four parts strong correlation (SR), weak correlation non-redundancy (WRNR), weak correlation redundancy (WRR) and irrelevance (IR), so that the SR and WRNR bits have a larger probability of being initialized to 1 and the WRR and IR bits have a smaller probability of being initialized to 1.
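A minimal sketch of this biased initialization follows. The probabilities p_high and p_low are illustrative assumptions; the patent states only that the SR and WRNR bits are set to 1 with a larger probability than the WRR and IR bits.

```python
import random

def initialize_population(pop_size, sr, wrnr, wrr, ir, p_high=0.8, p_low=0.2):
    """Biased binary initialization guided by the two-step filter: sr, wrnr,
    wrr and ir are the index lists produced by the feature classification."""
    n = len(sr) + len(wrnr) + len(wrr) + len(ir)
    population = []
    for _ in range(pop_size):
        chrom = [0] * n
        for j in sr + wrnr:          # relevant parts: likely selected
            chrom[j] = 1 if random.random() < p_high else 0
        for j in wrr + ir:           # redundant/irrelevant: rarely selected
            chrom[j] = 1 if random.random() < p_low else 0
        population.append(chrom)
    return population
```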
Step 3.3, fitness calculation. The fitness function of a genetic algorithm evaluates how well the individuals in the population adapt to the environment; in other words, it is the index by which the quality of an individual is judged, and it is also the most important basis for the selection operator. Herein, the classification accuracy of the SVM and NB classifiers is used as the individual fitness.
Step 3.4, selection. The selection operator of a genetic algorithm simulates the rule of "natural selection, survival of the fittest". The greater an individual's fitness, the greater its probability of being selected; the smaller its fitness, the smaller that probability. Common selection methods are roulette-wheel selection and tournament selection. The tournament selection strategy takes a certain number of individuals from the population at a time and then selects the best two of them into the offspring population, repeating this operation until the offspring population reaches the original size. The method adopts an improved layered selection operator that divides the individuals into three parts, namely high, medium and low fitness, according to their fitness; each part is selected with a tournament. In this way the diversity of the population can be maintained to a certain extent.
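A sketch of the layered selection operator as described; splitting the ranking into equal thirds and the tournament size k are assumptions introduced here:

```python
def layered_selection(population, fit, k=4):
    """Rank by fitness, split into high / medium / low layers, and run
    tournaments inside each layer, keeping the best two per tournament."""
    ranked = sorted(population, key=fit, reverse=True)
    third = max(1, len(ranked) // 3)
    layers = [ranked[:third], ranked[third:2 * third], ranked[2 * third:]]
    offspring = []
    while len(offspring) < len(population):
        for layer in (l for l in layers if l):
            contestants = random.sample(layer, min(k, len(layer)))
            offspring.extend(sorted(contestants, key=fit, reverse=True)[:2])
    return offspring[:len(population)]
```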
Step 3.5, crossover. The crossover operator in a genetic algorithm simulates the rule of "gene recombination" in nature and plays the core role in the algorithm. The crossover operator generates two new individuals by randomly exchanging part of the genes of two individuals. Common crossover methods are single-point crossover, multi-point crossover and uniform crossover. Herein an improved crossover operator is used: since each individual (chromosome) in the population of the proposed algorithm consists of the four segments SR, WRNR, WRR and IR, the crossover operation randomly exchanges one of the four segments between the two parent chromosomes.
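The segment-exchange crossover can be sketched directly, with segments being the four index lists [sr, wrnr, wrr, ir] produced by the two-step filter:

```python
def segment_crossover(parent_a, parent_b, segments):
    """Randomly pick one of the four segments (SR, WRNR, WRR, IR) and swap
    its bits between the two parent chromosomes."""
    child_a, child_b = parent_a[:], parent_b[:]
    for j in random.choice(segments):
        child_a[j], child_b[j] = parent_b[j], parent_a[j]
    return child_a, child_b
```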
Step 3.6, mutation. The mutation operator of a genetic algorithm simulates the rule of "gene mutation" in nature; mutation refers to changing certain genes in an individual (chromosome). A genetic algorithm introduces mutation for two purposes: first, to give the algorithm local random search capability; second, to maintain the diversity of the population and thereby prevent premature convergence. Uniform mutation is a commonly used mutation operator and is the one used herein.
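A sketch of uniform mutation; interpreting the mutation probability of Table 3 as an independent per-bit flip rate is an assumption:

```python
def uniform_mutation(chrom, p_m=0.1):
    """Flip each bit independently with probability p_m (0.1 in Table 3)."""
    return [1 - g if random.random() < p_m else g for g in chrom]
```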
Step 3.7, elite retention strategy. If the best individuals of each generation are lost in the next generation, the genetic algorithm cannot converge to the globally optimal solution; elite retention is therefore adopted. Herein, the two best individuals of each generation are added to the elite population.
Step 3.8, termination condition. The genetic algorithm terminates when the fitness of the best individual reaches a given threshold or the number of iterations reaches a preset number.
Step 4, evaluating results;
Experiments are performed with 10-fold cross validation, with the SVM and NB respectively serving as the classifier, and the finally obtained feature subset is evaluated by its classification accuracy and the number of selected features. k-fold cross validation is used to evaluate the predictive performance of a model: the original data are divided evenly into k groups, each subset in turn serves as the validation set while the remaining k-1 groups serve as the training set, yielding k models, and the average classification accuracy (Accuracy) over the k validation sets is taken as the final result. k-fold cross validation effectively avoids both overfitting and underfitting, so the final result is convincing.
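A sketch of the fitness/evaluation computation with scikit-learn's 10-fold cross validation. The classifier hyperparameters, and the use of GaussianNB for the NB classifier, are assumptions, since the patent does not specify them; X is assumed to be a NumPy array.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def fitness(chrom, X, y, classifier="svm"):
    """Mean 10-fold cross-validated accuracy of the feature subset encoded by
    the binary chromosome, with SVM or NB as the classifier (steps 3.3 / 4)."""
    selected = [j for j, g in enumerate(chrom) if g == 1]
    if not selected:
        return 0.0                     # empty subsets get the worst fitness
    model = SVC() if classifier == "svm" else GaussianNB()
    return cross_val_score(model, X[:, selected], y, cv=10,
                           scoring="accuracy").mean()
```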
The accuracy is the most common classifier evaluation index, as shown in formula (7)
Accuracy = (TP + TN)/(TP + TN + FP + FN)  (7)
Wherein,
TP (true positive) represents the number of instances that are actually positive and are classified as positive by the classifier;
TN (true negative) represents the number of instances that are actually negative and are classified as negative by the classifier;
FP (false positive) represents the number of instances that are actually negative but are classified as positive by the classifier;
FN (false negative) represents the number of instances that are actually positive but are classified as negative by the classifier.
The pseudo code of the feature selection algorithm based on the combination of the two-step filtering and the genetic algorithm in the present invention is shown in table 1:
TABLE 1
[Table 1: pseudo code of the TFCGA algorithm, reproduced only as an image in the original publication]
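Since the pseudo code itself is only available as an image, the following is a rough reconstruction of the TFCGA main loop assembled from the description above and the Table 3 parameters. It is a sketch built on the helper functions sketched earlier (no fitness caching, and the fitness-threshold termination branch is omitted), not the patent's verbatim algorithm; testing each weakly correlated feature against the strongly correlated ones for redundancy is likewise an assumption.

```python
def tfcga(X, y, pop_size=60, generations=30, p_c=0.7, p_m=0.1):
    """TFCGA sketch: two-step filtering guides GA initialization (steps 1-4).
    Default parameter values follow Table 3."""
    cols = [X[:, j] for j in range(X.shape[1])]
    sr, wr, ir = split_by_relevance(cols, y)                  # step 2.1
    wrnr = [j for j in wr
            if not any(is_redundant(cols[j], cols[s], y) for s in sr)]
    wrr = [j for j in wr if j not in wrnr]                    # step 2.2
    pop = initialize_population(pop_size, sr, wrnr, wrr, ir)  # step 3.2
    fit = lambda c: fitness(c, X, y)                          # step 3.3
    elite = []
    for _ in range(generations):
        parents = layered_selection(pop, fit)                 # step 3.4
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            if random.random() < p_c:                         # step 3.5
                a, b = segment_crossover(a, b, [sr, wrnr, wrr, ir])
            nxt += [uniform_mutation(a, p_m),                 # step 3.6
                    uniform_mutation(b, p_m)]
        elite += sorted(pop, key=fit, reverse=True)[:2]       # step 3.7
        pop = nxt
    best = max(pop + elite, key=fit)
    return [j for j, g in enumerate(best) if g == 1]
```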
Evaluation of the Performance of the present invention:
to verify the effectiveness of the present invention, several advanced algorithms (including MRMR, IWFS, CFR, IGDGA, GGA) were used to compare 9 sets of numbers from UCI machine learning retrieval. The details of the experimental data set are shown in table 2. Through experiments, the algorithm (TFCGA) based on the combination of the two-step filtration and the genetic algorithm provided by the invention selects fewer features and obtains higher classification accuracy.
Table 2: UCI dataset description
[Table 2 is reproduced only as an image in the original publication]
Table 3: parameter setting of genetic algorithm
Parameter Value
Population size
60
Number of generations 30
Crossover probability 0.7
Mutation probablity 0.1
Table 4: accuracy of SVM classifier as classification under fitness (%)
Figure BDA0002692612210000132
Figure BDA0002692612210000141
As can be seen from Table 4: the TFCGA method proposed by the invention has the highest overall accuracy when the SVM classifier is used as the fitness measure.
Table 5: Accuracy (%) of classification with the NB classifier as the fitness measure

Dataset                  MRMR   IWFS   CFR    IGDGA  GGA    TFCGA
Musk1                    76.24  71.41  78.13  81.31  81.50  81.37
Musk2                    78.25  74.31  73.83  79.21  79.83  80.12
Arrhythmia               59.80  57.04  56.91  64.36  64.71  65.71
Colon cancer             74.29  59.76  71.43  79.21  75.08  75.95
Multiple Features        94.80  94.25  94.50  95.12  96.32  96.32
Libras movement          52.67  44.00  49.00  57.81  64.26  63.56
Urban land cover         68.26  71.61  68.30  72.72  79.21  79.22
Semeion handwritten      72.64  73.18  73.27  81.99  83.36  82.88
qsar_androgen_receptor   77.14  76.01  77.13  79.92  79.84  80.06
As can be seen from Table 5: the TFCGA method proposed by the invention has the highest overall accuracy when the NB classifier is used as the fitness measure.
Table 6: feature quantity of optimal feature subset under fitness measurement index of SVM classifier
Figure BDA0002692612210000142
Figure BDA0002692612210000151
As can be seen from Table 6: with the SVM classifier as the fitness measure, the number of features in the optimal feature subset obtained by the TFCGA method is the smallest overall.
As can be seen in fig. 4: after two-step filtering, the features are classified into four parts, and the figure shows the specific number of features in each of the strongly correlated, weakly correlated non-redundant, weakly correlated redundant and irrelevant feature subsets.
From fig. 5 it can be seen that: after the feature classification results of the two-step filtering are used to guide the initialization of the genetic algorithm population, the uncertainty of the individuals in the population is reduced.
As can be seen in fig. 6: with the SVM classifier as the fitness measure, the number of features in the optimal feature subset obtained by the proposed TFCGA method is the smallest overall.
As can be seen in fig. 7: with the NB classifier as the fitness measure, the number of features in the optimal feature subset obtained by the TFCGA method is the smallest overall.

Claims (7)

1. A feature selection method based on genetic algorithm is characterized by comprising the following steps:
Step 1: data preprocessing, namely performing equal-width discretization on continuous data, filling missing values with the mean of the corresponding attribute, and handling outliers with boxplot analysis;
Step 2: feature classification, namely dividing the features, according to information entropy, into four feature subsets: strongly correlated, weakly correlated non-redundant, weakly correlated redundant and irrelevant;
Step 3: genetic algorithm, namely using the classification result obtained in step 2 to guide the initialization of the genetic algorithm population, and then iterating to achieve feature selection;
Step 4: evaluation of the results.
2. The feature selection method based on a genetic algorithm according to claim 1, wherein the strongly correlated features in step 2 are defined as follows: a feature is called a strongly correlated feature if and only if it belongs to the strongly correlated feature subset;
a weakly correlated non-redundant feature is defined as follows: a feature is called a weakly correlated non-redundant feature if and only if it belongs to the weakly correlated non-redundant feature subset;
a weakly correlated redundant feature is defined as follows: a feature is called a weakly correlated redundant feature if and only if it belongs to the weakly correlated redundant feature subset;
an irrelevant feature is defined as follows: a feature is called an irrelevant feature if and only if it belongs to the irrelevant feature subset.
3. The method for selecting features based on genetic algorithm as claimed in claim 2, wherein the step 2 is implemented by the following steps:
Step 2.1, calculating the correlation between the features and the class feature: the symmetric uncertainty is used to measure the correlation between each feature and the class feature, and the features are divided into a strongly correlated feature subset, a weakly correlated feature subset and an irrelevant feature subset according to this correlation;
Step 2.2, calculating the redundancy among features: interaction information is used to distinguish the redundant features among the weakly correlated features obtained in step 2.1, and the weakly correlated features are finally further divided into two feature subsets, weakly correlated non-redundant and weakly correlated redundant.
4. The method for selecting features based on genetic algorithm as claimed in claim 3, wherein the step 3 is implemented by the following steps:
Step 3.1, encoding: binary encoding is adopted;
Step 3.2, initialization: the features are divided by the two-step filter method into four parts, namely strong correlation (SR), weak correlation non-redundancy (WRNR), weak correlation redundancy (WRR) and irrelevance (IR), so that the SR and WRNR parts have a larger probability of being set to 1 and the WRR and IR parts have a smaller probability of being set to 1;
Step 3.3, fitness calculation: the classification accuracy of the SVM classifier and the NB classifier is used as the individual fitness;
Step 3.4, selection: an improved layered selection operator divides the individuals into three parts, namely high, medium and low fitness, according to fitness, and each part is selected by tournament selection;
Step 3.5, crossover: an improved crossover operator is used; because each individual or chromosome in the population of the proposed algorithm consists of the four parts of strong correlation, weak correlation non-redundancy, weak correlation redundancy and irrelevance, the crossover operation randomly exchanges one of the four parts between two parent chromosomes;
Step 3.6, mutation: the mutation operator uses uniform mutation;
Step 3.7, elite retention strategy: the two best individuals of each generation are added to an elite population;
Step 3.8, termination condition: the algorithm terminates when the fitness of the best individual reaches a given threshold or the number of iterations reaches a preset number.
5. The feature selection method based on a genetic algorithm according to claim 4, wherein step 4 is specifically implemented as follows: experiments are performed with 10-fold cross validation, with the SVM and NB respectively serving as the classifier, and the finally obtained feature subset is evaluated by its classification accuracy and the number of selected features.
6. The feature selection method based on a genetic algorithm according to claim 3, wherein the symmetric uncertainty of step 2.1 is defined by formula (4):
SU(X;Y) = 2I(X;Y)/(H(X) + H(Y))  (4)
wherein, I (X; Y) represents mutual information between random variables X and Y, and H (X) and H (Y) respectively represent information entropy.
7. The feature selection method based on a genetic algorithm according to claim 3, wherein the interaction information of step 2.2 is defined by formula (6):
I(X;Y;Z)=I(X;Y|Z)-I(X;Y) (6)
where I (X; Y; Z) is used to measure the redundancy of the random variables X and Z with respect to Y.
CN202010996242.6A 2020-09-21 2020-09-21 Feature selection method based on genetic algorithm Pending CN112183598A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010996242.6A CN112183598A (en) 2020-09-21 2020-09-21 Feature selection method based on genetic algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010996242.6A CN112183598A (en) 2020-09-21 2020-09-21 Feature selection method based on genetic algorithm

Publications (1)

Publication Number Publication Date
CN112183598A true CN112183598A (en) 2021-01-05

Family

ID=73955653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010996242.6A Pending CN112183598A (en) 2020-09-21 2020-09-21 Feature selection method based on genetic algorithm

Country Status (1)

Country Link
CN (1) CN112183598A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362920A (en) * 2021-06-15 2021-09-07 电子科技大学 Feature selection method and device based on clinical data
CN113362920B (en) * 2021-06-15 2023-04-18 电子科技大学 Feature selection method and device based on clinical data
CN114970709A (en) * 2022-05-20 2022-08-30 武汉理工大学 Data-driven AHU multi-fault diagnosis feature selection method based on improved GA

Similar Documents

Publication Publication Date Title
Estévez et al. Normalized mutual information feature selection
JP2018181290A (en) Filter type feature selection algorithm based on improved information measurement and ga
CN108509996A (en) Feature selection approach based on Filter and Wrapper selection algorithms
Rostami et al. A clustering based genetic algorithm for feature selection
CN112183598A (en) Feature selection method based on genetic algorithm
CN101324926A (en) Method for selecting characteristic facing to complicated mode classification
Sandin et al. Aggressive and effective feature selection using genetic programming
Dara et al. A binary PSO feature selection algorithm for gene expression data
CN106951728B (en) Tumor key gene identification method based on particle swarm optimization and scoring criterion
CN111209939A (en) SVM classification prediction method with intelligent parameter optimization module
Yang et al. Feature selection using memetic algorithms
CN114971243A (en) FNN (false negative number) countermeasure generation-based dioxin emission risk early warning model construction method
CN108737429B (en) Network intrusion detection method
CN115249054A (en) Improved hybrid multi-target particle swarm optimization feature selection algorithm
CN117892209A (en) Oversampling method based on support vector machine and evolutionary computation
CN117272025A (en) High-dimensional data feature selection method based on fuzzy competition particle swarm multi-objective optimization
CN111813669A (en) Adaptive random test case generation method based on multi-target group intelligence
CN113780334B (en) High-dimensional data classification method based on two-stage mixed feature selection
Sagawa et al. Learning variable importance to guide recombination
Cai et al. Fuzzy criteria in multi-objective feature selection for unsupervised learning
CN114334168A (en) Feature selection algorithm of particle swarm hybrid optimization combined with collaborative learning strategy
Sagawa et al. Learning variable importance to guide recombination on many-objective optimization
Murthy Genetic Algorithms: Basic principles and applications
Vignolo et al. Evolutionary local improvement on genetic algorithms for feature selection
Alzubaidi et al. A new hybrid global optimization approach for selecting clinical and biological features that are relevant to the effective diagnosis of ovarian cancer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210105)