CN115249054A - Improved hybrid multi-target particle swarm optimization feature selection algorithm - Google Patents


Info

Publication number: CN115249054A
Application number: CN202210202109.8A
Authority: CN (China)
Prior art keywords: particle, algorithm, feature, population
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 潘晓英, 孙俊
Current Assignee: Xi'an University of Posts and Telecommunications
Original Assignee: Xi'an University of Posts and Telecommunications
Application filed by Xi'an University of Posts and Telecommunications
Priority to CN202210202109.8A priority Critical patent/CN115249054A/en
Publication of CN115249054A publication Critical patent/CN115249054A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 20/00: Machine learning

Abstract

The invention discloses an improved hybrid multi-objective particle swarm optimization feature selection algorithm (HIMOPSO), belonging to the field of machine learning. The core idea is as follows. In the first stage, feature relevance is computed with the probability-distance-based Fisher score algorithm and the mutual-information-based MIC algorithm, and the features are ranked and screened to obtain two feature subsets; the two subsets are then combined by intersection. In the second stage, a multi-objective particle swarm algorithm is used as the search algorithm for feature selection, with a particle initialization method that balances population diversity and prior knowledge; the parameters of the multi-objective particle swarm algorithm are adjusted nonlinearly with the iteration count so that they suit each phase of the iterative process; and new particles are generated by explosion to increase population diversity, so that the particles can explore more promising regions and the overall population quality improves. The proposed feature selection method can obtain the optimal subset, which benefits subsequent classification learning on the data.

Description

Improved hybrid multi-target particle swarm optimization feature selection algorithm
Technical Field
The invention belongs to the field of machine learning, and particularly relates to an improved hybrid multi-objective particle swarm optimization feature selection (HIMOPSO) algorithm.
Background
With the improvement of data collection technology in modern society, the objects of data analysis generated in various fields are becoming more complex, and the dimensionality of data has increased greatly, as in text analysis, bioinformatics, and gene microarrays. In the biomedical field in particular, the demand for data mining grows daily, while the vigorous development of bioinformatics has greatly expanded the dimensionality of biomedical data, producing high-dimensional data; microarray datasets and gene expression profiles are typical high-dimensional datasets.
Data by itself does not generate value; value comes only from discovering the useful information and knowledge it contains, i.e., from data mining. Classification is currently the most common and widespread task in data mining and has become an effective means of data processing. However, as data dimensionality continues to grow to very high values, mining such data becomes increasingly difficult. Effective classification learning on high-dimensional small-sample data is therefore a hard problem in machine learning; in practice, effective feature selection on medical data enables classification learning and benefits the diagnosis and treatment of diseases.
At present, research on feature selection mainly focuses on optimized search with heuristic algorithms, among which the particle swarm algorithm is widely used: it converges quickly, has few parameters to adjust, and scales well. However, like all heuristic searches, particle swarm optimization is easily trapped in local optima, and it handles high-dimensional data poorly because the space searched by the heuristic is too large. The invention therefore provides an improved hybrid multi-objective particle swarm optimization feature selection algorithm to address these shortcomings of conventional feature selection methods.
Disclosure of Invention
The invention aims to provide an improved hybrid multi-objective particle swarm optimization feature selection algorithm that maintains population diversity, enhances population quality, and gives the particles a direction for effective search.
In order to achieve the purpose, the specific technical scheme of the invention is as follows:
An improved hybrid multi-objective particle swarm optimization (HIMOPSO) feature selection method, whose specific scheme is as follows:
In the first stage, the coarse-grained feature selection stage, a filtering algorithm mixing the Fisher score and MIC is first used to optimize the search space: the feature subsets obtained by the two filters are combined by union, removing a large number of irrelevant and weakly relevant features while retaining the important ones, so that the search space for fine-grained feature selection is optimized and a coarse-grained feature subset is generated.
In the second stage, the fine-grained feature selection stage, feature selection is performed with a multi-objective particle swarm optimization algorithm combined with explosive particles, which alleviates the particle oscillation problem and escapes local optima to obtain an optimal feature subset.
Particle encoding: the algorithm uses real-number encoding; each particle in the population is represented by a real-valued vector that encodes a feature set, i.e., a candidate solution. A component greater than a set real-number threshold (preset to 0.5) is coded 1, indicating that the corresponding feature is selected; otherwise the feature is not selected.
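As an illustrative sketch of this encoding (function name and use of NumPy are my own, not taken from the patent), the real-valued particle can be decoded into a binary feature mask like this:

```python
import numpy as np

def decode_particle(position, threshold=0.5):
    """Decode a real-valued particle into a binary feature mask.

    A component strictly greater than the threshold (preset to 0.5 in
    the patent) is coded 1, meaning the corresponding feature is
    selected; all other features are left unselected.
    """
    return (np.asarray(position) > threshold).astype(int)
```

For example, a particle `[0.9, 0.2, 0.51, 0.5]` selects the first and third features only, since the comparison is strict.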
Population initialization: the total number of particles in the population is N, and the first part has size N1, accounting for 50% of the population. The K features retained by coarse-grained selection carry prior knowledge of importance, so the particle swarm can be initialized using the existing feature ranking: the feature list is divided into three segments covering 50%, 30% and 20% of the features, and 30%, 10% and 5% of the features of each segment, respectively, are randomly selected for position initialization. The second part has size N2, the other 50% of the population, and its particles are initialized completely at random.
Nonlinear parameter adjustment based on the iteration count: choosing appropriate parameter settings is particularly important. So that the later search can still explore with a certain depth, a nonlinear inertia weight adjustment is adopted, making the value large early and small later; this alleviates the particle oscillation problem to a certain extent and benefits the search. For particle search, a larger c1 and a smaller c2 are used in the initial stage, and a smaller c1 and a larger c2 in the later stage. Unlike strategies based on a sine function, particles at different iteration counts use different nonlinear calculation modes for c1 and c2: c1 decreases nonlinearly and c2 increases nonlinearly.
Offspring generation strategy based on explosive particles: updating particles only in the traditional way leaves them disoriented by particle oscillation, so a particle-explosion generation strategy produces promising particles, compensating for the shortcomings of traditional particle updating and improving convergence accuracy. By exploiting mutual information sharing among particles, a distributed information-sharing mechanism is provided: generating offspring increases the diversity of particles in the population, improves population quality, and lets the solution region be explored more effectively, so that important features are found.
Particle selection strategy: an offspring particle swarm is obtained by applying the offspring generation strategy to the updated original particles; the original and offspring particles are mixed into candidate particles, from which the particles used in the next iteration are selected. When screening the mixed particles, an elite retention strategy keeps the better particles, selecting a number of well-performing particles to guarantee the quality of the selection and thus effectively preserve the algorithm's diversity during the search.
Fitness function setting: the feature selection problem is to select a small number of relevant features that achieve classification performance similar to, or even better than, using all features. Feature selection therefore considers two main conflicting objectives, i.e., the optimization objective functions: classification accuracy and the number of features.
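The two conflicting objectives can be sketched as a bi-objective fitness to be minimized (a minimal sketch; `error_rate_fn` is a hypothetical callback standing in for training and evaluating a classifier on the selected features, and is not specified by the patent):

```python
import numpy as np

def fitness(mask, error_rate_fn):
    """Bi-objective fitness: (classification error rate, feature ratio).

    Both objectives are minimized; minimizing the error rate maximizes
    classification accuracy, and the feature ratio penalizes large
    subsets. An empty subset is assigned worst-case values.
    """
    mask = np.asarray(mask)
    n_selected = int(mask.sum())
    if n_selected == 0:
        return (1.0, 1.0)
    return (error_rate_fn(mask), n_selected / mask.size)
```

A Pareto-based search such as the one described here then compares particles on these two values rather than on a single weighted score.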
External archive update: with the dominance-based external archive update strategy, the external archive must be updated after each iteration; dominated solutions are deleted from the archive according to the dominance/non-dominance relation, and non-dominated solutions are added to it.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses a hybrid filtering algorithm to remove features quickly, optimizing the search space of the next stage and providing prior knowledge. In the second stage, a multi-objective particle swarm algorithm is chosen as the base algorithm, with a hybrid initialization strategy, nonlinear parameter updates, and explosive particles generated during iteration; this maintains population diversity, enhances population quality, and gives the particles a direction for effective search.
2. The invention was tested on 6 public medical datasets; the selected feature subsets achieve good classification results in subsequent learning, and the selected data can be analyzed further by professionals, benefiting the diagnosis and treatment of diseases.
Drawings
FIG. 1 is a diagram of the underlying information of the data set used;
FIG. 2 is a flow chart of an improved hybrid multi-objective particle swarm optimization feature selection algorithm;
FIG. 3 is a graph of the iteration results of the improved hybrid multi-objective particle swarm optimization feature selection method on the 6 datasets;
FIG. 4 is a comparison of results of different feature selection methods;
FIG. 5 is a graph comparing ROC curves for different feature selection methods.
Detailed Description
The invention is further described with reference to the accompanying figures, taking as an example a typical CNS data set. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
An improved hybrid multi-objective particle swarm optimization feature selection algorithm comprises the following steps:
Step 1: prepare the dataset, obtained from the Broad Institute cancer program, with 7129 features and 60 samples; it has the characteristics of high-dimensional small-sample data and is a binary-classification dataset.
Step 2: data preprocessing: check whether the dataset has missing values and, if so, handle them by substitution or imputation;
Step 3: use the probability-distance-based Fisher score to compute feature relevance and rank the features; the larger the value, the stronger the feature's class relevance. This generates feature subset D1.
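A minimal sketch of this step (my own implementation of the standard Fisher score, matching the formula in claim 2: between-class scatter of the per-class feature means over the pooled within-class variance):

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score per feature: sum_i n_i * (mu_ki - mu_k)^2 divided by
    sum_i n_i * var_ki, where i ranges over classes; larger scores mean
    stronger class relevance."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)  # small epsilon guards constant features
```

Subset D1 is then obtained by keeping the top-ranked features under this score.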
Step 4: compute the correlation between features and classes using the mutual-information-based maximal information coefficient (MIC), comparing the maximal mutual information values obtained from different grids; denote the filtered feature subset as D2.
Step 5: combine feature subsets D1 and D2 by union to generate the coarse-grained feature subset D.
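Steps 3 through 5 can be sketched as follows (the cutoff `k` for how many top-ranked features each filter keeps is my assumption; the patent does not state it):

```python
def coarse_grained_subset(fisher_scores, mic_scores, k):
    """Keep the top-k feature indices under each filter score and take
    the union D = D1 ∪ D2, as in steps 3-5."""
    def top_k(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:k])

    d1 = top_k(fisher_scores)   # Fisher-score ranking
    d2 = top_k(mic_scores)      # MIC ranking
    return sorted(d1 | d2)      # coarse-grained subset D
```

The union keeps any feature that at least one filter considers relevant, which matches the description of removing irrelevant features while retaining important ones.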
Step 6: initialize the particle swarm parameters; set the population size to 100 and the number of iterations to 100.
Step 7: initialize the population; the total number of particles is 100, divided into two parts. The first part has 50 particles: count the features in feature subset D, divide the feature list into three segments covering 50%, 30% and 20% of the features, then randomly select 30%, 10% and 5% of the features of each segment, respectively, for position initialization. The second part also has 50 particles, whose positions are initialized randomly.
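A sketch of the prior-knowledge half of this initialization (the exact real values written into selected vs. unselected positions are my assumption; the patent only requires them to fall above or below the 0.5 threshold):

```python
import random

def init_position(ranked_features, n_features):
    """Initialize one prior-knowledge particle: split the ranked feature
    list into 50%/30%/20% segments and randomly select 30%/10%/5% of
    each segment's features, respectively, setting those positions
    above the 0.5 selection threshold."""
    # Base values in [0, 0.5): every feature starts unselected.
    pos = [0.5 * random.random() for _ in range(n_features)]
    bounds = [0.0, 0.5, 0.8, 1.0]          # segment boundaries (fractions)
    fractions = [0.30, 0.10, 0.05]         # fraction selected per segment
    n = len(ranked_features)
    for lo, hi, frac in zip(bounds, bounds[1:], fractions):
        segment = ranked_features[int(lo * n):int(hi * n)]
        for f in random.sample(segment, max(1, int(frac * len(segment)))):
            pos[f] = 1.0 - 0.5 * random.random()   # value in (0.5, 1.0]
    return pos
```

The remaining 50 particles of the population are simply drawn uniformly at random over all positions.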
Step 8: initialize an external archive to store the non-dominated solutions.
Step 9: loop over each particle, adjusting the parameters of the particle swarm:
Inertia weight update: nonlinear inertia weight adjustment is adopted; formula (1) is used for iterations 1-60 and formula (2) for iterations 60-100.
[Formulas (1) and (2), defining w_e and w_l, are rendered only as images in the original publication.]
The early inertia weight is denoted w_e; the weight used in the middle and later stages is denoted w_l and tends toward more local convergence. w_s denotes the initial value of w, w_f the final value of w, t_iter the current iteration count, and T_max the total number of iterations.
Learning factor update (c1, c2): c1 decreases nonlinearly and c2 increases nonlinearly; formula (3) is used for iterations 1-60 and formula (4) for iterations 60-100.
[Formulas (3) and (4), defining the early and later-stage learning factors, are rendered only as images in the original publication.]
The early individual and social learning factors are denoted c_1e and c_2e; those used in the middle and later stages are denoted c_1l and c_2l. c_1max and c_1min denote the maximum and minimum of c1, and c_2max and c_2min the maximum and minimum of c2; t_iter is the current iteration count and T_max the total number of iterations.
Step 10: update the position and velocity of the particles.
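This is the standard PSO update; the sketch below clamps positions to [0, 1] to keep them in the real-coded range (the clamping choice is my assumption):

```python
import numpy as np

def update_particle(pos, vel, pbest, gbest, w, c1, c2, rng):
    """Standard PSO update: the new velocity blends inertia (w), the
    cognitive pull toward the personal best (c1) and the social pull
    toward the global/leader best (c2), each scaled by fresh uniform
    random vectors; the position is then shifted by the velocity."""
    r1, r2 = rng.random(len(pos)), rng.random(len(pos))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)
    return pos, vel
```

In a multi-objective setting, `gbest` is typically a leader drawn from the external archive rather than a single global best.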
Step 11: perform particle explosion once a certain iteration count is reached.
First, the explosion range A_i of particle i is determined from equation (5):

A_i = Â · (f(x_i) − y_min + ε) / ( Σ_j (f(x_j) − y_min) + ε )    (5)

where Â denotes the maximum vibration amplitude, f(x_i) is the fitness evaluation of the current individual, y_min denotes the best particle fitness in the current population, and ε is a constant that avoids division errors.
The offspring positions are then determined until offspring generation is complete; essentially, a shift operation is applied to each position component of the particle, adding a random number drawn from [0, A_i] to update the position of the new particle.
A mutation operation is applied to the particles to improve their diversity; under the mutation operator, the particle is multiplied by a random number following a Gaussian distribution.
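Step 11 can be sketched as follows (a fireworks-style offspring generator; the clamping to [0, 1] and the N(1, 1) parameters of the Gaussian mutation are my assumptions, since the patent does not give them):

```python
import numpy as np

def explode(pop, fitness, a_max=0.4, eps=1e-12, rng=None):
    """Explosion-based offspring generation: each particle's amplitude
    A_i (equation (5)) shrinks as its fitness approaches the population
    best, so good particles search locally and poor particles search
    widely; each offspring is the parent shifted by random amounts in
    [0, A_i] and then multiplied by Gaussian mutation noise."""
    rng = rng or np.random.default_rng()
    fitness = np.asarray(fitness, dtype=float)
    y_min = fitness.min()
    amp = a_max * (fitness - y_min + eps) / ((fitness - y_min).sum() + eps)
    children = []
    for x, a in zip(pop, amp):
        child = np.clip(x + rng.uniform(0.0, a, size=len(x)), 0.0, 1.0)
        child = np.clip(child * rng.normal(1.0, 1.0, size=len(x)), 0.0, 1.0)
        children.append(child)
    return children
```

The offspring are then mixed with the originals in step 12.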
Step 12: mix the new particles generated by explosion with the original particles; when screening the mixed particles, use an elite retention strategy to keep the better particles, selecting a number of well-performing particles to guarantee the quality of the selection and effectively preserve the algorithm's diversity during the search.
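A scalarized sketch of the elite retention step (the patent's setting is multi-objective, where "better" would be judged by Pareto rank rather than the single score assumed here):

```python
def elite_select(candidates, scores, n):
    """Elite retention: merge originals and offspring, rank by score
    (lower is better in this sketch) and keep the best n particles for
    the next iteration."""
    order = sorted(range(len(candidates)), key=lambda i: scores[i])
    return [candidates[i] for i in order[:n]]
```

Keeping only the top n of the mixed pool guarantees the quality of the surviving particles while the explosion offspring supply diversity.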
Step 13: compute each particle's fitness function and update the particle archive accordingly, deleting dominated solutions from the external archive and adding non-dominated solutions to it.
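The dominance-based archive update can be sketched directly from its definition (fitness vectors are minimized, as in the bi-objective error-rate/feature-count setting):

```python
def dominates(a, b):
    """Pareto dominance for minimization: a dominates b if it is no
    worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_archive(archive, candidate):
    """Drop archived solutions dominated by the candidate and add the
    candidate unless something in the archive dominates it."""
    if any(dominates(f, candidate) for f in archive):
        return archive
    return [f for f in archive if not dominates(candidate, f)] + [candidate]
```

Applied after every iteration, this keeps the external archive equal to the non-dominated front of everything evaluated so far.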
Step 14: check whether the iteration limit has been reached; if so, output the feature subset and end the algorithm. Otherwise, return to step 9.

Claims (3)

1. An improved hybrid multi-objective particle swarm optimization feature selection algorithm is characterized by comprising the following steps:
the method comprises the steps that coarse-grained feature selection is conducted in the first stage, correlation of features is calculated through a fisher score algorithm based on probability distance and an MIC algorithm based on mutual information, and the features are sorted and screened out according to the correlation to obtain two feature subsets; then, performing intersection processing on the two feature subsets to serve as prior knowledge of the next stage;
and in the second stage, fine-grained feature selection is carried out, and feature selection is carried out by utilizing a multi-target particle swarm optimization algorithm combined with explosive particles, so that the problem of particle oscillation is solved, and local optimization is skipped to obtain an optimal feature subset.
2. The improved hybrid multi-objective particle swarm optimization feature selection algorithm according to claim 1, wherein a fisher score algorithm based on probability distance and an MIC algorithm based on mutual information are adopted to calculate the correlation of features, and the features are sorted and screened out respectively according to the correlation to obtain two feature subsets; then, carrying out intersection processing on the two feature subsets to serve as prior knowledge of the next stage; wherein the Fisher Score is calculated as formula (1), and the MIC is calculated as formula (2);
F(f_k) = Σ_i n_i (μ_k,i − μ_k)² / Σ_i n_i σ²_k,i    (1)
wherein n_i is the number of samples in the i-th class, μ_k,i and σ²_k,i are respectively the mean and variance of feature f_k in the i-th class, and μ_k is the mean of feature f_k over all classes;
MIC(f_k, c) = max_{m·n < B(N)} { M(D)_{m,n} }    (2)
wherein B(N) is the upper limit on the m × n grid partition, and the value of MIC(f_k, c) lies in the range [0, 1].
3. The improved hybrid multi-target particle swarm optimization feature selection algorithm according to claim 2, wherein in the second stage, feature selection is performed by using a multi-target particle swarm optimization algorithm combined with explosive particles, so that the problem of particle oscillation is solved, and local optimization is skipped to obtain an optimal feature subset;
the method specifically comprises the following steps:
particle encoding: the algorithm uses real-number encoding; each particle in the population is represented by a real-valued vector that encodes the feature set, i.e., the candidate solution, of one particle; a component greater than a set real-number threshold (preset to 0.5) is marked 1, indicating that the corresponding feature is selected, otherwise the feature is not selected;
population initialization: the total number of particles in the population is N; the first part has size N1, accounting for 50% of the population; the K features retained by coarse-grained selection carry prior knowledge of importance, so the particle swarm can be initialized using the existing feature ranking: the feature list is divided into three segments covering 50%, 30% and 20% of the features, and 30%, 10% and 5% of the features of each segment, respectively, are randomly selected for position initialization; the second part has size N2, the other 50% of the population, and its particles are initialized completely at random;
nonlinear parameter adjustment based on the iteration count: choosing appropriate parameter settings is particularly important; so that the later search can still explore with a certain depth, nonlinear inertia weight adjustment is adopted, making the value large early and small later, which alleviates the particle oscillation problem to a certain extent and benefits the particle search; for particle search, a larger c1 and a smaller c2 are used in the initial stage, and a smaller c1 and a larger c2 in the later stage; unlike strategies based on a sine function, particles at different iteration counts use different nonlinear calculation modes for c1 and c2: c1 decreases nonlinearly and c2 increases nonlinearly;
offspring generation strategy based on explosive particles: updating particles only in the traditional way leaves them disoriented by particle oscillation, so a particle-explosion generation strategy produces promising particles, compensating for the shortcomings of traditional particle updating and improving convergence accuracy; by exploiting mutual information sharing among particles, a distributed information-sharing mechanism is provided: generating offspring increases the diversity of particles in the population, improves population quality, and lets the solution region be explored more effectively, so that important features are found;
particle selection strategy: an offspring particle swarm is obtained by applying the offspring generation strategy to the updated original particles; the original and offspring particles are mixed into candidate particles, from which the particles used in the next iteration are selected; when screening the mixed particles, an elite retention strategy keeps the better particles, selecting a number of well-performing particles to guarantee the quality of the selection and thus effectively preserve the algorithm's diversity during the search;
fitness function setting: the feature selection problem is to select a small number of relevant features that achieve classification performance similar to, or even better than, using all features; feature selection therefore considers two main conflicting objectives, i.e., the optimization objective functions: classification accuracy and the number of features;
external archive update: with the dominance-based external archive update strategy, the external archive must be updated after each iteration; dominated solutions are deleted from the external archive according to the dominance/non-dominance relation, and non-dominated solutions are added to it.
CN202210202109.8A 2022-03-02 2022-03-02 Improved hybrid multi-target particle swarm optimization feature selection algorithm Pending CN115249054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210202109.8A CN115249054A (en) 2022-03-02 2022-03-02 Improved hybrid multi-target particle swarm optimization feature selection algorithm


Publications (1)

Publication Number Publication Date
CN115249054A true CN115249054A (en) 2022-10-28

Family

ID=83699158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210202109.8A Pending CN115249054A (en) 2022-03-02 2022-03-02 Improved hybrid multi-target particle swarm optimization feature selection algorithm

Country Status (1)

Country Link
CN (1) CN115249054A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541686A (en) * 2022-11-01 2023-08-04 河海大学 Electric energy quality disturbance classification method based on multi-domain feature fusion extreme learning machine
CN116541686B (en) * 2022-11-01 2024-03-15 河海大学 Electric energy quality disturbance classification method based on multi-domain feature fusion extreme learning machine
CN117033965A (en) * 2023-08-11 2023-11-10 湖北工业大学 Biological vaccine data characteristic selection method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN108334949B (en) Image classifier construction method based on optimized deep convolutional neural network structure fast evolution
CN115249054A (en) Improved hybrid multi-target particle swarm optimization feature selection algorithm
JP6240804B1 (en) Filtered feature selection algorithm based on improved information measurement and GA
Xue et al. Multi-objective feature selection in classification: A differential evolution approach
Nguyen et al. Particle swarm optimisation with genetic operators for feature selection
Zhou et al. A correlation guided genetic algorithm and its application to feature selection
CN111582428A (en) Multi-modal multi-objective optimization method based on grey prediction evolution algorithm
Pei et al. Genetic algorithms for classification and feature extraction
Xue et al. An archive based particle swarm optimisation for feature selection in classification
CN113222002A (en) Zero sample classification method based on generative discriminative contrast optimization
Saha et al. Exploiting linear interpolation of variational autoencoders for satisfying preferences in evolutionary design optimization
CN112183598A (en) Feature selection method based on genetic algorithm
CN115185732A (en) Software defect prediction method fusing genetic algorithm and deep neural network
CN114997303A (en) Bladder cancer metabolic marker screening method and system based on deep learning
CN112200224B (en) Medical image feature processing method and device
CN115394381A (en) High-entropy alloy hardness prediction method and device based on machine learning and two-step data expansion
CN110263906B (en) Asymmetric negative correlation search method
CN114863508A (en) Expression recognition model generation method, medium and device of adaptive attention mechanism
CN112529179A (en) Genetic algorithm-based confrontation training method and device and computer storage medium
CN111260077A (en) Method and device for determining hyper-parameters of business processing model
Chuang et al. Chaotic binary particle swarm optimization for feature selection using logistic map
Klemmer et al. Sxl: Spatially explicit learning of geographic processes with auxiliary tasks
CN114077895A (en) Variational self-coding model of antagonism strategy
Indira et al. Association rule mining using genetic algorithm: The role of estimation parameters
Kawa et al. Designing convolution neural network architecture by utilizing the complexity model of the dataset

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination