CN115249054A - Improved hybrid multi-target particle swarm optimization feature selection algorithm - Google Patents


Info

Publication number: CN115249054A
Application number: CN202210202109.8A
Authority: CN (China)
Prior art keywords: particle, algorithm, feature, population
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 潘晓英, 孙俊
Current Assignee: Xi'an University of Posts and Telecommunications
Original Assignee: Xi'an University of Posts and Telecommunications
Application filed by Xi'an University of Posts and Telecommunications
Priority to CN202210202109.8A priority Critical patent/CN115249054A/en
Publication of CN115249054A publication Critical patent/CN115249054A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 20/00: Machine learning

Abstract

The invention discloses an improved hybrid multi-objective particle swarm optimization feature selection algorithm (HIMOPSO), belonging to the field of machine learning. The core idea is as follows. In the first stage, feature relevance is computed with the probability-distance-based Fisher score algorithm and the mutual-information-based MIC algorithm, and the features are ranked and screened to obtain two feature subsets; the two subsets are then combined by intersection. In the second stage, a multi-objective particle swarm algorithm is used as the search algorithm for feature selection, with a particle initialization method that balances population diversity and prior knowledge; the parameters of the multi-objective particle swarm algorithm are adjusted nonlinearly with the iteration count so that they suit each phase of the iterative process; and new particles are generated by explosion to increase population diversity, so that the particles can explore more promising regions and the overall population quality improves. The proposed feature selection method can obtain the optimal subset, which benefits subsequent classification learning on the data.

Description

Improved hybrid multi-target particle swarm optimization feature selection algorithm
Technical Field
The invention belongs to the field of machine learning, and particularly relates to an improved hybrid multi-objective particle swarm optimization feature selection (HIMOPSO) algorithm.
Background
With the improvement of data collection technology in modern society, the objects of data analysis generated in various fields are becoming more complex, and the dimensionality of data has increased greatly, as in text analysis, bioinformatics, and gene microarrays. In the biomedical field in particular, the demand for data mining grows daily, while the vigorous development of bioinformatics has greatly expanded the dimensionality of biomedical data, producing high-dimensional data; microarray datasets and gene expression profiles are typical high-dimensional datasets.
Data by itself does not generate value; value comes only from discovering the useful information and knowledge it contains, i.e., from data mining. Classification is currently the most common and widespread task in data mining and has become an effective means of data processing. However, as data dimensionality continues to grow to very high values, mining such data becomes increasingly difficult. Effective classification learning on high-dimensional small-sample data is therefore a hard problem in machine learning; in practice, effective feature selection on medical data enables classification learning and benefits the diagnosis and treatment of diseases.
At present, research on feature selection mainly focuses on optimized search with heuristic algorithms, among which the particle swarm algorithm is widely used: it converges quickly, has few parameters to adjust, and scales well. However, like all heuristic searches, particle swarm optimization is easily trapped in local optima, and it handles high-dimensional data poorly because the space searched by the heuristic is too large. The invention therefore provides an improved hybrid multi-objective particle swarm optimization feature selection algorithm to address these shortcomings of conventional feature selection methods.
Disclosure of Invention
The invention aims to provide an improved hybrid multi-objective particle swarm optimization feature selection algorithm that maintains population diversity, enhances population quality, and gives the particles a direction for effective search.
In order to achieve the purpose, the specific technical scheme of the invention is as follows:
An improved hybrid multi-objective particle swarm optimization (HIMOPSO) feature selection method, whose specific scheme is as follows:
In the first stage, the coarse-grained feature selection stage, a filtering algorithm mixing the Fisher score and MIC is first used to optimize the search space: the feature subsets obtained by the two filters are combined by union, removing a large number of irrelevant and weakly relevant features while retaining the important ones, so that the search space for fine-grained feature selection is optimized and a coarse-grained feature subset is generated.
In the second stage, the fine-grained feature selection stage, feature selection is performed with a multi-objective particle swarm optimization algorithm combined with explosive particles, which alleviates the particle oscillation problem and escapes local optima to obtain an optimal feature subset.
Particle encoding: the algorithm uses real-number encoding; each particle in the population is represented by a real-valued vector that encodes a feature set, i.e., a candidate solution. A component greater than a set real-number threshold (preset to 0.5) is coded 1, indicating that the corresponding feature is selected; otherwise the feature is not selected.
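As an illustrative sketch of this encoding (function name and use of NumPy are my own, not taken from the patent), the real-valued particle can be decoded into a binary feature mask like this:

```python
import numpy as np

def decode_particle(position, threshold=0.5):
    """Decode a real-valued particle into a binary feature mask.

    A component strictly greater than the threshold (preset to 0.5 in
    the patent) is coded 1, meaning the corresponding feature is
    selected; all other features are left unselected.
    """
    return (np.asarray(position) > threshold).astype(int)
```

For example, a particle `[0.9, 0.2, 0.51, 0.5]` selects the first and third features only, since the comparison is strict.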
Population initialization: the total number of particles in the population is N, and the first part has size N1, accounting for 50% of the population. The K features retained by coarse-grained selection carry prior knowledge of importance, so the particle swarm can be initialized using the existing feature ranking: the feature list is divided into three segments covering 50%, 30% and 20% of the features, and 30%, 10% and 5% of the features of each segment, respectively, are randomly selected for position initialization. The second part has size N2, the other 50% of the population, and its particles are initialized completely at random.
Nonlinear parameter adjustment based on the iteration count: choosing appropriate parameter settings is particularly important. So that the later search can still explore with a certain depth, a nonlinear inertia weight adjustment is adopted, making the value large early and small later; this alleviates the particle oscillation problem to a certain extent and benefits the search. For particle search, a larger c1 and a smaller c2 are used in the initial stage, and a smaller c1 and a larger c2 in the later stage. Unlike strategies based on a sine function, particles at different iteration counts use different nonlinear calculation modes for c1 and c2: c1 decreases nonlinearly and c2 increases nonlinearly.
Offspring generation strategy based on explosive particles: updating particles only in the traditional way leaves them disoriented by particle oscillation, so a particle-explosion generation strategy produces promising particles, compensating for the shortcomings of traditional particle updating and improving convergence accuracy. By exploiting mutual information sharing among particles, a distributed information-sharing mechanism is provided: generating offspring increases the diversity of particles in the population, improves population quality, and lets the solution region be explored more effectively, so that important features are found.
Particle selection strategy: an offspring particle swarm is obtained by applying the offspring generation strategy to the updated original particles; the original and offspring particles are mixed into candidate particles, from which the particles used in the next iteration are selected. When screening the mixed particles, an elite retention strategy keeps the better particles, selecting a number of well-performing particles to guarantee the quality of the selection and thus effectively preserve the algorithm's diversity during the search.
Fitness function setting: the feature selection problem is to select a small number of relevant features that achieve classification performance similar to, or even better than, using all features. Feature selection therefore considers two main conflicting objectives, i.e., the optimization objective functions: classification accuracy and the number of features.
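The two conflicting objectives can be sketched as a bi-objective fitness to be minimized (a minimal sketch; `error_rate_fn` is a hypothetical callback standing in for training and evaluating a classifier on the selected features, and is not specified by the patent):

```python
import numpy as np

def fitness(mask, error_rate_fn):
    """Bi-objective fitness: (classification error rate, feature ratio).

    Both objectives are minimized; minimizing the error rate maximizes
    classification accuracy, and the feature ratio penalizes large
    subsets. An empty subset is assigned worst-case values.
    """
    mask = np.asarray(mask)
    n_selected = int(mask.sum())
    if n_selected == 0:
        return (1.0, 1.0)
    return (error_rate_fn(mask), n_selected / mask.size)
```

A Pareto-based search such as the one described here then compares particles on these two values rather than on a single weighted score.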
External archive update: with the dominance-based external archive update strategy, the external archive must be updated after each iteration; dominated solutions are deleted from the archive according to the dominance/non-dominance relation, and non-dominated solutions are added to it.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses a hybrid filtering algorithm to remove features quickly, optimizing the search space of the next stage and providing prior knowledge. In the second stage, a multi-objective particle swarm algorithm is chosen as the base algorithm, with a hybrid initialization strategy, nonlinear parameter updates, and explosive particles generated during iteration; this maintains population diversity, enhances population quality, and gives the particles a direction for effective search.
2. The invention was tested on 6 public medical datasets; the selected feature subsets achieve good classification results in subsequent learning, and the selected data can be analyzed further by professionals, benefiting the diagnosis and treatment of diseases.
Drawings
FIG. 1 is a diagram of the underlying information of the data set used;
FIG. 2 is a flow chart of an improved hybrid multi-objective particle swarm optimization feature selection algorithm;
FIG. 3 is a graph of the iteration results of the improved hybrid multi-objective particle swarm optimization feature selection method on the 6 datasets;
FIG. 4 is a comparison of results of different feature selection methods;
FIG. 5 is a graph comparing ROC curves for different feature selection methods.
Detailed Description
The invention is further described with reference to the accompanying figures, taking as an example a typical CNS data set. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
An improved hybrid multi-objective particle swarm optimization feature selection algorithm comprises the following steps:
Step 1: prepare the dataset, obtained from the Broad Institute cancer program, with 7129 features and 60 samples; it has the characteristics of high-dimensional small-sample data and is a binary-classification dataset.
Step 2: data preprocessing: check whether the dataset has missing values and, if so, handle them by substitution or imputation;
Step 3: use the probability-distance-based Fisher score to compute feature relevance and rank the features; the larger the value, the stronger the feature's class relevance. This generates feature subset D1.
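A minimal sketch of this step (my own implementation of the standard Fisher score, matching the formula in claim 2: between-class scatter of the per-class feature means over the pooled within-class variance):

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score per feature: sum_i n_i * (mu_ki - mu_k)^2 divided by
    sum_i n_i * var_ki, where i ranges over classes; larger scores mean
    stronger class relevance."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)  # small epsilon guards constant features
```

Subset D1 is then obtained by keeping the top-ranked features under this score.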
Step 4: compute the correlation between features and classes using the mutual-information-based maximal information coefficient (MIC), comparing the maximal mutual information values obtained from different grids; denote the filtered feature subset as D2.
Step 5: combine feature subsets D1 and D2 by union to generate the coarse-grained feature subset D.
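Steps 3 through 5 can be sketched as follows (the cutoff `k` for how many top-ranked features each filter keeps is my assumption; the patent does not state it):

```python
def coarse_grained_subset(fisher_scores, mic_scores, k):
    """Keep the top-k feature indices under each filter score and take
    the union D = D1 ∪ D2, as in steps 3-5."""
    def top_k(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:k])

    d1 = top_k(fisher_scores)   # Fisher-score ranking
    d2 = top_k(mic_scores)      # MIC ranking
    return sorted(d1 | d2)      # coarse-grained subset D
```

The union keeps any feature that at least one filter considers relevant, which matches the description of removing irrelevant features while retaining important ones.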
Step 6: initialize the particle swarm parameters; set the population size to 100 and the number of iterations to 100.
Step 7: initialize the population; the total number of particles is 100, divided into two parts. The first part has 50 particles: count the features in feature subset D, divide the feature list into three segments covering 50%, 30% and 20% of the features, then randomly select 30%, 10% and 5% of the features of each segment, respectively, for position initialization. The second part also has 50 particles, whose positions are initialized randomly.
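A sketch of the prior-knowledge half of this initialization (the exact real values written into selected vs. unselected positions are my assumption; the patent only requires them to fall above or below the 0.5 threshold):

```python
import random

def init_position(ranked_features, n_features):
    """Initialize one prior-knowledge particle: split the ranked feature
    list into 50%/30%/20% segments and randomly select 30%/10%/5% of
    each segment's features, respectively, setting those positions
    above the 0.5 selection threshold."""
    # Base values in [0, 0.5): every feature starts unselected.
    pos = [0.5 * random.random() for _ in range(n_features)]
    bounds = [0.0, 0.5, 0.8, 1.0]          # segment boundaries (fractions)
    fractions = [0.30, 0.10, 0.05]         # fraction selected per segment
    n = len(ranked_features)
    for lo, hi, frac in zip(bounds, bounds[1:], fractions):
        segment = ranked_features[int(lo * n):int(hi * n)]
        for f in random.sample(segment, max(1, int(frac * len(segment)))):
            pos[f] = 1.0 - 0.5 * random.random()   # value in (0.5, 1.0]
    return pos
```

The remaining 50 particles of the population are simply drawn uniformly at random over all positions.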
Step 8: initialize an external archive to store the non-dominated solutions.
Step 9: loop over each particle, adjusting the parameters of the particle swarm:
Inertia weight update: nonlinear inertia weight adjustment is adopted; formula (1) is used for iterations 1-60 and formula (2) for iterations 60-100.
[Formulas (1) and (2), defining w_e and w_l, are rendered only as images in the original publication.]
The early inertia weight is denoted w_e; the weight used in the middle and later stages is denoted w_l and tends toward more local convergence. w_s denotes the initial value of w, w_f the final value of w, t_iter the current iteration count, and T_max the total number of iterations.
Learning factor update (c1, c2): c1 decreases nonlinearly and c2 increases nonlinearly; formula (3) is used for iterations 1-60 and formula (4) for iterations 60-100.
[Formulas (3) and (4), defining the early and later-stage learning factors, are rendered only as images in the original publication.]
The early individual and social learning factors are denoted c_1e and c_2e; those used in the middle and later stages are denoted c_1l and c_2l. c_1max and c_1min denote the maximum and minimum of c1, and c_2max and c_2min the maximum and minimum of c2; t_iter is the current iteration count and T_max the total number of iterations.
Step 10: update the position and velocity of the particles.
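This is the standard PSO update; the sketch below clamps positions to [0, 1] to keep them in the real-coded range (the clamping choice is my assumption):

```python
import numpy as np

def update_particle(pos, vel, pbest, gbest, w, c1, c2, rng):
    """Standard PSO update: the new velocity blends inertia (w), the
    cognitive pull toward the personal best (c1) and the social pull
    toward the global/leader best (c2), each scaled by fresh uniform
    random vectors; the position is then shifted by the velocity."""
    r1, r2 = rng.random(len(pos)), rng.random(len(pos))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)
    return pos, vel
```

In a multi-objective setting, `gbest` is typically a leader drawn from the external archive rather than a single global best.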
Step 11: perform particle explosion once a certain iteration count is reached.
First, the explosion range A_i of particle i is determined from equation (5):

A_i = Â · (f(x_i) − y_min + ε) / ( Σ_j (f(x_j) − y_min) + ε )    (5)

where Â denotes the maximum vibration amplitude, f(x_i) is the fitness evaluation of the current individual, y_min denotes the best particle fitness in the current population, and ε is a constant that avoids division errors.
The offspring positions are then determined until offspring generation is complete; essentially, a shift operation is applied to each position component of the particle, adding a random number drawn from [0, A_i] to update the position of the new particle.
A mutation operation is applied to the particles to improve their diversity; under the mutation operator, the particle is multiplied by a random number following a Gaussian distribution.
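Step 11 can be sketched as follows (a fireworks-style offspring generator; the clamping to [0, 1] and the N(1, 1) parameters of the Gaussian mutation are my assumptions, since the patent does not give them):

```python
import numpy as np

def explode(pop, fitness, a_max=0.4, eps=1e-12, rng=None):
    """Explosion-based offspring generation: each particle's amplitude
    A_i (equation (5)) shrinks as its fitness approaches the population
    best, so good particles search locally and poor particles search
    widely; each offspring is the parent shifted by random amounts in
    [0, A_i] and then multiplied by Gaussian mutation noise."""
    rng = rng or np.random.default_rng()
    fitness = np.asarray(fitness, dtype=float)
    y_min = fitness.min()
    amp = a_max * (fitness - y_min + eps) / ((fitness - y_min).sum() + eps)
    children = []
    for x, a in zip(pop, amp):
        child = np.clip(x + rng.uniform(0.0, a, size=len(x)), 0.0, 1.0)
        child = np.clip(child * rng.normal(1.0, 1.0, size=len(x)), 0.0, 1.0)
        children.append(child)
    return children
```

The offspring are then mixed with the originals in step 12.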
Step 12: mix the new particles generated by explosion with the original particles; when screening the mixed particles, use an elite retention strategy to keep the better particles, selecting a number of well-performing particles to guarantee the quality of the selection and effectively preserve the algorithm's diversity during the search.
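A scalarized sketch of the elite retention step (the patent's setting is multi-objective, where "better" would be judged by Pareto rank rather than the single score assumed here):

```python
def elite_select(candidates, scores, n):
    """Elite retention: merge originals and offspring, rank by score
    (lower is better in this sketch) and keep the best n particles for
    the next iteration."""
    order = sorted(range(len(candidates)), key=lambda i: scores[i])
    return [candidates[i] for i in order[:n]]
```

Keeping only the top n of the mixed pool guarantees the quality of the surviving particles while the explosion offspring supply diversity.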
Step 13: compute each particle's fitness function and update the particle archive accordingly, deleting dominated solutions from the external archive and adding non-dominated solutions to it.
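The dominance-based archive update can be sketched directly from its definition (fitness vectors are minimized, as in the bi-objective error-rate/feature-count setting):

```python
def dominates(a, b):
    """Pareto dominance for minimization: a dominates b if it is no
    worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_archive(archive, candidate):
    """Drop archived solutions dominated by the candidate and add the
    candidate unless something in the archive dominates it."""
    if any(dominates(f, candidate) for f in archive):
        return archive
    return [f for f in archive if not dominates(candidate, f)] + [candidate]
```

Applied after every iteration, this keeps the external archive equal to the non-dominated front of everything evaluated so far.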
Step 14: check whether the iteration limit has been reached; if so, output the feature subset and end the algorithm. Otherwise, return to step 9.

Claims (3)

1. An improved hybrid multi-objective particle swarm optimization feature selection algorithm is characterized by comprising the following steps:
the method comprises the steps that coarse-grained feature selection is conducted in the first stage, correlation of features is calculated through a fisher score algorithm based on probability distance and an MIC algorithm based on mutual information, and the features are sorted and screened out according to the correlation to obtain two feature subsets; then, performing intersection processing on the two feature subsets to serve as prior knowledge of the next stage;
and in the second stage, fine-grained feature selection is carried out, and feature selection is carried out by utilizing a multi-target particle swarm optimization algorithm combined with explosive particles, so that the problem of particle oscillation is solved, and local optimization is skipped to obtain an optimal feature subset.
2. The improved hybrid multi-objective particle swarm optimization feature selection algorithm according to claim 1, wherein a fisher score algorithm based on probability distance and an MIC algorithm based on mutual information are adopted to calculate the correlation of features, and the features are sorted and screened out respectively according to the correlation to obtain two feature subsets; then, carrying out intersection processing on the two feature subsets to serve as prior knowledge of the next stage; wherein the Fisher Score is calculated as formula (1), and the MIC is calculated as formula (2);
F(f_k) = Σ_i n_i (μ_k,i − μ_k)² / Σ_i n_i σ²_k,i    (1)
wherein n_i is the number of samples in the i-th class, μ_k,i and σ²_k,i are respectively the mean and variance of feature f_k in the i-th class, and μ_k is the mean of feature f_k over all classes;
MIC(f_k, c) = max_{m·n < B(N)} { M(D)_{m,n} }    (2)
wherein B(N) is the upper limit on the m × n grid partition, and the value of MIC(f_k, c) lies in the range [0, 1].
3. The improved hybrid multi-target particle swarm optimization feature selection algorithm according to claim 2, wherein in the second stage, feature selection is performed by using a multi-target particle swarm optimization algorithm combined with explosive particles, so that the problem of particle oscillation is solved, and local optimization is skipped to obtain an optimal feature subset;
the method specifically comprises the following steps:
particle encoding: the algorithm uses real-number encoding; each particle in the population is represented by a real-valued vector that encodes the feature set, i.e., the candidate solution, of one particle; a component greater than a set real-number threshold (preset to 0.5) is marked 1, indicating that the corresponding feature is selected, otherwise the feature is not selected;
population initialization: the total number of particles in the population is N; the first part has size N1, accounting for 50% of the population; the K features retained by coarse-grained selection carry prior knowledge of importance, so the particle swarm can be initialized using the existing feature ranking: the feature list is divided into three segments covering 50%, 30% and 20% of the features, and 30%, 10% and 5% of the features of each segment, respectively, are randomly selected for position initialization; the second part has size N2, the other 50% of the population, and its particles are initialized completely at random;
nonlinear parameter adjustment based on the iteration count: choosing appropriate parameter settings is particularly important; so that the later search can still explore with a certain depth, nonlinear inertia weight adjustment is adopted, making the value large early and small later, which alleviates the particle oscillation problem to a certain extent and benefits the particle search; for particle search, a larger c1 and a smaller c2 are used in the initial stage, and a smaller c1 and a larger c2 in the later stage; unlike strategies based on a sine function, particles at different iteration counts use different nonlinear calculation modes for c1 and c2: c1 decreases nonlinearly and c2 increases nonlinearly;
offspring generation strategy based on explosive particles: updating particles only in the traditional way leaves them disoriented by particle oscillation, so a particle-explosion generation strategy produces promising particles, compensating for the shortcomings of traditional particle updating and improving convergence accuracy; by exploiting mutual information sharing among particles, a distributed information-sharing mechanism is provided: generating offspring increases the diversity of particles in the population, improves population quality, and lets the solution region be explored more effectively, so that important features are found;
particle selection strategy: an offspring particle swarm is obtained by applying the offspring generation strategy to the updated original particles; the original and offspring particles are mixed into candidate particles, from which the particles used in the next iteration are selected; when screening the mixed particles, an elite retention strategy keeps the better particles, selecting a number of well-performing particles to guarantee the quality of the selection and thus effectively preserve the algorithm's diversity during the search;
fitness function setting: the feature selection problem is to select a small number of relevant features that achieve classification performance similar to, or even better than, using all features; feature selection therefore considers two main conflicting objectives, i.e., the optimization objective functions: classification accuracy and the number of features;
external archive update: with the dominance-based external archive update strategy, the external archive must be updated after each iteration; dominated solutions are deleted from the external archive according to the dominance/non-dominance relation, and non-dominated solutions are added to it.
CN202210202109.8A 2022-03-02 2022-03-02 Improved hybrid multi-target particle swarm optimization feature selection algorithm Pending CN115249054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210202109.8A CN115249054A (en) 2022-03-02 2022-03-02 Improved hybrid multi-target particle swarm optimization feature selection algorithm


Publications (1)

Publication Number Publication Date
CN115249054A true CN115249054A (en) 2022-10-28

Family

ID=83699158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210202109.8A Pending CN115249054A (en) 2022-03-02 2022-03-02 Improved hybrid multi-target particle swarm optimization feature selection algorithm

Country Status (1)

Country Link
CN (1) CN115249054A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541686A (en) * 2022-11-01 2023-08-04 河海大学 Electric energy quality disturbance classification method based on multi-domain feature fusion extreme learning machine
CN116541686B (en) * 2022-11-01 2024-03-15 河海大学 Electric energy quality disturbance classification method based on multi-domain feature fusion extreme learning machine
CN117033965A (en) * 2023-08-11 2023-11-10 湖北工业大学 Biological vaccine data characteristic selection method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN108334949B (en) Image classifier construction method based on optimized deep convolutional neural network structure fast evolution
CN115249054A (en) Improved hybrid multi-target particle swarm optimization feature selection algorithm
JP6240804B1 (en) Filtered feature selection algorithm based on improved information measurement and GA
Xue et al. Multi-objective feature selection in classification: A differential evolution approach
Nguyen et al. Particle swarm optimisation with genetic operators for feature selection
Zhou et al. A correlation guided genetic algorithm and its application to feature selection
CN111582428A (en) Multi-modal multi-objective optimization method based on grey prediction evolution algorithm
Pei et al. Genetic algorithms for classification and feature extraction
Xue et al. An archive based particle swarm optimisation for feature selection in classification
CN113222002A (en) Zero sample classification method based on generative discriminative contrast optimization
Saha et al. Exploiting linear interpolation of variational autoencoders for satisfying preferences in evolutionary design optimization
CN112183598A (en) Feature selection method based on genetic algorithm
CN115185732A (en) Software defect prediction method fusing genetic algorithm and deep neural network
CN114997303A (en) Bladder cancer metabolic marker screening method and system based on deep learning
CN112200224B (en) Medical image feature processing method and device
CN115394381A (en) High-entropy alloy hardness prediction method and device based on machine learning and two-step data expansion
CN110263906B (en) Asymmetric negative correlation search method
CN114863508A (en) Expression recognition model generation method, medium and device of adaptive attention mechanism
CN112529179A (en) Genetic algorithm-based confrontation training method and device and computer storage medium
CN111260077A (en) Method and device for determining hyper-parameters of business processing model
Chuang et al. Chaotic binary particle swarm optimization for feature selection using logistic map
Klemmer et al. Sxl: Spatially explicit learning of geographic processes with auxiliary tasks
CN114077895A (en) Variational self-coding model of antagonism strategy
Indira et al. Association rule mining using genetic algorithm: The role of estimation parameters
Kawa et al. Designing convolution neural network architecture by utilizing the complexity model of the dataset

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination