CN113554144A - Self-adaptive population initialization method and storage device for multi-target evolutionary feature selection algorithm - Google Patents
- Publication number
- CN113554144A (application number CN202110856379.6A)
- Authority
- CN
- China
- Prior art keywords
- population
- adaptive
- initialization
- self
- initial solution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
Abstract
The invention relates to the technical field of algorithm optimization, in particular to a self-adaptive population initialization method and a storage device for a multi-objective evolutionary feature selection algorithm. The method comprises the following steps: creating a zero matrix M with n rows and d columns; comparing the feature dimension with the population size; and, if the feature dimension is larger than the population size, self-adaptively determining the number K of initial solutions to initialize, calculating the sub-population distribution symmetry axis H of this initial solution subset in the target space according to the ratio of the population size to the feature dimension, and self-adaptively sampling the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M with H as the symmetry axis, thereby completing population initialization. By this method, the initial solution set generated by the self-adaptive population initialization mechanism can cover a wider range of the feature space and naturally has an advantage of breadth in the search process.
Description
Technical Field
The invention relates to the technical field of algorithm optimization, in particular to a self-adaptive population initialization method and a storage device for a multi-objective evolutionary feature selection algorithm.
Background
In recent years, China has gradually entered the era of big data and artificial intelligence, and the demand for large-scale data classification is growing day by day. However, such data often carries a huge number of features, which include not only relevant features but also many irrelevant and redundant ones, along with complex interactions between features; all of these make learning a classification model difficult. Feature selection performs classification by identifying and screening a subset of features from the full feature set. It is a general means of addressing the above problems and a necessary step in big data processing. It is a non-convex, discrete, and complex optimization problem, and generally has the following two key optimization objectives (which are also the optimization objectives this patent focuses on): 1, minimizing the classification error rate; and 2, minimizing the proportion of selected features to the total features (the selected feature rate for short). However, as the number of features keeps growing, the resulting curse of dimensionality not only causes the search space of feature selection to grow exponentially and the local-optimum traps and infeasible solution regions to multiply, severely affecting the search efficiency and convergence rate of feature selection algorithms, but also makes the interactions between features more complex: irrelevant and redundant features become more numerous, the differences between selected feature subsets become smaller, and repeated and redundant solutions increase. All of these problems challenge the limits of existing feature selection algorithms. Exhaustive search is obviously unrealistic, and most heuristic search techniques, such as simulated annealing, tabu search and greedy algorithms, cannot effectively handle such a huge and complex feature search space.
Therefore, a new feature selection method is needed to meet the ever-increasing difficulty of large-scale feature search and the requirements of high-dimensional data classification.
The evolutionary algorithm is a meta-heuristic search method inspired by Darwinian evolution theory and the natural law of survival of the fittest. It realizes iterative optimization and updating of the population by simulating three main evolutionary stages, namely population initialization, offspring reproduction, and environment selection, so as to achieve a global search for the Pareto-optimal solution set. It has the following advantages: 1, the optimized object is treated as a "black box" problem, without making any assumption about the data; 2, it requires no domain knowledge, yet can be combined with existing methods from other fields (such as local search) to form better hybrid search methods; 3, it needs no complex parameter settings, has a simple structure, and is easy to operate and improve; and 4, it searches for solutions in a population-based manner, obtaining a set of feasible solutions in a single iteration, which not only improves search efficiency but also helps jump out of local-optimum traps. Owing to these characteristics, especially the swarm-intelligence search mode, the evolutionary algorithm is widely applied to multi-objective optimization problems and is also a powerful means of solving the large-scale multi-objective feature selection problem. However, most existing evolutionary algorithms for the large-scale multi-objective feature selection problem focus on the two evolutionary stages of offspring reproduction and environment selection; few scholars have carried out in-depth research specifically on the evolutionary characteristics of the population initialization stage.
Take NSGA-II, currently one of the most classical and commonly used multi-objective evolutionary algorithms, as an example: "A fast and elitist multiobjective genetic algorithm: NSGA-II". Its main flow chart is shown in FIG. 1, where step 1 is the population initialization stage, steps 2 and 3 are the offspring reproduction stage, and steps 4, 5 and 6 are the environment selection stage.
As shown in FIG. 1, the conventional non-dominated-sorting-based NSGA-II multi-objective evolutionary algorithm does not apply any special processing or design to the random population initialization process according to the characteristics of the large-scale feature selection problem. However, for a large-scale feature space, the population size is often limited. For example, a feature selection problem with 10000 features, constrained by computing resources and computing time, may well have to be optimized with a population of only 100 solutions. In this situation, the randomly sampled initial population cannot possibly cover a feature space far exceeding its own scale; even as the population converges toward the ideal Pareto front over the course of evolution, the effective feature scale to be covered shrinks, and it is often difficult to guarantee that each selected-feature region has a corresponding feasible solution. This problem is a weak link and a research blind spot in the field of large-scale multi-objective evolutionary feature selection, and a technical bottleneck that must be broken through to design an efficient and high-precision large-scale multi-objective evolutionary feature selection algorithm.
Disclosure of Invention
Therefore, a self-adaptive population initialization method for a multi-objective evolutionary feature selection algorithm is needed, so as to solve the technical problems of the slow population convergence of the multi-objective evolutionary feature selection algorithm and the poor final classification effect and quality of its multi-objective optimization solution set. The specific technical scheme is as follows:
the self-adaptive population initialization method for the multi-objective evolutionary feature selection algorithm comprises the following steps:
step S101: creating a zero matrix M with n rows and d columns, wherein the M represents an overall initial solution set;
step S102: comparing the size between the characteristic dimension and the population scale, if the characteristic dimension is smaller than or equal to the population scale, executing a step S103, and if the characteristic dimension is larger than the population scale, jumping to a step S104;
step S103: initializing the population by adopting a conventional random sampling method, and entering step S108;
step S104: calculating the initial solution quantity K which needs to be subjected to self-adaptive initialization in the whole population according to the ratio of the population scale to the characteristic dimension, and entering the step S105;
step S105: calculating a sub-population distribution symmetry axis H of the initial solution subset needing self-adaptive initialization in the target space according to the ratio of the population scale to the characteristic dimension, and entering step S106;
step S106: performing adaptive sampling on the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M by taking the H as a symmetry axis, and entering step S107;
step S107: randomly sampling all the rest solution vectors from the K +1 th row to the tail in the initial solution set represented by the matrix M by adopting the conventional population initialization method in the step S103, and entering the step S108;
step S108: and finishing population initialization.
Further, before the step S101, the method specifically includes the steps of:
adopting a preset coding mode to carry out gene coding on the solution vector;
the method for performing gene coding on the solution vector by adopting the preset coding mode specifically comprises the following steps: and (3) carrying out gene coding on the solution vector by adopting a 01 coding mode, wherein the gene bit with the value of 0 represents that the corresponding characteristic is not selected, and the gene bit with the value of 1 represents that the corresponding characteristic is selected.
Further, the step S104 specifically includes the steps of:
the initial number of solutions K is calculated by the following formula:
K=ceil(N*max(0.2,N/D))
wherein the ceil function rounds its argument up to the nearest integer, the max function takes the larger of its two parameter values, N represents the population size, and D represents the feature dimension.
Further, the step S105 specifically includes the steps of:
the axis of symmetry H is calculated by the following formula:
H=0.5*N/D
the symmetry axis H represents that the sub-population subjected to self-adaptive initialization is subjected to discrete uniform distribution around the symmetry axis with the horizontal axis H in a target space, wherein the value taken by the horizontal axis of the target space represents the ratio of the selected feature number to the total number of features in the feature vector.
Further, the step S106 specifically includes the steps of:
for a given row solution vector X = (x1, x2, x3, ..., xD), and for each gene position xi in X, draw a random decimal r between 0 and 1; if r < H, set xi = 1, namely the ith feature is selected; otherwise, if r ≥ H, set xi = 0, namely the ith feature is not selected;
and repeating the steps until the self-adaptive initialization of the initial solution vector of the previous K rows is completed.
Further, the step S103 specifically includes the steps of:
all solution vectors are discretely evenly distributed and randomly sampled from the horizontal axis of the target space with a value of 0.5.
In order to solve the technical problem, the storage device is further provided, and the specific technical scheme is as follows:
a storage device having stored therein a set of instructions for performing:
step S101: creating a zero matrix M with n rows and d columns, wherein the M represents an overall initial solution set;
step S102: comparing the size between the characteristic dimension and the population scale, if the characteristic dimension is smaller than or equal to the population scale, executing a step S103, and if the characteristic dimension is larger than the population scale, jumping to a step S104;
step S103: initializing the population by adopting a conventional random sampling method, and entering step S108;
step S104: calculating the initial solution quantity K which needs to be subjected to self-adaptive initialization in the whole population according to the ratio of the population scale to the characteristic dimension, and entering the step S105;
step S105: calculating a sub-population distribution symmetry axis H of the initial solution subset needing self-adaptive initialization in the target space according to the ratio of the population scale to the characteristic dimension, and entering step S106;
step S106: performing adaptive sampling on the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M by taking the H as a symmetry axis, and entering step S107;
step S107: randomly sampling all the rest solution vectors from the K +1 th row to the tail in the initial solution set represented by the matrix M by adopting the conventional population initialization method in the step S103, and entering the step S108;
step S108: and finishing population initialization.
Further, the set of instructions is further for performing: the step S101 is preceded by the steps of:
adopting a preset coding mode to carry out gene coding on the solution vector;
the method for performing gene coding on the solution vector by adopting the preset coding mode specifically comprises the following steps: and (3) carrying out gene coding on the solution vector by adopting a 01 coding mode, wherein the gene bit with the value of 0 represents that the corresponding characteristic is not selected, and the gene bit with the value of 1 represents that the corresponding characteristic is selected.
Further, the set of instructions is further for performing: the step S104 specifically further includes the steps of:
the initial number of solutions K is calculated by the following formula:
K=ceil(N*max(0.2,N/D))
wherein the ceil function rounds its argument up to the nearest integer, the max function takes the larger of its two parameter values, N represents the population size, and D represents the feature dimension;
the step S105 specifically further includes the steps of:
the axis of symmetry H is calculated by the following formula:
H=0.5*N/D
the symmetry axis H represents that the sub-population subjected to self-adaptive initialization is subjected to discrete uniform distribution around the symmetry axis with the horizontal axis H in a target space, wherein the value taken by the horizontal axis of the target space represents the ratio of the selected feature number to the total number of features in the feature vector.
Further, the set of instructions is further for performing: the step S106 further includes:
for a given row solution vector X = (x1, x2, x3, ..., xD), and for each gene position xi in X, draw a random decimal r between 0 and 1; if r < H, set xi = 1, namely the ith feature is selected; otherwise, if r ≥ H, set xi = 0, namely the ith feature is not selected;
and repeating the steps until the self-adaptive initialization of the initial solution vector of the previous K rows is completed.
The invention has the beneficial effects that: the self-adaptive population initialization method for the multi-objective evolutionary feature selection algorithm comprises the following steps: step S101: creating a zero matrix M with n rows and d columns, wherein the M represents an overall initial solution set; step S102: comparing the size between the characteristic dimension and the population scale, if the characteristic dimension is smaller than or equal to the population scale, executing a step S103, and if the characteristic dimension is larger than the population scale, jumping to a step S104; step S103: initializing the population by adopting a conventional random sampling method, and entering step S108; step S104: calculating the initial solution quantity K which needs to be subjected to self-adaptive initialization in the whole population according to the ratio of the population scale to the characteristic dimension, and entering the step S105; step S105: calculating a sub-population distribution symmetry axis H of the initial solution subset needing self-adaptive initialization in the target space according to the ratio of the population scale to the characteristic dimension, and entering step S106; step S106: performing adaptive sampling on the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M by taking the H as a symmetry axis, and entering step S107; step S107: randomly sampling all the rest solution vectors from the K +1 th row to the tail in the initial solution set represented by the matrix M by adopting the conventional population initialization method in the step S103, and entering the step S108; step S108: and finishing population initialization. By the method, the initial solution set generated by the self-adaptive population initialization mechanism can cover a wider range of feature space, and naturally has the advantage of breadth in the searching process. 
And further, the population convergence speed of the multi-target evolutionary feature selection algorithm is increased, and the final classification effect and quality of the multi-target optimization solution set are improved.
Drawings
FIG. 1 is a flow chart of a conventional NSGA-II multi-objective evolutionary algorithm based on non-dominated sorting in the background art;
FIG. 2 is a flow chart of a method for adaptive population initialization for a multi-objective evolutionary feature selection algorithm, according to an embodiment;
FIG. 3 is an example of initial solution set distribution under a conventional population initialization mechanism used by the conventional evolutionary algorithm according to the embodiment;
FIG. 4 is an exemplary distribution of initial solution sets under the adaptive population initialization method for the multi-objective evolutionary feature selection algorithm according to the embodiment;
FIG. 5 is a diagram illustrating classification dataset names, feature quantities, sample quantities, and category numbers for each test case according to an embodiment;
FIG. 6 is a diagram illustrating a comparison of performance of an adaptive population initialization method according to an embodiment with other classical algorithms;
FIG. 7 is a block diagram of a storage device according to an embodiment.
Description of reference numerals:
700. a storage device.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
Referring to fig. 2 to 6, in the present embodiment, the adaptive population initialization method for the multi-objective evolutionary feature selection algorithm may be applied to a storage device, including but not limited to: personal computers, servers, general purpose computers, special purpose computers, network devices, embedded devices, programmable devices, intelligent mobile terminals, etc.
The core technical idea of the application is as follows:
the specific technical scheme is as follows: the method mainly aims at improving the population initialization stage in three evolution stages (population initialization, offspring propagation and environment selection) of the multi-objective evolutionary algorithm, and designs a self-adaptive population initialization mechanism which adaptively adjusts the random distribution position and the rule of an initial solution in a feature space according to the proportional relation between the population size and the dimension of the feature space, so that the population convergence speed of the multi-objective evolutionary feature selection algorithm is improved, and the final classification effect and quality of a multi-objective optimized solution set are improved. First, theThe mechanism adopts a 01 coding mode to carry out gene coding on a solution vector, namely: a genetic locus with a value of 0 indicates that the corresponding feature was not selected, and a genetic locus with a value of 1 indicates that the corresponding feature was selected. Taking a feature space with dimension D as an example, the feature vector dimension of the space is also D, and it is assumed that a solution vector in the space is X ═ X (X)1,x2,x3,...,xD) Then xi0 means that the ith feature is not selected, and xi1 indicates that the ith feature is selected. Therefore, all the gene sets with the numerical value of 1 in the solution vector represent the selected feature subset in the feature space, and how to obtain higher classification accuracy, namely the final optimization goal of the multi-objective evolutionary feature selection algorithm, by using fewer selected features.
After the 01 encoding mode is determined, the self-adaptive population initialization mechanism used in the present application is implemented as follows:
Step S201: a zero matrix M of n rows and d columns is created, which represents the overall initial solution set. As shown in equation (1), M represents the set of all initial solutions in the population, with the 01 coding mode setting the chromosome gene positions, and each row of M represents the feature vector of one initial solution.
Step S202: is the feature dimension less than or equal to the population size? If so, execute step S203: initialize the population with the conventional random sampling method. When the feature dimension is less than or equal to the population size, the population is large enough to cover the whole feature space, and it can be guaranteed that the search sub-region corresponding to each specific selected-feature-subset size is covered by at least one solution, so population initialization can be carried out directly by the conventional random sampling method.
The conventional population initialization method performs discrete, uniformly distributed random sampling of all solution vectors directly around the value 0.5 on the horizontal axis of the target space, regardless of whether the population size is sufficient to cover a large-scale feature space. In the self-adaptive population initialization mechanism provided by the application, this step acts only when D ≤ N, i.e., when the population size is judged sufficient to cover the whole feature space so that the search sub-region corresponding to each specific selected-feature-subset size is guaranteed at least one covering solution; or on the remaining, not yet initialized solution vectors after the self-adaptive initialization branch has been taken.
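The conventional 0.5-axis random initialization described above can be sketched as follows (a hedged reading; the function and variable names are ours):

```python
import numpy as np

def conventional_init(n, d, rng):
    """Conventional random initialization: each gene is 1 with probability
    0.5, so solutions cluster around a selected-feature ratio of 0.5 on the
    horizontal axis of the objective space."""
    return (rng.random((n, d)) < 0.5).astype(int)

rng = np.random.default_rng(0)
P = conventional_init(100, 1000, rng)
print(P.shape)   # (100, 1000)
print(P.mean())  # close to 0.5
```

For D = 1000 features and N = 100 solutions, every row selects roughly half the features, which is exactly the coverage problem the adaptive branch addresses.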
If the feature dimension is larger than the population size, execute step S204: calculate the number K of initial solutions in the whole population that need self-adaptive initialization according to the ratio of the population size to the feature dimension. The step S204 specifically comprises:
the initial solution number K is calculated by the following formula (2):
K=ceil(N*max(0.2,N/D)) (2)
wherein the ceil function rounds its argument up to the nearest integer, the max function takes the larger of its two parameter values, N represents the population size, and D represents the feature dimension. Thus K equals the larger of 0.2 × N and N × (N/D), rounded up, i.e. at least 20% of the total population.
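Formula (2) can be checked numerically with illustrative values of N and D (the values and function name below are our own):

```python
import math

def adaptive_count(N, D):
    """K = ceil(N * max(0.2, N/D)): the number of initial solutions
    that receive self-adaptive initialization."""
    return math.ceil(N * max(0.2, N / D))

print(adaptive_count(100, 10000))  # 20: N/D = 0.01, so the 0.2 floor applies
print(adaptive_count(100, 200))    # 50: N/D = 0.5 exceeds the 0.2 floor
```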
Step S205: calculate the sub-population distribution symmetry axis H of the initial solution subset needing self-adaptive initialization in the target space according to the ratio of the population size to the feature dimension. The method specifically comprises the following steps:
the symmetry axis H is calculated by the following equation (3):
H=0.5*N/D (3)
The symmetry axis H means that the self-adaptively initialized sub-population follows a discrete uniform distribution around the horizontal-axis value H in the target space, where the value on the horizontal axis of the target space represents the ratio of the number of selected features to the total number of features in the feature vector. Obviously, this value lies between 0 and 1. The population distribution symmetry axis of the conventional random initial-solution sampling method is generally set to 0.5, i.e., random sampling follows a discrete uniform distribution centered at the midpoint of the horizontal axis. The self-adaptive initialization method of the application instead moves the symmetry axis adaptively toward the 0 boundary of the horizontal axis according to the ratio of the population size N to the feature dimension D, so that the search focus and region of interest of the self-adaptively initialized sub-population move forward, concentrating effort on searching for feasible solutions with fewer selected features. The benefits are mainly twofold: 1, the reduced number of selected features greatly shrinks the range of feasible solutions, greatly reducing the difficulty of population convergence and improving the population convergence speed; 2, computing resources are concentrated on searching for a small number of elite features, avoiding the large number of redundant and irrelevant features in the large-scale feature space and greatly improving the search efficiency of the algorithm.
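Formula (3), evaluated with the same illustrative values (our own), shows how far the symmetry axis moves toward 0 for a large feature space:

```python
def symmetry_axis(N, D):
    """H = 0.5 * N / D: the distribution symmetry axis of the
    self-adaptively initialized sub-population in the target space."""
    return 0.5 * N / D

print(symmetry_axis(100, 10000))  # 0.005, far below the conventional 0.5
print(symmetry_axis(100, 200))    # 0.25
```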
Step S206: and carrying out self-adaptive sampling on the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M by taking the H as a symmetry axis. The method specifically comprises the following steps:
Taking a certain row solution vector X = (x1, x2, x3, ..., xD) as an example, the specific method is as follows: for each gene position xi in X, draw a random decimal r between 0 and 1; if r < H, set xi = 1, namely the ith feature is selected; otherwise, if r ≥ H, set xi = 0, namely the ith feature is not selected. Repeat this until the self-adaptive initialization of the first K rows of initial solution vectors is completed.
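The per-gene sampling rule of step S206 can be sketched as follows (a hedged implementation; names and seed are ours):

```python
import numpy as np

def adaptive_row(D, H, rng):
    """One self-adaptively sampled solution vector: gene i becomes 1 when a
    random decimal r in [0, 1) satisfies r < H, else 0, so the expected
    selected-feature ratio of the row equals H."""
    return (rng.random(D) < H).astype(int)

rng = np.random.default_rng(42)
x = adaptive_row(10000, 0.005, rng)
print(x.sum())  # on the order of 50 selected features out of 10000
```

With H = 0.005, each sampled row selects only a few dozen features, concentrating the sub-population in the low-selected-feature region of the target space.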
Step S207: randomly sample all solution vectors remaining from the (K+1)th row to the end of the initial solution set represented by the matrix M using the conventional population initialization method of step S203. That is, all of them are sampled under a discrete uniform distribution around the value 0.5 on the horizontal axis of the target space; the basic flow is similar to step S206, except that the population distribution symmetry axis parameter is changed from H to 0.5.
Step S208: and finishing population initialization.
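Putting steps S201 to S208 together, the whole mechanism can be sketched in a single function (a sketch under our reading of the flow; the function name, variable names, and seed are our own):

```python
import math
import numpy as np

def adaptive_population_init(N, D, rng):
    """Self-adaptive population initialization (steps S201-S208)."""
    M = np.zeros((N, D), dtype=int)               # S201: zero matrix, N rows, D columns
    if D <= N:                                    # S202 -> S203: population covers the space
        M[:] = (rng.random((N, D)) < 0.5).astype(int)
        return M                                  # S208
    K = math.ceil(N * max(0.2, N / D))            # S204: adaptive solution count
    H = 0.5 * N / D                               # S205: sub-population symmetry axis
    M[:K] = (rng.random((K, D)) < H).astype(int)        # S206: adaptive sampling
    M[K:] = (rng.random((N - K, D)) < 0.5).astype(int)  # S207: conventional sampling
    return M                                      # S208: initialization complete

rng = np.random.default_rng(1)
P = adaptive_population_init(100, 10000, rng)
print(P.shape)  # (100, 10000)
# the first K = 20 adaptive rows select far fewer features than the rest
print(P[:20].mean() < P[20:].mean())  # True
```

The resulting population splits into two sub-populations, one near the 0.5 axis and one near H, matching the two-cluster distribution shown in FIG. 4.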
By the method, the initial solution set generated by the self-adaptive population initialization mechanism can cover a wider range of feature space, and naturally has the advantage of breadth in the searching process. And further, the population convergence speed of the multi-target evolutionary feature selection algorithm is increased, and the final classification effect and quality of the multi-target optimization solution set are improved.
The beneficial effects of the adaptive population initialization mechanism of the present application are further explained with actual data as follows:
as shown in fig. 3, the initial solution set distribution of the initial population of NSGA-II in two target spaces using the conventional population initialization method is illustrated, where target 1 (horizontal axis) represents the proportion of the selected features to the total features, and target 2 (vertical axis) represents the classification error rate of the selected feature subset in the classifier classification, and obviously, target 1 and target 2 need to be as small as possible.
As shown in fig. 4, the algorithm AIEA using the adaptive population initialization mechanism proposed in the present application is illustrated. The AIEA algorithm is an abbreviation of adaptive initialization based evolution algorithm, and is a new algorithm constructed after the proposed adaptive population initialization mechanism is migrated to the original NSGA-II algorithm architecture, that is, the adaptive population initialization mechanism of fig. 2 replaces the population initialization process of step 1 of fig. 1. Compared with fig. 3, the initial solution set of fig. 4 is divided into two sub-populations respectively distributed in the front and back sub-search regions of the target space. Therefore, the initial solution set generated by the self-adaptive population initialization mechanism can cover a wider range of feature space, and naturally has an advantage in breadth in the searching process.
Moreover, the initial solution of the top sub-population in fig. 4 selects fewer features than that in fig. 3, which means that the search area needs to search for fewer combinations of features, which means that the search difficulty and the population convergence difficulty are greatly reduced. Therefore, the sub-population generated by the adaptive initialization mechanism can complete convergence more quickly, thereby driving the sub-population initialized in a conventional manner to search the feature space together with higher efficiency.
In order to further verify the effect of the adaptive population initialization mechanism provided by the application, a plurality of different multi-objective evolutionary algorithms which are most classical and popular at present are introduced, and the method is compared with the multi-objective evolutionary method (namely AIEA) based on the adaptive population initialization mechanism. The comparison algorithm introduced is as follows:
NSGA-II: a fast and elitist multiobjective genetic algorithm NSGA-II (a fast and excellent multi-objective genetic algorithm: NSGA-II). The method adopts non-domination ordering based on Pareto domination relation as a primary environment selection standard, and is assisted by a diversity maintenance strategy non-secondary environment selection standard based on congestion degree analysis.
MOEA/D: multiobjective evolution algorithm based on decomposition. A group of weight vectors uniformly distributed in a target space is adopted to decompose a multi-target optimization problem into a series of single-target optimization subproblems, and the problems are simultaneously evolved together.
KnEA: knee point drive evolution algorithm (inflection point driven evolution algorithm). The optimal solution is selected by adopting environment selection preference based on inflection points, and the method is suitable for high-dimensional multi-objective optimization problems.
In the experiments, all algorithms share the same basic settings: population size N = 100 initial solutions, termination condition E = 10000 objective-function evaluations, and identical random seeds. All test cases for evaluating the final classification performance come from the publicly available experimental platform PlatEMO (https://github.com/BIMK/PlatEMO) and the UCI machine learning repository (http://archive.ics.uci.edu/ml/index.php). The name, number of classes, total number of features, and total number of samples of each classification dataset are listed in FIG. 5.
The performance index used in the experiments is the hypervolume (HV) indicator commonly used in multi-objective evolutionary algorithm evaluation: the larger the value, the better the algorithm performs jointly on the two optimization objectives, classification error rate and selected-feature proportion. For the classification test, 70% of each dataset is used to train the model and the remaining 30% is held out for testing. A K-nearest-neighbor (KNN) classifier with K = 5 is used to evaluate the features selected by the solutions in the Pareto-optimal set returned by each evolutionary algorithm, and 10-fold cross-validation is applied to improve the reliability of the results. In the experimental tests, the algorithm with the best HV value is therefore the one with the best overall classification accuracy and efficiency.
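For illustration, a minimal sketch of the two-objective HV computation described above, assuming the reference point (1, 1) since both objectives lie in [0, 1]; this is a simplified stand-in, not the HV routine actually used in the experiments:

```python
def hypervolume_2d(points, ref=(1.0, 1.0)):
    """HV of a 2-objective minimization solution set: larger is better."""
    # Keep only points that strictly dominate the reference point.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:  # sweep left to right, stacking non-overlapping rectangles
        if y < prev_y:
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv
```

For example, the single point (0.5, 0.5) dominates a 0.5 by 0.5 square, giving HV = 0.25, and adding a dominated point leaves the value unchanged.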
FIG. 6 reports the classification performance of the four algorithms; the data of the best-performing algorithm are shaded with a grey background for differentiation. For fairness, each algorithm was run 20 times and statistically aggregated performance data were collected. In FIG. 6, the upper value in each cell is the performance mean and the lower value the variance; the mean reflects the algorithm's performance, while the variance indicates its stability to some extent. As FIG. 6 shows, the improved multi-objective evolutionary algorithm AIEA, based on the adaptive population initialization mechanism proposed in the present application, is optimal on all test datasets (largest mean HV), and the Wilcoxon significance test confirms that this advantage is statistically significant. The adaptive population initialization mechanism provided in the present application can therefore markedly improve the performance of multi-objective evolutionary algorithms on large-scale feature selection problems, greatly improving the classification accuracy and efficiency of the multi-objective large-scale evolutionary feature selection method and obtaining a better classification effect with fewer computing resources.
Referring to fig. 2 to fig. 7, in the present embodiment, a storage device 700 is implemented as follows:
a storage device 700 having stored therein a set of instructions for performing:
Step S201: create a zero matrix M with n rows and d columns, representing the overall initial solution set. As shown in equation (1), M represents the set of all initial solutions in the population using 0/1 encoding for the gene positions (chromosome loci), and each row of M is the feature vector of one initial solution.
Step S202: is the feature dimension less than or equal to the population size? If so, execute step S203: initialize the population with the conventional random sampling method. When the feature dimension does not exceed the population size, the population is large enough to cover the whole feature space, and it can be guaranteed that every search sub-region corresponding to a specific selected-subset size is covered by at least one solution, so the conventional random sampling method can be used directly for population initialization.
The conventional population initialization method draws discrete, uniformly distributed random samples for all solution vectors around the value 0.5 on the horizontal axis of the objective space, regardless of whether the population size suffices to cover a large-scale feature space. In the adaptive population initialization mechanism of the present application, this step is applied only when D ≤ N, i.e., when the population is known to be large enough to cover the whole feature space and every search sub-region corresponding to a specific selected-subset size is guaranteed at least one solution; or to the remaining, not yet initialized solution vectors after the adaptive initialization branch has been taken.
If the feature dimension is larger than the population size, execute step S204: calculate the number K of initial solutions in the population that require adaptive initialization, according to the ratio of the population size to the feature dimension. Step S204 further comprises:
the initial solution number K is calculated by the following formula (2):
K=ceil(N*max(0.2,N/D)) (2)
where the ceil function rounds up to the nearest integer, the max function takes the larger of its two arguments, N is the population size, and D is the feature dimension. The value of K is therefore N·max(0.2, N/D) rounded up, i.e., at least 20% of the total population.
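Formula (2) can be transcribed directly; this is a sketch, and the function name is an assumption:

```python
import math

def adaptive_count(N, D):
    """K = ceil(N * max(0.2, N / D)); ceil rounds up to the nearest integer."""
    return math.ceil(N * max(0.2, N / D))
```

For example, with N = 100 and D = 1000 the ratio N/D = 0.1 falls below the 0.2 floor, so K = 20; with D = 200 the ratio 0.5 dominates and K = 50.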
Step S205: calculate the symmetry axis H of the distribution, in the objective space, of the sub-population of initial solutions requiring adaptive initialization, according to the ratio of the population size to the feature dimension. Specifically:
the symmetry axis H is calculated by the following equation (3):
H=0.5*N/D (3)
The symmetry axis H means that, in the objective space, the adaptively initialized sub-population is discretely and uniformly distributed around the horizontal-axis value H, where the horizontal axis represents the ratio of the number of selected features to the total number of features in the feature vector; this ratio clearly lies between 0 and 1. The conventional random-sampling method generally sets the distribution symmetry axis to 0.5, i.e., samples are spread around the midpoint of the horizontal axis. The adaptive initialization method instead shifts the distribution axis toward the 0 end of the horizontal axis in proportion to the ratio of the population size N to the feature dimension D, so that the search focus and region of interest of the adaptively initialized sub-population move forward, concentrating effort on feasible solutions that select fewer features. This brings two main benefits: (1) with fewer selected features, the space of feasible solutions shrinks considerably, greatly reducing the difficulty of population convergence and increasing its speed; (2) computing resources are concentrated on searching for a small number of elite features, avoiding the large number of redundant and irrelevant features in a large-scale feature space and greatly improving the search efficiency of the algorithm.
Step S206: adaptively sample the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M, using H as the symmetry axis. Specifically, taking a row solution vector X = (x1, x2, x3, ..., xD) as an example: for each gene position xi in X, draw a random decimal r between 0 and 1; if r < H, set xi = 1, i.e., the i-th feature is selected; otherwise (r ≥ H), set xi = 0, i.e., the i-th feature is not selected. Repeat until the adaptive initialization of the first K rows of initial solution vectors is complete.
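A minimal sketch of this per-row sampling; the function name and the use of Python's random module are assumptions:

```python
import random

def adaptive_row(D, H, rng=random):
    """Sample one 0/1 solution vector; each feature is selected with
    probability H, so the expected selected-feature ratio equals the axis H."""
    return [1 if rng.random() < H else 0 for _ in range(D)]
```

With D = 10000 and H = 0.05, a sampled row selects roughly 5% of the features.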
Step S207: randomly sample all remaining solution vectors, from row K+1 to the last row of the initial solution set represented by the matrix M, using the conventional population initialization method of step S203; that is, all of them are drawn from a discrete uniform distribution around the horizontal-axis value 0.5. The basic flow is the same as in step S206, except that the distribution symmetry axis parameter is 0.5 instead of H.
Step S208: and finishing population initialization.
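Putting steps S201 through S208 together, the branch logic can be sketched end to end; the function and variable names are illustrative assumptions, not taken from the specification:

```python
import math
import random

def initialize_population(N, D, rng=random):
    M = [[0] * D for _ in range(N)]            # S201: zero matrix M (n x d)
    if D <= N:                                 # S202/S203: conventional branch only
        K, axes = 0, []
    else:                                      # S204/S205: adaptive branch
        K = math.ceil(N * max(0.2, N / D))     # formula (2)
        H = 0.5 * N / D                        # formula (3)
        axes = [H] * K                         # first K rows use axis H
    axes += [0.5] * (N - K)                    # S207: remaining rows use axis 0.5
    for row, axis in zip(M, axes):             # S206: threshold sampling per gene
        for i in range(D):
            row[i] = 1 if rng.random() < axis else 0
    return M                                   # S208: initialization done
```

With N = 100 and D = 1000, the first K = 20 rows are sampled around H = 0.05 and the remaining 80 rows around 0.5, reproducing the two sub-populations illustrated in fig. 4.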
The instruction set, when executed, achieves the following: the initial solution set generated by the adaptive population initialization mechanism covers a wider range of the feature space and naturally has an advantage in breadth during the search, which in turn accelerates the population convergence of the multi-objective evolutionary feature selection algorithm and improves the final classification effect and the quality of the multi-objective optimal solution set.
It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, the technical solutions of the present invention can be directly or indirectly applied to other related technical fields by making changes and modifications to the embodiments described herein, or by using equivalent structures or equivalent processes performed in the content of the present specification and the attached drawings, which are included in the scope of the present invention.
Claims (10)
1. The self-adaptive population initialization method for the multi-objective evolutionary feature selection algorithm is characterized by comprising the following steps of:
step S101: creating a zero matrix M with n rows and d columns, wherein the M represents an overall initial solution set;
step S102: comparing the size between the characteristic dimension and the population scale, if the characteristic dimension is smaller than or equal to the population scale, executing a step S103, and if the characteristic dimension is larger than the population scale, jumping to a step S104;
step S103: initializing the population by adopting a conventional random sampling method, and entering step S108;
step S104: calculating the initial solution quantity K which needs to be subjected to self-adaptive initialization in the whole population according to the ratio of the population scale to the characteristic dimension, and entering the step S105;
step S105: calculating a sub-population distribution symmetry axis H of the initial solution subset needing self-adaptive initialization in the target space according to the ratio of the population scale to the characteristic dimension, and entering step S106;
step S106: performing adaptive sampling on the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M by taking the H as a symmetry axis, and entering step S107;
step S107: randomly sampling all the rest solution vectors from the K +1 th row to the tail in the initial solution set represented by the matrix M by adopting the conventional population initialization method in the step S103, and entering the step S108;
step S108: and finishing population initialization.
2. The adaptive population initialization method for the multi-objective evolutionary feature selection algorithm according to claim 1, wherein the step S101 further comprises the steps of:
adopting a preset coding mode to carry out gene coding on the solution vector;
the gene coding of the solution vector with the preset coding mode specifically comprises: coding the solution vector in 0/1 form, wherein a gene bit with value 0 indicates that the corresponding feature is not selected, and a gene bit with value 1 indicates that the corresponding feature is selected.
3. The adaptive population initialization method for the multi-objective evolutionary feature selection algorithm according to claim 1, wherein the step S104 further comprises the steps of:
the initial number of solutions K is calculated by the following formula:
K=ceil(N*max(0.2,N/D))
wherein the ceil function rounds up to the nearest integer, the max function takes the larger of its two argument values, N represents the population size, and D represents the feature dimension.
4. The adaptive population initialization method for the multi-objective evolutionary feature selection algorithm according to claim 1, wherein the step S105 further comprises the steps of:
the axis of symmetry H is calculated by the following formula:
H=0.5*N/D
the symmetry axis H represents that the sub-population subjected to self-adaptive initialization is subjected to discrete uniform distribution around the symmetry axis with the horizontal axis H in a target space, wherein the value taken by the horizontal axis of the target space represents the ratio of the selected feature number to the total number of features in the feature vector.
5. The adaptive population initialization method for the multi-objective evolutionary feature selection algorithm according to claim 1, wherein the step S106 further comprises the steps of:
taking a row solution vector X = (x1, x2, x3, ..., xD): for each gene position xi in X, draw a random decimal r between 0 and 1; if r < H, set xi = 1, i.e., the i-th feature is selected; otherwise, if r ≥ H, set xi = 0, i.e., the i-th feature is not selected;
and repeating the above until the adaptive initialization of the first K rows of initial solution vectors is completed.
6. The adaptive population initialization method for the multi-objective evolutionary feature selection algorithm according to claim 1, wherein the step S103 further comprises the steps of:
all solution vectors are discretely evenly distributed and randomly sampled from the horizontal axis of the target space with a value of 0.5.
7. A storage device having a set of instructions stored therein, the set of instructions being operable to perform:
step S101: creating a zero matrix M with n rows and d columns, wherein the M represents an overall initial solution set;
step S102: comparing the size between the characteristic dimension and the population scale, if the characteristic dimension is smaller than or equal to the population scale, executing a step S103, and if the characteristic dimension is larger than the population scale, jumping to a step S104;
step S103: initializing the population by adopting a conventional random sampling method, and entering step S108;
step S104: calculating the initial solution quantity K which needs to be subjected to self-adaptive initialization in the whole population according to the ratio of the population scale to the characteristic dimension, and entering the step S105;
step S105: calculating a sub-population distribution symmetry axis H of the initial solution subset needing self-adaptive initialization in the target space according to the ratio of the population scale to the characteristic dimension, and entering step S106;
step S106: performing adaptive sampling on the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M by taking the H as a symmetry axis, and entering step S107;
step S107: randomly sampling all the rest solution vectors from the K +1 th row to the tail in the initial solution set represented by the matrix M by adopting the conventional population initialization method in the step S103, and entering the step S108;
step S108: and finishing population initialization.
8. The storage device of claim 7, wherein the set of instructions is further configured to perform: the step S101 is preceded by the steps of:
adopting a preset coding mode to carry out gene coding on the solution vector;
the gene coding of the solution vector with the preset coding mode specifically comprises: coding the solution vector in 0/1 form, wherein a gene bit with value 0 indicates that the corresponding feature is not selected, and a gene bit with value 1 indicates that the corresponding feature is selected.
9. The storage device of claim 7, wherein the set of instructions is further configured to perform: the step S104 specifically further includes the steps of:
the initial number of solutions K is calculated by the following formula:
K=ceil(N*max(0.2,N/D))
wherein the ceil function rounds up to the nearest integer, the max function takes the larger of its two argument values, N represents the population size, and D represents the feature dimension;
the step S105 specifically further includes the steps of:
the axis of symmetry H is calculated by the following formula:
H=0.5*N/D
the symmetry axis H represents that the sub-population subjected to self-adaptive initialization is subjected to discrete uniform distribution around the symmetry axis with the horizontal axis H in a target space, wherein the value taken by the horizontal axis of the target space represents the ratio of the selected feature number to the total number of features in the feature vector.
10. The storage device of claim 7, wherein the set of instructions is further configured to perform: the step S106 further includes:
taking a row solution vector X = (x1, x2, x3, ..., xD): for each gene position xi in X, draw a random decimal r between 0 and 1; if r < H, set xi = 1, i.e., the i-th feature is selected; otherwise, if r ≥ H, set xi = 0, i.e., the i-th feature is not selected;
and repeating the above until the adaptive initialization of the first K rows of initial solution vectors is completed.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110856379.6A (CN113554144A) | 2021-07-28 | 2021-07-28 | Self-adaptive population initialization method and storage device for multi-target evolutionary feature selection algorithm
Publications (1)

Publication Number | Publication Date
---|---
CN113554144A | 2021-10-26
Family
ID=78104739
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN114202387A | 2021-12-17 | 2022-03-18 | Anhui University | Commodity recommendation method based on large-scale evolutionary algorithm
CN114202387B | 2021-12-17 | 2024-02-20 | Anhui University | Commodity recommendation method based on large-scale evolutionary algorithm
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination