CN113554144A - Self-adaptive population initialization method and storage device for multi-target evolutionary feature selection algorithm - Google Patents


Info

Publication number: CN113554144A
Application number: CN202110856379.6A
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Prior art keywords: population, self-adaptive, initialization, initial solution
Other languages: Chinese (zh)
Inventors: 徐航, 闻辉, 林元模, 许荣斌, 严涛, 黄淋云
Current assignee: Putian University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Putian University
Priority date: the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed
Application filed by Putian University
Priority: CN202110856379.6A
Publication: CN113554144A

Classifications

    • G — Physics
    • G06 — Computing; calculating or counting
    • G06N — Computing arrangements based on specific computational models
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/004 — Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 — Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 3/12 — Computing arrangements based on biological models using genetic models
    • G06N 3/126 — Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Abstract

The invention relates to the technical field of algorithm optimization, and in particular to a self-adaptive population initialization method and storage device for a multi-objective evolutionary feature selection algorithm. The method comprises the following steps: create a zero matrix M with n rows and d columns; compare the feature dimension with the population size; if the feature dimension is larger than the population size, adaptively determine the number K of initial solutions to initialize, calculate from the ratio of the population size to the feature dimension the symmetry axis H of the sub-population distribution of that initial solution subset in the target space, adaptively sample the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M with H as the symmetry axis, and finish the population initialization. With this method, the initial solution set generated by the self-adaptive population initialization mechanism covers a wider range of the feature space and naturally gains breadth in the search process.

Description

Self-adaptive population initialization method and storage device for multi-target evolutionary feature selection algorithm
Technical Field
The invention relates to the technical field of algorithm optimization, in particular to a self-adaptive population initialization method and storage equipment for a multi-objective evolutionary feature selection algorithm.
Background
In recent years, China has gradually entered the era of big data and artificial intelligence, and the demand for large-scale data classification grows by the day. However, such data often carries a huge number of features, including not only relevant features but also many irrelevant and redundant ones, as well as complex interactions between features, all of which make learning a classification model difficult. Feature selection — identifying and screening out a subset of all features for classification — is a general means of addressing these problems and a necessary path for big data processing. It is a non-convex, discrete, complex optimization problem, and generally has the following two key optimization objectives (which are also the objectives this patent studies): 1. minimizing the classification error rate; and 2. minimizing the proportion of selected features among all features (the selected feature rate for short). However, as the number of features keeps growing, the resulting curse of dimensionality not only causes the feature-selection search space to grow exponentially and local-optimum traps and infeasible-solution regions to multiply, severely affecting the search efficiency and convergence rate of feature selection algorithms, but also makes the interactions among features more complex; irrelevant and redundant features increase, so the differences between selected feature subsets shrink and repeated, redundant solutions accumulate. These problems all push existing feature selection algorithms to their limits: exhaustive search is obviously unrealistic, and most heuristic search techniques, such as simulated annealing, tabu search and greedy algorithms, cannot effectively handle such a huge and complex feature search space.
Therefore, a new feature selection method is needed to meet the growing difficulty of large-scale feature search and the demands of high-dimensional data classification.
The evolutionary algorithm is a meta-heuristic search method inspired by Darwinian evolution theory and nature's law of survival of the fittest. It iteratively optimizes and updates a population through three main evolutionary stages — population initialization, offspring propagation and environment selection — to achieve a global search for the Pareto-optimal solution set. It has the following advantages: 1. the optimized object is treated as a "black box" problem, with no assumptions made about the data; 2. it requires no domain knowledge, yet can be combined with existing methods from other fields (such as local search) to form better hybrid search methods; 3. it needs no complex parameter settings, has a simple structure, and is easy to run and improve; and 4. it searches for solutions as a population, obtaining a whole set of feasible solutions in each iteration, which both improves search efficiency and helps escape local-optimum traps. Owing to these characteristics, especially its swarm-intelligence search mode, the evolutionary algorithm is widely applied to multi-objective optimization problems and is also a powerful means for solving the large-scale multi-objective feature selection problem. However, most existing evolutionary algorithms studying large-scale multi-objective feature selection focus on the two evolutionary stages of offspring propagation and environment selection; few scholars have conducted deeper research specifically on the evolutionary characteristics of the population initialization stage.
Take the currently classical and widely used multi-objective evolutionary algorithm NSGA-II as an example: "A fast and elitist multiobjective genetic algorithm: NSGA-II". Its main flow is shown in FIG. 1, where step 1 is the population initialization phase, steps 2 and 3 are the offspring propagation phase, and steps 4, 5 and 6 are the environment selection phase.
As shown in fig. 1, the conventional non-dominated-sorting-based NSGA-II multi-objective evolutionary algorithm applies no special processing or design to the random population initialization process according to the problem characteristics of large-scale feature selection. Yet for large-scale feature spaces, population size is often limited. For example, a feature selection problem with 10000 features, constrained by computing resources and computing time, is likely to be searched with a population of only 100 solutions. In this situation, a randomly sampled initial population simply cannot cover a feature space far beyond its own scale; even as the population converges toward the ideal Pareto front over the course of evolution, the effective feature scale it covers shrinks, and it is often hard to guarantee that every selected-feature region has a corresponding feasible solution. This problem is a weak link and a research blind spot in the field of large-scale multi-objective evolutionary feature selection, and a technical bottleneck that must be broken to design efficient, high-precision large-scale multi-objective evolutionary feature selection algorithms.
Disclosure of Invention
Therefore, a self-adaptive population initialization method for the multi-objective evolutionary feature selection algorithm needs to be provided, to solve the technical problems of slow population convergence and poor final classification effect and quality of the multi-objective optimization solution set in multi-objective evolutionary feature selection algorithms. The specific technical scheme is as follows:
the self-adaptive population initialization method for the multi-objective evolutionary feature selection algorithm comprises the following steps:
step S101: create a zero matrix M with n rows and d columns, where M represents the overall initial solution set;
step S102: compare the feature dimension with the population size; if the feature dimension is less than or equal to the population size, execute step S103; if the feature dimension is larger than the population size, jump to step S104;
step S103: initialize the population with the conventional random sampling method, and go to step S108;
step S104: calculate, from the ratio of the population size to the feature dimension, the number K of initial solutions in the whole population that require self-adaptive initialization, and go to step S105;
step S105: calculate, from the ratio of the population size to the feature dimension, the symmetry axis H of the sub-population distribution in the target space for the initial solution subset requiring self-adaptive initialization, and go to step S106;
step S106: adaptively sample the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M, taking H as the symmetry axis, and go to step S107;
step S107: randomly sample all remaining solution vectors, from row K+1 to the end of the initial solution set represented by matrix M, with the conventional population initialization method of step S103, and go to step S108;
step S108: population initialization is complete.
Further, before the step S101, the method specifically includes the steps of:
gene-encode the solution vector with a preset coding mode;
the gene encoding of the solution vector with the preset coding mode specifically comprises: gene-encoding the solution vector in 0/1 form, where a gene bit with value 0 indicates that the corresponding feature is not selected, and a gene bit with value 1 indicates that the corresponding feature is selected.
Further, the step S104 specifically includes the steps of:
the initial number of solutions K is calculated by the following formula:
K=ceil(N*max(0.2,N/D))
where the ceil function rounds its argument up to the nearest integer, the max function takes the larger of its two arguments, N denotes the population size, and D denotes the feature dimension.
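A short worked example of the formula above (the concrete values N = 100 and D = 10000 echo the scale mentioned in the background section and are otherwise illustrative):

```python
import math

def adaptive_count(N, D):
    # K = ceil(N * max(0.2, N / D)); ceil rounds up, so K is never
    # less than 20% of the population size N.
    return math.ceil(N * max(0.2, N / D))

# N/D = 0.01 < 0.2, so the 20% floor applies: K = ceil(100 * 0.2) = 20.
print(adaptive_count(100, 10000))  # -> 20
# N/D = 0.5 > 0.2, so the ratio dominates: K = ceil(100 * 0.5) = 50.
print(adaptive_count(100, 200))    # -> 50
```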
Further, the step S105 specifically includes the steps of:
the axis of symmetry H is calculated by the following formula:
H=0.5*N/D
the symmetry axis H represents that the sub-population subjected to self-adaptive initialization is subjected to discrete uniform distribution around the symmetry axis with the horizontal axis H in a target space, wherein the value taken by the horizontal axis of the target space represents the ratio of the selected feature number to the total number of features in the feature vector.
Further, the step S106 specifically includes the steps of:
for a given row solution vector X = (x1, x2, x3, ..., xD): for each gene bit xi in X, generate a random decimal r between 0 and 1; if r < H, set xi = 1, i.e. the i-th feature is selected; otherwise, if r ≥ H, set xi = 0, i.e. the i-th feature is not selected;
and repeat until the self-adaptive initialization of the first K rows of initial solution vectors is complete.
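The per-row sampling rule above can be sketched as follows (an illustrative sketch; the function name and RNG handling are my own, not the patent's):

```python
import random

def adaptive_sample_row(D, H, rng):
    # For each gene bit draw r in [0, 1); the bit becomes 1 (feature
    # selected) iff r < H, so each feature is chosen with probability
    # H and a row selects about D * H features on average.
    return [1 if rng.random() < H else 0 for _ in range(D)]

rng = random.Random(0)
row = adaptive_sample_row(10000, 0.005, rng)  # roughly 50 ones expected
```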
Further, the step S103 specifically includes the steps of:
all solution vectors are discretely evenly distributed and randomly sampled from the horizontal axis of the target space with a value of 0.5.
To solve the above technical problem, a storage device is further provided; the specific technical scheme is as follows:
a storage device having stored therein a set of instructions for performing:
step S101: create a zero matrix M with n rows and d columns, where M represents the overall initial solution set;
step S102: compare the feature dimension with the population size; if the feature dimension is less than or equal to the population size, execute step S103; if the feature dimension is larger than the population size, jump to step S104;
step S103: initialize the population with the conventional random sampling method, and go to step S108;
step S104: calculate, from the ratio of the population size to the feature dimension, the number K of initial solutions in the whole population that require self-adaptive initialization, and go to step S105;
step S105: calculate, from the ratio of the population size to the feature dimension, the symmetry axis H of the sub-population distribution in the target space for the initial solution subset requiring self-adaptive initialization, and go to step S106;
step S106: adaptively sample the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M, taking H as the symmetry axis, and go to step S107;
step S107: randomly sample all remaining solution vectors, from row K+1 to the end of the initial solution set represented by matrix M, with the conventional population initialization method of step S103, and go to step S108;
step S108: population initialization is complete.
Further, the set of instructions is further for performing: the step S101 is preceded by the steps of:
gene-encode the solution vector with a preset coding mode;
the gene encoding of the solution vector with the preset coding mode specifically comprises: gene-encoding the solution vector in 0/1 form, where a gene bit with value 0 indicates that the corresponding feature is not selected, and a gene bit with value 1 indicates that the corresponding feature is selected.
Further, the set of instructions is further for performing: the step S104 specifically further includes the steps of:
the initial number of solutions K is calculated by the following formula:
K=ceil(N*max(0.2,N/D))
where the ceil function rounds its argument up to the nearest integer, the max function takes the larger of its two arguments, N denotes the population size, and D denotes the feature dimension;
the step S105 specifically further includes the steps of:
the axis of symmetry H is calculated by the following formula:
H=0.5*N/D
the symmetry axis H indicates that the adaptively initialized sub-population follows a discrete uniform distribution in the target space around the symmetry axis at horizontal-axis value H, where the target-space horizontal axis denotes the ratio of the number of selected features to the total number of features in the feature vector.
Further, the set of instructions is further for performing: the step S106 further includes:
for a given row solution vector X = (x1, x2, x3, ..., xD): for each gene bit xi in X, generate a random decimal r between 0 and 1; if r < H, set xi = 1, i.e. the i-th feature is selected; otherwise, if r ≥ H, set xi = 0, i.e. the i-th feature is not selected;
and repeat until the self-adaptive initialization of the first K rows of initial solution vectors is complete.
The invention has the beneficial effects that: the self-adaptive population initialization method for the multi-objective evolutionary feature selection algorithm comprises the following steps. Step S101: create a zero matrix M with n rows and d columns, where M represents the overall initial solution set. Step S102: compare the feature dimension with the population size; if the feature dimension is less than or equal to the population size, execute step S103; if it is larger, jump to step S104. Step S103: initialize the population with the conventional random sampling method, and go to step S108. Step S104: calculate, from the ratio of the population size to the feature dimension, the number K of initial solutions in the whole population that require self-adaptive initialization, and go to step S105. Step S105: calculate, from the same ratio, the symmetry axis H of the sub-population distribution in the target space for the initial solution subset requiring self-adaptive initialization, and go to step S106. Step S106: adaptively sample the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M, taking H as the symmetry axis, and go to step S107. Step S107: randomly sample all remaining solution vectors, from row K+1 to the end, with the conventional method of step S103, and go to step S108. Step S108: population initialization is complete. With this method, the initial solution set generated by the self-adaptive population initialization mechanism covers a wider range of the feature space and naturally gains breadth in the search process.
This, in turn, accelerates the population convergence of the multi-objective evolutionary feature selection algorithm and improves the final classification effect and quality of the multi-objective optimized solution set.
Drawings
FIG. 1 is a flow chart of a conventional NSGA-II multi-objective evolutionary algorithm based on non-dominated sorting in the background art;
FIG. 2 is a flow chart of a method for adaptive population initialization for a multi-objective evolutionary feature selection algorithm, according to an embodiment;
FIG. 3 is an example of initial solution set distribution under a conventional population initialization mechanism used by the conventional evolutionary algorithm according to the embodiment;
FIG. 4 is an exemplary distribution of initial solution sets under the adaptive population initialization method for the multi-objective evolutionary feature selection algorithm according to the embodiment;
FIG. 5 is a diagram illustrating classification dataset names, feature quantities, sample quantities, and category numbers for each test case according to an embodiment;
FIG. 6 is a diagram illustrating a comparison of performance of an adaptive population initialization method according to an embodiment with other classical algorithms;
FIG. 7 is a block diagram of a storage device according to an embodiment.
Description of reference numerals:
700. a storage device.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
Referring to fig. 2 to 6, in the present embodiment, the adaptive population initialization method for the multi-objective evolutionary feature selection algorithm may be applied to a storage device, including but not limited to: personal computers, servers, general purpose computers, special purpose computers, network devices, embedded devices, programmable devices, intelligent mobile terminals, etc.
The core technical idea of the application is as follows:
the specific technical scheme is as follows: the method mainly aims at improving the population initialization stage in three evolution stages (population initialization, offspring propagation and environment selection) of the multi-objective evolutionary algorithm, and designs a self-adaptive population initialization mechanism which adaptively adjusts the random distribution position and the rule of an initial solution in a feature space according to the proportional relation between the population size and the dimension of the feature space, so that the population convergence speed of the multi-objective evolutionary feature selection algorithm is improved, and the final classification effect and quality of a multi-objective optimized solution set are improved. First, theThe mechanism adopts a 01 coding mode to carry out gene coding on a solution vector, namely: a genetic locus with a value of 0 indicates that the corresponding feature was not selected, and a genetic locus with a value of 1 indicates that the corresponding feature was selected. Taking a feature space with dimension D as an example, the feature vector dimension of the space is also D, and it is assumed that a solution vector in the space is X ═ X (X)1,x2,x3,...,xD) Then xi0 means that the ith feature is not selected, and xi1 indicates that the ith feature is selected. Therefore, all the gene sets with the numerical value of 1 in the solution vector represent the selected feature subset in the feature space, and how to obtain higher classification accuracy, namely the final optimization goal of the multi-objective evolutionary feature selection algorithm, by using fewer selected features.
After the 0/1 coding mode is determined, the self-adaptive population initialization mechanism used in the present application is implemented as follows:
Step S201: a zero matrix M of n rows and d columns is created, representing the overall initial solution set. As shown in equation (1), M represents the set of all initial solutions in the population, with the 0/1 coding mode setting the chromosome gene bits; each row of M is the feature vector of one initial solution.
M = [0]n×d, i.e. the n-row, d-column matrix of all zeros    (1)
Step S202: is the feature dimension less than or equal to the population size? If so, execute step S203: initialize the population with the conventional random sampling method. When the feature dimension is less than or equal to the population size, the population is large enough to cover the whole feature space, guaranteeing that the search sub-region corresponding to each specific selected-subset size is covered by at least one solution, so population initialization can proceed directly with conventional random sampling.
The conventional population initialization method performs discrete uniformly distributed random sampling of all solution vectors directly around target-space horizontal-axis value 0.5, regardless of whether the population size suffices to cover a large-scale feature space. In the self-adaptive population initialization mechanism of this application, this step applies only when D ≤ N — that is, when the population size is judged sufficient to cover the whole feature space, with at least one solution covering each search sub-region corresponding to a specific selected-subset size — or to the remaining, not yet initialized solution vectors after the adaptive initialization branch has been taken.
If the feature dimension is larger than the population size, execute step S204: calculate, from the ratio of the population size to the feature dimension, the number K of initial solutions in the whole population that require self-adaptive initialization. Step S204 further includes:
the initial solution number K is calculated by the following formula (2):
K=ceil(N*max(0.2,N/D)) (2)
where the ceil function rounds its argument up to the nearest integer, the max function takes the larger of its two arguments, N denotes the population size, and D denotes the feature dimension. Thus K lies between 0.2 × N and N × (N/D), i.e. at least 20% of the total population is adaptively initialized.
Step S205: and calculating a sub-population distribution symmetry axis H of the initial solution subset needing self-adaptive initialization in the target space according to the population scale and the ratio of the characteristic dimension. The method specifically comprises the following steps:
the symmetry axis H is calculated by the following equation (3):
H=0.5*N/D (3)
the symmetry axis H represents that the sub-population subjected to self-adaptive initialization is subjected to discrete uniform distribution around the symmetry axis with the horizontal axis H in a target space, wherein the value taken by the horizontal axis of the target space represents the ratio of the selected feature number to the total number of features in the feature vector. Obviously, this range of values lies between 0 and 1. The population distribution symmetry axis of the conventional initial solution random sampling method is generally set to 0.5, i.e., random sampling is distributed in a discrete uniform distribution starting from half of the horizontal axis. According to the self-adaptive initialization method, the boundary of the cross shaft of the target space with the population distribution symmetric axial numerical value of 0 is moved in a self-adaptive mode according to the ratio of the population scale N to the feature scale D, so that the search focus and the region of interest corresponding to the sub-population adopting the self-adaptive initialization are moved forward, and feasible solutions with fewer selected features are searched by concentrated force. The benefits of this are mainly two: 1, the selected characteristic number is reduced, so that the range of feasible solutions is greatly reduced, the difficulty of population convergence is greatly reduced, and the population convergence speed is improved; 2, a small amount of elite characteristics are searched by centralized computing resources, a large amount of redundant and irrelevant characteristics in a large-scale characteristic space are avoided, and the searching efficiency of the algorithm is greatly improved.
Step S206: and carrying out self-adaptive sampling on the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M by taking the H as a symmetry axis. The method specifically comprises the following steps:
solving vector X (X) by a certain line1,x2,x3,...,xD) For example, the specific method is as follows: for each gene position X in XiMaking a random decimal r between 0 and 1, if r is less than H, then xiIf the number is 1, selecting the ith characteristic; otherwise, if r is greater than or equal to H, xi0, namely, the ith feature is not selected; and repeating the steps until the self-adaptive initialization of the initial solution vector of the previous K rows is completed.
Step S207: and randomly sampling all the solution vectors remained from the K +1 th row to the end in the initial solution set represented by the matrix M by adopting the conventional population initialization method in the step S203. That is, all the discrete uniform distribution random samples are taken from the horizontal axis of the target space with the value of 0.5, the basic flow is similar to step S205, but the parameter of the symmetry axis of the population distribution is changed from H to 0.5.
Step S208: and finishing population initialization.
By the method, the initial solution set generated by the self-adaptive population initialization mechanism can cover a wider range of feature space, and naturally has the advantage of breadth in the searching process. And further, the population convergence speed of the multi-target evolutionary feature selection algorithm is increased, and the final classification effect and quality of the multi-target optimization solution set are improved.
The beneficial effects of the self-adaptive population initialization mechanism of this application are further explained below with actual data:
as shown in fig. 3, the initial solution set distribution of the initial population of NSGA-II in two target spaces using the conventional population initialization method is illustrated, where target 1 (horizontal axis) represents the proportion of the selected features to the total features, and target 2 (vertical axis) represents the classification error rate of the selected feature subset in the classifier classification, and obviously, target 1 and target 2 need to be as small as possible.
As shown in fig. 4, which illustrates the algorithm AIEA using the self-adaptive population initialization mechanism proposed in this application. AIEA stands for adaptive initialization based evolutionary algorithm; it is a new algorithm built by migrating the proposed self-adaptive population initialization mechanism into the original NSGA-II architecture, i.e. the mechanism of fig. 2 replaces the step-1 population initialization of fig. 1. Compared with fig. 3, the initial solution set of fig. 4 splits into two sub-populations distributed over the front and rear sub-search-regions of the target space. The initial solution set generated by the self-adaptive population initialization mechanism therefore covers a wider range of the feature space and naturally gains breadth in the search process.
Moreover, the initial solutions of the top sub-population in fig. 4 select fewer features than those in fig. 3, so this search region contains fewer feature combinations to examine, which greatly reduces both the search difficulty and the difficulty of population convergence. The sub-population generated by the adaptive initialization mechanism can therefore converge more quickly, driving the conventionally initialized sub-population to search the feature space together with higher efficiency.
To further verify the effect of the adaptive population initialization mechanism provided by the application, several of the most classical and popular multi-objective evolutionary algorithms are introduced and compared with the multi-objective evolutionary method based on the adaptive population initialization mechanism (i.e., AIEA). The comparison algorithms are as follows:
NSGA-II: a fast and elitist multiobjective genetic algorithm NSGA-II (a fast and excellent multi-objective genetic algorithm: NSGA-II). The method adopts non-domination ordering based on Pareto domination relation as a primary environment selection standard, and is assisted by a diversity maintenance strategy non-secondary environment selection standard based on congestion degree analysis.
MOEA/D: multiobjective evolution algorithm based on decomposition. A group of weight vectors uniformly distributed in a target space is adopted to decompose a multi-target optimization problem into a series of single-target optimization subproblems, and the problems are simultaneously evolved together.
KnEA: knee point drive evolution algorithm (inflection point driven evolution algorithm). The optimal solution is selected by adopting environment selection preference based on inflection points, and the method is suitable for high-dimensional multi-objective optimization problems.
In the experiment, the same basic parameters were used for all algorithms: population size (N = 100 initial solutions), termination condition (E = 10000 objective function evaluations), random seeds, and so on. All test cases for evaluating the final classification effect come from the publicly available experimental platform PlatEMO (https://github.com/BIMK/PlatEMO) and the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php). The classification dataset name, number of categories, total number of features, and total number of samples of each test case are shown in FIG. 5.
The performance indicator adopted in the experiment is the HV (hypervolume) indicator commonly used in multi-objective evolutionary algorithm evaluation; the larger its value, the better the algorithm's combined performance on the two optimization targets, classification error rate and selected feature proportion. For the classification test, 70% of each dataset is used to train the model and the remaining 30% is held out for testing. A K-nearest-neighbor (KNN) classification model with K = 5 is used to test the classification effect of the features selected in the Pareto optimal solution set obtained by each evolutionary algorithm, and 10-fold cross-validation is adopted to improve the reliability of the results. Therefore, in the experimental tests, the algorithm with the best HV value is the one with the best combined classification accuracy and classification efficiency.
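As a sketch, the HV indicator for two minimization targets can be computed by a simple sweep against a reference point, here assumed to be (1, 1), the worst corner of the normalized target space. This is an illustrative implementation, not necessarily the one used in the experiments.

```python
def hypervolume_2d(front, ref=(1.0, 1.0)):
    """Hypervolume of a 2-target minimization front w.r.t. a reference point:
    the area dominated by the front and bounded by ref. Larger is better."""
    # keep only non-dominated points, sorted by the first target ascending
    nd = []
    best_f2 = float("inf")
    for f1, f2 in sorted(front):
        if f2 < best_f2:                  # strictly improves on everything to its left
            nd.append((f1, f2))
            best_f2 = f2
    # sweep: each point contributes a rectangle up to the next point's f1
    hv = 0.0
    for i, (f1, f2) in enumerate(nd):
        next_f1 = nd[i + 1][0] if i + 1 < len(nd) else ref[0]
        hv += (next_f1 - f1) * (ref[1] - f2)
    return hv
```

A single solution at (0.5, 0.5) yields HV = 0.25 under this reference point; adding non-dominated solutions closer to the origin increases the value, matching the "larger is better" reading above.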
FIG. 6 shows the classification performance of all four algorithms in the experiment, where the data of the best-performing algorithm is shaded with a grey background for differentiation. For fairness, each algorithm was run 20 times and statistically aggregated performance data was collected. In fig. 6, the upper value of each cell is the performance mean and the lower value is the performance variance; the mean reflects the algorithm's performance, and the variance indicates its stability to some extent. As can be seen from fig. 6, the improved multi-objective evolutionary algorithm AIEA based on the adaptive population initialization mechanism proposed in the present application performs best on all test datasets (largest mean HV), and the Wilcoxon significance test confirms that the advantage is statistically significant. Therefore, the self-adaptive population initialization mechanism provided by the application can markedly improve the performance of multi-objective evolutionary algorithms on large-scale feature selection optimization problems, greatly improve the classification accuracy and efficiency of the multi-target large-scale evolutionary feature selection method, and achieve a better classification effect with fewer computing resources.
Referring to fig. 2 to fig. 7, in the present embodiment, a storage device 700 is implemented as follows:
a storage device 700 having stored therein a set of instructions for performing:
Step S201: a zero matrix M of n rows and d columns is created, which represents the overall initial solution set. As shown in equation (1), M represents the set of all initial solutions in the population, with 0-1 coding used to set the chromosome gene positions; each row of M is the feature vector of one initial solution.
M = [m_{i,j}]_{n×d}, where every entry m_{i,j} is initialized to 0    (1)
Step S202: is the feature dimension less than or equal to the population size? If so, step S203 is executed: the population is initialized by the conventional random sampling method. When the feature dimension is less than or equal to the population size, the population is large enough to cover the whole feature space, ensuring that the search sub-region corresponding to each specific selected-feature-subset size is covered by at least one solution, so population initialization can be carried out directly by conventional random sampling.
The conventional population initialization method performs discrete, uniformly distributed random sampling of all solution vectors directly around the value 0.5 on the horizontal axis of the target space, regardless of whether the population size is sufficient to cover a large-scale feature space. In the adaptive population initialization mechanism provided by the application, this step acts only when D is less than or equal to N, i.e., when the population size is judged sufficient to cover the whole feature space and at least one solution can be guaranteed in the search sub-region corresponding to each specific selected-feature-subset size; or on the remaining, not yet initialized solution vectors after the adaptive initialization branch has been taken.
If the feature dimension is larger than the population size, step S204 is executed: the number K of initial solutions requiring self-adaptive initialization in the whole population is calculated from the ratio of the population size to the feature dimension. The step S204 further includes:
the initial solution number K is calculated by the following formula (2):
K=ceil(N*max(0.2,N/D)) (2)
wherein the ceil function rounds up to the nearest integer, the max function takes the larger of its two arguments, N represents the population size, and D represents the feature dimension. Thus K equals the larger of 0.2×N and N×(N/D), i.e., at least 20% of the total population is initialized adaptively.
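Formula (2) can be sketched directly, assuming Python's math.ceil for the round-up:

```python
import math

def adaptive_count(N, D):
    """K = ceil(N * max(0.2, N / D)): number of adaptively initialized solutions
    when the feature dimension D exceeds the population size N."""
    return math.ceil(N * max(0.2, N / D))
```

With the experimental population size N = 100 and a feature dimension D = 1000, N/D = 0.1 < 0.2, so K = ceil(100 × 0.2) = 20, i.e., 20% of the population.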
Step S205: a sub-population distribution symmetry axis H, around which the initial solution subset requiring self-adaptive initialization is distributed in the target space, is calculated from the ratio of the population size to the feature dimension. Specifically:
the symmetry axis H is calculated by the following equation (3):
H=0.5*N/D (3)
The symmetry axis H means that, in the target space, the adaptively initialized sub-population follows a discrete uniform distribution symmetric about the horizontal-axis value H, where the horizontal axis of the target space represents the ratio of the number of selected features to the total number of features in the feature vector. Obviously, this value lies between 0 and 1. The population distribution symmetry axis of the conventional random sampling method is generally set to 0.5, i.e., the random samples are distributed discretely and uniformly around the middle of the horizontal axis. The self-adaptive initialization method instead shifts the symmetry axis toward the 0 end of the horizontal axis according to the ratio of the population size N to the feature dimension D, so that the search focus and region of interest of the adaptively initialized sub-population move forward and the search effort concentrates on feasible solutions with fewer selected features. The benefits are mainly twofold: 1. fewer selected features greatly shrink the range of feasible solutions, which greatly reduces the difficulty of population convergence and raises the convergence speed; 2. computing resources are concentrated on searching for a small number of elite features, avoiding the large number of redundant and irrelevant features in a large-scale feature space and greatly improving the search efficiency of the algorithm.
Step S206: and carrying out self-adaptive sampling on the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M by taking the H as a symmetry axis. The method specifically comprises the following steps:
Taking a certain row solution vector X = (x1, x2, x3, ..., xD) as an example, the specific method is as follows: for each gene position xi in X, a random decimal r between 0 and 1 is generated; if r is less than H, xi = 1, i.e., the i-th feature is selected; otherwise, if r is greater than or equal to H, xi = 0, i.e., the i-th feature is not selected. This is repeated until the self-adaptive initialization of the first K rows of initial solution vectors is completed.
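The per-gene sampling of step S206 can be sketched as follows. The helper name, the in-place matrix convention, and the RNG handling are illustrative assumptions.

```python
import random

def adaptive_sample(M, K, H, rng=None):
    """Adaptively sample the first K rows of the 0-1 matrix M around axis H.

    Each gene position is set to 1 with probability H, so each adaptively
    initialized row selects on average H * D of the D features.
    """
    rng = rng or random.Random()
    for row in range(K):
        for i in range(len(M[row])):
            r = rng.random()                 # random decimal r in [0, 1)
            M[row][i] = 1 if r < H else 0    # r < H: select the i-th feature
    return M
```

With N = 100 and D = 1000, H = 0.05, so each adaptively sampled row selects roughly 50 of the 1000 features, clustering the sub-population near the left of the horizontal axis.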
Step S207: all remaining solution vectors from row K+1 to the end of the initial solution set represented by the matrix M are randomly sampled by the conventional population initialization method of step S203, i.e., all of them are discretely and uniformly sampled around the value 0.5 on the horizontal axis of the target space. The basic flow is the same as the sampling in step S206, except that the population distribution symmetry axis parameter is changed from H to 0.5.
Step S208: and finishing population initialization.
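Putting steps S201 to S208 together, a minimal end-to-end sketch of the mechanism might look like this. All names and the RNG seed are illustrative assumptions, not the application's implementation.

```python
import math
import random

def initialize_population(N, D, seed=0):
    """Adaptive population initialization (steps S201-S208) for N solutions
    over D features; returns (M, K, H)."""
    rng = random.Random(seed)
    M = [[0] * D for _ in range(N)]            # S201: zero matrix, one row per solution
    if D <= N:                                 # S202: population covers the feature space
        K, H = 0, 0.5                          # S203: conventional sampling only
    else:
        K = math.ceil(N * max(0.2, N / D))     # S204: adaptive sub-population size
        H = 0.5 * N / D                        # S205: shifted distribution axis
    for row in range(N):
        axis = H if row < K else 0.5           # S206: first K rows; S207: the rest
        for i in range(D):
            M[row][i] = 1 if rng.random() < axis else 0
    return M, K, H                             # S208: initialization finished
```

For N = 100 and D = 1000 this produces the two sub-populations described above: 20 sparse rows around H = 0.05 and 80 conventional rows around 0.5.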
The instruction set, when executed, yields the same beneficial effects as the method above: the initial solution set generated by the self-adaptive population initialization mechanism covers a wider range of the feature space and naturally has an advantage in breadth during the search, which in turn accelerates population convergence of the multi-target evolutionary feature selection algorithm and improves the final classification effect and the quality of the multi-target optimization solution set.
It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, the technical solutions of the present invention can be directly or indirectly applied to other related technical fields by making changes and modifications to the embodiments described herein, or by using equivalent structures or equivalent processes performed in the content of the present specification and the attached drawings, which are included in the scope of the present invention.

Claims (10)

1. The self-adaptive population initialization method for the multi-objective evolutionary feature selection algorithm is characterized by comprising the following steps of:
step S101: creating a zero matrix M with n rows and d columns, wherein the M represents an overall initial solution set;
step S102: comparing the size between the characteristic dimension and the population scale, if the characteristic dimension is smaller than or equal to the population scale, executing a step S103, and if the characteristic dimension is larger than the population scale, jumping to a step S104;
step S103: initializing the population by adopting a conventional random sampling method, and entering step S108;
step S104: calculating the initial solution quantity K which needs to be subjected to self-adaptive initialization in the whole population according to the ratio of the population scale to the characteristic dimension, and entering the step S105;
step S105: calculating a sub-population distribution symmetry axis H of the initial solution subset needing self-adaptive initialization in the target space according to the ratio of the population scale to the characteristic dimension, and entering step S106;
step S106: performing adaptive sampling on the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M by taking the H as a symmetry axis, and entering step S107;
step S107: randomly sampling all the rest solution vectors from the K +1 th row to the tail in the initial solution set represented by the matrix M by adopting the conventional population initialization method in the step S103, and entering the step S108;
step S108: and finishing population initialization.
2. The adaptive population initialization method for the multi-objective evolutionary feature selection algorithm according to claim 1, wherein the step S101 further comprises the steps of:
adopting a preset coding mode to carry out gene coding on the solution vector;
the method for performing gene coding on the solution vector by adopting the preset coding mode specifically comprises: performing gene coding on the solution vector in a 0-1 coding mode, wherein a gene bit with the value 0 indicates that the corresponding feature is not selected, and a gene bit with the value 1 indicates that the corresponding feature is selected.
3. The adaptive population initialization method for the multi-objective evolutionary feature selection algorithm according to claim 1, wherein the step S104 further comprises the steps of:
the initial number of solutions K is calculated by the following formula:
K=ceil(N*max(0.2,N/D))
wherein the ceil function rounds up to the nearest integer, the max function takes the larger of its two arguments, N represents the population scale, and D represents the characteristic dimension.
4. The adaptive population initialization method for the multi-objective evolutionary feature selection algorithm according to claim 1, wherein the step S105 further comprises the steps of:
the axis of symmetry H is calculated by the following formula:
H=0.5*N/D
the symmetry axis H represents that the sub-population subjected to self-adaptive initialization is subjected to discrete uniform distribution around the symmetry axis with the horizontal axis H in a target space, wherein the value taken by the horizontal axis of the target space represents the ratio of the selected feature number to the total number of features in the feature vector.
5. The adaptive population initialization method for the multi-objective evolutionary feature selection algorithm according to claim 1, wherein the step S106 further comprises the steps of:
taking a certain row solution vector X = (x1, x2, x3, ..., xD), for each gene position xi in X, a random decimal r between 0 and 1 is generated; if r is less than H, xi = 1, i.e., the i-th feature is selected; otherwise, if r is greater than or equal to H, xi = 0, i.e., the i-th feature is not selected;
and repeating the steps until the self-adaptive initialization of the initial solution vector of the previous K rows is completed.
6. The adaptive population initialization method for the multi-objective evolutionary feature selection algorithm according to claim 1, wherein the step S103 further comprises the steps of:
all solution vectors are discretely evenly distributed and randomly sampled from the horizontal axis of the target space with a value of 0.5.
7. A storage device having a set of instructions stored therein, the set of instructions being operable to perform:
step S101: creating a zero matrix M with n rows and d columns, wherein the M represents an overall initial solution set;
step S102: comparing the size between the characteristic dimension and the population scale, if the characteristic dimension is smaller than or equal to the population scale, executing a step S103, and if the characteristic dimension is larger than the population scale, jumping to a step S104;
step S103: initializing the population by adopting a conventional random sampling method, and entering step S108;
step S104: calculating the initial solution quantity K which needs to be subjected to self-adaptive initialization in the whole population according to the ratio of the population scale to the characteristic dimension, and entering the step S105;
step S105: calculating a sub-population distribution symmetry axis H of the initial solution subset needing self-adaptive initialization in the target space according to the ratio of the population scale to the characteristic dimension, and entering step S106;
step S106: performing adaptive sampling on the first K rows of initial solution vectors in the initial solution set represented by the zero matrix M by taking the H as a symmetry axis, and entering step S107;
step S107: randomly sampling all the rest solution vectors from the K +1 th row to the tail in the initial solution set represented by the matrix M by adopting the conventional population initialization method in the step S103, and entering the step S108;
step S108: and finishing population initialization.
8. The storage device of claim 7, wherein the set of instructions is further configured to perform: the step S101 is preceded by the steps of:
adopting a preset coding mode to carry out gene coding on the solution vector;
the method for performing gene coding on the solution vector by adopting the preset coding mode specifically comprises: performing gene coding on the solution vector in a 0-1 coding mode, wherein a gene bit with the value 0 indicates that the corresponding feature is not selected, and a gene bit with the value 1 indicates that the corresponding feature is selected.
9. The storage device of claim 7, wherein the set of instructions is further configured to perform: the step S104 specifically further includes the steps of:
the initial number of solutions K is calculated by the following formula:
K=ceil(N*max(0.2,N/D))
wherein the ceil function rounds up to the nearest integer, the max function takes the larger of its two arguments, N represents the population scale, and D represents the characteristic scale;
the step S105 specifically further includes the steps of:
the axis of symmetry H is calculated by the following formula:
H=0.5*N/D
the symmetry axis H represents that the sub-population subjected to self-adaptive initialization is subjected to discrete uniform distribution around the symmetry axis with the horizontal axis H in a target space, wherein the value taken by the horizontal axis of the target space represents the ratio of the selected feature number to the total number of features in the feature vector.
10. The storage device of claim 7, wherein the set of instructions is further configured to perform: the step S106 further includes:
taking a certain row solution vector X = (x1, x2, x3, ..., xD), for each gene position xi in X, a random decimal r between 0 and 1 is generated; if r is less than H, xi = 1, i.e., the i-th feature is selected; otherwise, if r is greater than or equal to H, xi = 0, i.e., the i-th feature is not selected;
and repeating the steps until the self-adaptive initialization of the initial solution vector of the previous K rows is completed.
CN202110856379.6A 2021-07-28 2021-07-28 Self-adaptive population initialization method and storage device for multi-target evolutionary feature selection algorithm Pending CN113554144A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110856379.6A CN113554144A (en) 2021-07-28 2021-07-28 Self-adaptive population initialization method and storage device for multi-target evolutionary feature selection algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110856379.6A CN113554144A (en) 2021-07-28 2021-07-28 Self-adaptive population initialization method and storage device for multi-target evolutionary feature selection algorithm

Publications (1)

Publication Number Publication Date
CN113554144A 2021-10-26

Family

ID=78104739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110856379.6A Pending CN113554144A (en) 2021-07-28 2021-07-28 Self-adaptive population initialization method and storage device for multi-target evolutionary feature selection algorithm

Country Status (1)

Country Link
CN (1) CN113554144A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202387A (en) * 2021-12-17 2022-03-18 安徽大学 Commodity recommendation method based on large-scale evolutionary algorithm
CN114202387B (en) * 2021-12-17 2024-02-20 安徽大学 Commodity recommendation method based on large-scale evolutionary algorithm

Similar Documents

Publication Publication Date Title
Chen et al. An evolutionary multitasking-based feature selection method for high-dimensional classification
CN103745258B (en) Complex network community mining method based on the genetic algorithm of minimum spanning tree cluster
Han et al. Competition-driven multimodal multiobjective optimization and its application to feature selection for credit card fraud detection
CN113407185B (en) Compiler optimization option recommendation method based on Bayesian optimization
Wan et al. Neural network-based multiobjective optimization algorithm for nonlinear beam dynamics
CN115469851A (en) Automatic parameter adjusting method for compiler
CN113344174A (en) Efficient neural network structure searching method based on probability distribution
CN113554144A (en) Self-adaptive population initialization method and storage device for multi-target evolutionary feature selection algorithm
CN110738362A (en) method for constructing prediction model based on improved multivariate cosmic algorithm
CN112819062B (en) Fluorescence spectrum secondary feature selection method based on mixed particle swarm and continuous projection
Phan et al. Efficiency enhancement of evolutionary neural architecture search via training-free initialization
CN111176865B (en) Peer-to-peer mode parallel processing method and framework based on optimization algorithm
CN114819151A (en) Biochemical path planning method based on improved agent-assisted shuffled frog leaping algorithm
CN110020725B (en) Test design method for weapon equipment system combat simulation
Rahati et al. Ensembles strategies for backtracking search algorithm with application to engineering design optimization problems
Yang et al. A hybrid evolutionary algorithm for finding pareto optimal set in multi-objective optimization
Kaya et al. A novel multi-objective genetic algorithm for multiple sequence alignment
Hadjiivanov et al. Epigenetic evolution of deep convolutional models
Owen et al. Adapting particle swarm optimisation for fitness landscapes with neutrality
Wu et al. An improved genetic algorithm based on explosion mechanism
Sathyapriya et al. Survey on N-Queen Problem with Genetic Algorithm
Dhivya et al. Weighted particle swarm optimization algorithm for randomized unit testing
CN113780146B (en) Hyperspectral image classification method and system based on lightweight neural architecture search
Li et al. Surrogate-Assisted Evolution of Convolutional Neural Networks by Collaboratively Optimizing the Basic Blocks and Topologies
Ito et al. OFA 2: A Multi-Objective Perspective for the Once-for-All Neural Architecture Search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination