CN110110753A

CN110110753A - Effective hybrid characteristic selecting method of pollination algorithm and ReliefF is spent based on elite

Info

Publication number: CN110110753A
Application number: CN201910266518.2A
Authority: CN
Inventors: 阎朝坤; 罗慧敏; 张戈; 马敬敬; 王建林
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2019-04-03
Filing date: 2019-04-03
Publication date: 2019-08-09
Anticipated expiration: 2039-04-03
Also published as: CN110110753B

Abstract

The present invention provides a kind of effective hybrid characteristic selecting method that pollination algorithm and ReliefF is spent based on elite.This method comprises: step 1, being initialized to the population being made of M individual using based on ReliefF feature ordering and double initial population strategies of randomization；Step 2 is spent pollination algorithm Population Regeneration using binary elite, and calculates the fitness value of each individual in population, and the globally optimal solution in population is obtained；Step 3 determines candidate solution using the neighborhood of tabu search algorithm search globally optimal solution, and updates taboo list according to the fitness value of candidate solution；Step 4 chooses the maximum individual of fitness value as elite individual in taboo list, and elite individual is replaced the smallest individual of fitness value in population, forms new population；Step 5, using step 2 to step 4 as an iteration, repeat step 2 to step 4, until current iteration number reaches the number of iterations of setting.The present invention can get high-class accuracy compared with other feature selection approach.

Description

Effective mixing feature selection method based on elite flower pollination algorithm and Relieff

Technical Field

The invention relates to the technical field of bioinformatics, in particular to an effective mixing feature selection method based on an elite flower pollination algorithm and Relieff.

Background

With the rapid development of biomedical technology and key technologies in the health field, a large amount of bioinformatic and clinical medical data, especially molecular biology experimental data, has grown and accumulated at a previously unprecedented rate and scale. The medical big data contains a large amount of valuable information, and data mining is carried out on the data, so that the disease law, risk factors and mutual influence among the risk factors related to the disease can be found, and reference is provided for clinical diagnosis and treatment of the disease. In recent years, researchers in related fields analyze microarray data, and compared and analyzed through experimental results, feasibility and effectiveness of the proposed method are proved, and a great amount of theoretical support is provided for research in the fields. However, as the gene data has more noise and redundant genes, some redundant genes are inevitably missed in the process of screening effective genes; meanwhile, because the dimensionality of gene data is high, the required time is long in a complex calculation process, and the efficiency of characteristic gene selection is low. Aiming at the problems, starting from the characteristics of gene microarray data, a tumor gene data set is analyzed and processed by using a machine learning related method, a plurality of feature selection algorithms enabling classification precision to be high are provided, and effectiveness is verified through comparison of experimental results. Therefore, the feature selection has been widely applied to the field of bioinformatics, and has important research significance and value for disease diagnosis and clinical treatment.

Feature selection techniques, which were first introduced in the last 60 th century, were essential to select an optimal subset of features from a set of features of raw data that meet certain evaluation criteria for use in the task of classification or regression. The method is mainly used for solving the problems related to statistics, signal processing and the like. Early feature selection algorithms did not perform satisfactorily in the direction of application because the relationships between features and categories, features and features were not considered. Since the 90 s of the last century, machine learning of large-scale data begins to appear in the visual field of people, traditional feature selection algorithms are severely challenged, an efficient global search technology is urgently needed to better solve the feature selection problem, and evolutionary computing is recently attracted by wide attention of the feature selection field because of the fact that the evolutionary computing is famous for the global search capability of the traditional feature selection algorithms. However, there is no comprehensive guideline as to the advantages and disadvantages of the alternative method and the most suitable application field. Researchers are therefore constantly trying to optimize machine learning and stochastic strategy algorithms, while introducing new intelligent algorithms to improve their computational efficiency and the quality of the selected feature subset.

At present, feature selection algorithms are classified according to search strategies, and there are mainly three feature selection algorithms based on different search strategies: feature selection algorithms based on exhaustive search strategies, feature selection algorithms based on random search strategies, and feature selection algorithms based on meta-heuristic search strategies.

(1) Feature selection algorithm based on exhaustive search strategy: the exhaustion method and the branch-and-bound method are the methods mainly adopted by the global optimization. The exhaustive method, which may also be referred to as a depletion search, selects the optimal feature subset that meets the requirements by searching each feature subset, such as a backtracking method, because it can traverse all feature sets, and thus can certainly find the globally optimal feature subset. However, if the number of original features is large, the search space naturally also becomes large, and the execution efficiency of the depletion search also decreases, which is not practical. The branch and bound method is to shorten the search time through pruning operation, and is the only method for obtaining the optimal result in the current global search, but it requires that the number of the optimal feature subsets is preset before the search is started and the evaluation function has monotonicity. Meanwhile, when the feature waiting for processing has a high dimension, it needs to be executed multiple times, and these requirements limit its application.

(2) A feature selection algorithm based on a random search strategy: the method combines feature selection with Genetic Algorithm (GA), Simulated Annealing (SA), Tabu Search (TS) and the like in the searching process, and is supported theoretically through probability and a sampling process. And carrying out weight assignment on each feature to be selected according to the classification effectiveness, judging the importance of the feature according to a defined or self-adaptively obtained threshold value, and outputting the feature of which the weight exceeds the threshold value. The random search method takes the classified performance as a judgment standard or obtains a better application effect. However, the time complexity is high, and the feature set which cannot be output is the optimal feature subset.

(3) Feature selection algorithm based on meta-heuristic search strategy: it is an approximate algorithm that trades off the computational burden and the optimality of the search. And (4) generating an optimal feature subset by continuous iteration by applying reasonable heuristic rule design. According to the difference between the starting feature set and the search direction, the method can be divided into single optimal feature selection, sequence forward selection, sequence backward selection, bidirectional selection and the like. The metaheuristic search has low complexity and high execution efficiency, and is very wide in application of practical problems. However, in the feature selection process, once a feature is deleted, it cannot be withdrawn, which may cause the algorithm to fall into local optimality.

At present, feature selection algorithms are classified according to evaluation strategies, and there are mainly four feature selection algorithms based on different evaluation strategies:

(1) based on filter type (Filters)

The filtering type feature selection algorithm is completely independent of the classification algorithm and is independent of the classification performance and other parameters of the classification algorithm. The filtered feature selection algorithm may be considered a data pre-processing process. The filtering type feature selection algorithm usually uses an independent evaluation function, and different filtering type feature selection algorithms can be obtained by changing the evaluation function and the searching mode. The versatility of the filtered algorithm makes it useful to solve various feature selection problems, but the selected feature subset classification performance is often lower than other algorithms because the filtered model easily ignores the relevance of features and the feature subsets provided may contain redundant information. Commonly used filtering algorithms include information gain, mutual information, t-detection, etc.

(2) Based on the encapsulation type (Wrappers)

The packed feature selection algorithm takes advantage of the classification performance of the feature subsets to obtain the best feature subset. The packaged feature selection algorithm combines the feature selection process with the learning algorithm to find a feature subset that optimizes the classification performance of the learning algorithm. Unlike filtered feature selection algorithms, the encapsulation model generally relies on classifiers to select feature subsets, which results in higher classification accuracy, but may appear to be overfitting and some long-running time for the optimal feature subset given the classification task and learning algorithm.

(3) Based on Hybrid (Hybrid Algorithm)

The filtering type feature selection algorithm and the packaging type feature selection algorithm have advantages and disadvantages respectively, and the hybrid type feature selection algorithm provides a way for utilizing the advantages of the two algorithms. Typical hybrid feature selection algorithms utilize both independent evaluation functions and learning algorithms to evaluate feature subsets: and selecting a group of candidate optimal subsets by using an independent evaluation function, and selecting a final optimal feature subset from the candidate optimal subsets by using a learning algorithm.

(4) Embedded base (Embedded Solutions)

Some learning algorithms have fixed structures, feature selection can be embedded into the learning algorithms, and the embedded feature selection algorithms can be constructed by using the learning algorithms. The embedded model may combine the features of both the filtered and encapsulated models. For example, in the decision tree algorithm, the basic unit node of the algorithm has a selection function, each node selects a feature with high classification capability, and the generation process of the decision tree is also a feature selection process. However, building a mathematical model of an embedded feature selection classifier is quite complex.

In summary, selecting the optimal feature subset composed of relevant features most valuable for classification from the original input data, and improving the classification accuracy as much as possible is the goal that the feature selection algorithm needs to achieve. However, many current intelligent algorithms cannot cover both targets.

Disclosure of Invention

Aiming at the problem that the existing feature selection algorithm can not simultaneously cover two targets of 'selecting the most valuable optimal feature subset composed of related features from the original input data' and 'improving the classification accuracy as much as possible', the invention provides an effective mixed feature selection method based on an elite pollination algorithm and a Relieff, which can further improve the classification accuracy of the features while selecting the optimal feature subset.

The effective mixing feature selection method based on the elite pollination algorithm and the Relieff, provided by the invention, comprises the following steps of:

step 1, initializing a population consisting of M individuals by adopting a double initial population strategy based on Relieff feature sorting and randomization;

step 2, updating the population by adopting a binary elite pollination algorithm, and calculating the fitness value of each individual in the population by adopting a set fitness function to obtain a global optimal solution in the population;

step 3, searching a neighborhood of the global optimal solution by adopting a tabu search algorithm according to a set tabu table to determine a candidate solution, and updating the tabu table according to the fitness value of the candidate solution;

step 4, selecting the individual with the largest fitness value from the tabu table as an elite individual, and replacing the elite individual with the smallest fitness value in the population to form a new population;

and 5, taking the steps 2 to 4 as one iteration, and repeating the steps 2 to 4 until the current iteration number reaches the set iteration number.

Further, the step 1 specifically comprises:

step 1.1, dividing M individuals into two groups on average: a first population and a second population;

step 1.2, initializing the first group by adopting a randomization process to form a first-class initial solution, specifically: for j-th feature X in individual i in the first population_ijRandomly generating a random number r, r is in the range of 0,1]If the random number r is less than the set initialization probability P, the characteristic X is_ijIs selected, otherwise X_ijIs not selected; setting the selected features to 1 and the unselected features to 0 for each individual; taking the solution formed by the initialized first population as a first type initial solution;

step 1.3, initializing the second group by adopting a weight sorting process to form a second type of initial solution, which specifically comprises the following steps: calculating the weight of each feature corresponding to each individual in the second group according to a set Relieff weight formula, and randomly selecting a plurality of features from the top TopN features with larger weight values for each individual; taking the solution formed by the initialized second population as a second type initial solution;

and step 1.4, combining the first initial solution and the second initial solution to obtain an initial optimal solution of the population.

Further, the set initialization probability P is calculated according to equations (3) and (4):

wherein ,representing the j-th bit characteristic value in the individual i at the t-th iteration, A representing the adaptive conversion factor,C₁ and C₂Denotes the variation factor and T denotes the set number of iterations.

Further, in step 2, the updating the population by using the binary elite pollination algorithm specifically comprises:

if cross-pollination is used, individuals in the population are pollinated according to equation (5)Updating:

wherein , andrespectively representing the positions of the individual i at the t +1 th iteration and the t th iteration; f is the current global optimal solution; gamma is a scale factor; l (λ) is the step size of the Levy flight; gamma (λ) is a standard gamma function, λ ∈ [1,2 ]](ii) a And S is a moving step length.

Further, in step 2, the updating the population by using the binary elite pollination algorithm further comprises:

step 2.1, if self-pollination operation is adopted, selecting n optimal individuals from the population according to the fitness value, randomly selecting an individual m and an individual k from the selected n optimal individuals, and updating an individual i in the population according to a formula (7) to obtain a new individual i:

wherein A is an adaptive conversion factor, andrespectively representing the positions of an individual m and an individual k at the t-th iteration; c₁ and C₂Represents a change factor; t denotes the set number of iterations.

Step 2.2, calculating the fitness value of the new individual i according to a set fitness function, if the fitness value of the new individual i is larger than the fitness value of the individual i before updating, adopting the new individual i to replace the individual i before updating, and otherwise, abandoning the new individual i;

and 2.3, repeating the steps 2.1 to 2.2 until all the individuals in the population are updated.

Further, in step 2, the set fitness function is specifically:

wherein ,acc denotes the classification accuracy of the sample, num_cNumber of samples, num, correctly classified_iThe number of samples representing classification errors, N represents the number of the samples of which the fitness values are to be calculated corresponding to the selected features, N is the number of the samples of which the fitness values are to be calculated corresponding to all the features, α is the weight of classification accuracy, β is the weight of feature selection, and α + β is 1.

Further, the step 3 specifically includes:

step 3.1, setting initialization parameters: the length of initialization of a tabu table is tabuLength, and the number of generated neighborhood solutions is numNeighbor;

3.2, selecting an initial solution, wherein the initial solution is an optimal solution generated by local search in a pollination algorithm in the current iteration process;

step 3.3, if judging that the obtained current iteration times are equal to the maximum iteration times, ending the iteration process and taking the current optimal solution as the final optimal solution; otherwise, performing step 3.4;

step 3.4, generating a neighborhood solution through the current solution to form a candidate solution;

3.5, if the candidate solution is judged to be not in the tabu table and the fitness value of the candidate solution is larger than that of the initial solution, replacing the initial solution with the candidate solution, adding the candidate solution into the tabu table, and repeating the step 3.3; and if the candidate solution is judged to be in the tabu table, repeating the step 3.3.

Further, the step 4 specifically includes:

4.1, sequencing all individuals in the tabu table according to the size of the fitness value;

step 4.2, storing the individual with the maximum fitness value into an elite population;

and 4.3, updating the elite population after the current iteration process is finished, replacing the worst individual in the population by the elite individual in the elite population, and carrying out the next iteration.

The invention has the beneficial effects that:

according to the method for selecting the effective mixed characteristics based on the elite flower pollination algorithm and the Relieff, the initialization process of characteristic sorting based on the Refieff algorithm aims to select an important characteristic subset, and the elite strategy improves the convergence rate of the flower pollination algorithm. In addition to the characteristic of simplifying redundancy during initialization, the characteristics that the local search capability of the flower pollination algorithm is weak and the flower pollination algorithm is easy to fall into local optimum are also considered, so that the flower pollination algorithm is improved by adopting a tabu search and adaptive Gaussian variation strategy, the diversity of the population can be increased, and the local search performance is improved. And carrying out classification verification by combining the searched optimal feature subset with a classification algorithm and 10-fold intersection. Testing and verifying on eight public biomedical data sets, the invention can effectively simplify the number of gene expression levels and obtain high classification accuracy compared with other feature selection methods.

Drawings

FIG. 1 is a schematic flow chart of an efficient hybrid feature selection method based on an elite pollination algorithm and a Relief according to an embodiment of the present invention;

FIG. 2 is a block diagram of a random initialization solution X provided by an embodiment of the present invention_ijA schematic diagram of (a);

fig. 3 is a schematic diagram of population initialization based on ReliefF sorting according to an embodiment of the present invention;

FIG. 4 is a second schematic flow chart of an efficient hybrid feature selection method based on an elite pollination algorithm and a Relief according to an embodiment of the present invention;

FIG. 5 is a comparison diagram of the average feature number in feature selection based on different intelligent algorithms for the same data set according to an embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating a comparison of average fitness values of different intelligent algorithms based on the same data set in feature selection according to an embodiment of the present invention;

FIG. 7 is a schematic diagram illustrating a comparison of the runtime of different intelligent algorithms on feature selection based on the same data set according to an embodiment of the present invention;

FIG. 8 is a graph illustrating a comparison of average fitness values of Relieff-EFPATS, EFPATS and BCFPA according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The feature selection method based on elite pollination algorithms EFPA and Relieff provided by the invention is called Relieff-EFPATS method for short. As shown in fig. 1, the ReliefF-EFPATS method provided by the present invention includes the following steps:

s101, initializing a population consisting of M individuals by adopting a double initial population strategy based on Relieff feature sorting and randomization;

specifically, the present step includes the following substeps:

s1011, dividing M individuals into two groups on average: a first population and a second population;

s1012, initializing the first group by using a randomization process to form a first type of initial solution, specifically: for j-th feature X in individual i in the first population_ijRandomly generating a random number r, r is in the range of 0,1]If the random number r is less than the set initialization probability P, the characteristic X is_ijIs selected, otherwise X_ijIs not selected; setting the selected features to 1 and the unselected features to 0 for each individual; taking the solution formed by the initialized first population as a first type initial solution;

s1013, initializing the second group by adopting a weight sorting process to form a second type of initial solution, specifically: calculating the weight of each feature corresponding to each individual in the second group according to a set Relieff weight formula, and randomly selecting a plurality of features from the top TopN features with larger weight values for each individual; taking the solution formed by the initialized second population as a second type initial solution;

as an implementation manner, the set ReliefF weight formula is specifically:

wherein X is a training sample set, X_i∈{X₁,X₂,…X_m}; y is a class label set, Y ═ Y₁,Y₂…Y_n}. Randomly selecting an individual X from a training set_iIs of class Y_i(i ∈ n). W (f) is the weight of feature f, t is the number of iterations, diff (f, X)₁,X₂) Representing an individual X₁And individual X₂As for the difference in the characteristic f,first from and X_iSearching k nearest neighbor individuals H in individuals of the same class_jThen from and X_iFinding ktop among heterogeneous individualsNeighbor M_j(j ∈ k), this step is repeated t times.

And S1014, combining the first initial solution and the second initial solution to obtain an initial optimal solution of the population.

For example, M flowers are randomly selected as the initial population. For the problem of feature selection, binary coding is usually used to represent feature subsets, and each flower is modeled as a binary string as a candidate solution. One bit in the binary string represents one feature corresponding to one flower, and the length of the binary string represents the total number of features corresponding to the flower. Wherein, the jth bit value of "1" indicates that the jth feature is selected, and the jth bit value of "0" indicates that the jth feature is not selected. M/2 individuals were randomly initialized, and the initialization solution is shown in fig. 2. The remaining M/2 individuals are formed by weight sorting of the Relieff algorithm, and the weight calculation formula is shown below. The method comprises the steps of firstly calculating the weight of all features of each individual, selecting top TopN features, and then selecting M/2 different feature subsets from the features to form M/2 different initial populations. The initialized population based on ranking is shown in fig. 3.

In modeling each flower as a binary string, the following may be used:

and converting the positions of the pollen into binary characteristic vectors by using a Sigmoid function and a random strategy in a binary flower pollination algorithm BFPA. The Sigmoid function is a common biological Sigmoid function, also called Sigmoid growth curve, which maps all variables between (0, 1). Then, the pollen stigma of each flower is converted into a binary variable 0 or 1 by combining a random strategy, wherein '1' represents that the corresponding feature is selected, and '0' represents that the feature is abandoned. Meanwhile, an adaptive conversion factor A is introduced into the sigmoid conversion function as a calculation formula for calculating the initialization probability P, which is shown in the following formulas (3) and (4). The effect of introducing the adaptive conversion factor a is to enhance the uncertainty of converting a linear solution to a discrete solution, and also to enhance the ability to traverse the solution space. And enhances the conversion to improve the search position when converging on the optimum value at a later stage of the algorithm implementation.

wherein ,representing the j-th bit characteristic value in the individual i at the t-th iteration, A representing the adaptive conversion factor,C₁ and C₂Denotes the variation factor and T denotes the set number of iterations. Whether each feature in the new individual is selected depends on the value of the sigmoid function.

S102, updating the population by adopting a binary elite pollination algorithm, and calculating the fitness value of each individual in the population by adopting a set fitness function to obtain a global optimal solution in the population;

specifically, the binary elite flower pollination algorithm is divided into cross-pollination and self-pollination operations (also known as adaptive gaussian mutation). The method comprises the following substeps:

if cross pollination is adopted, executing the step S1021:

s1021, comparing individuals in the population according to the formula (5)Updating:

Specifically, for the cross pollination operation, in the initialization stage, various parameters including the population number n of flowers, the transformation rate p and the search space of the pollen are initialized randomly. In the pollination stage, the binary system flower pollination algorithm BFPA is iterated continuously, and the optimal solution is searched through operation operators of global pollination, clonal selection and local pollination until a convergence condition is met. The cross pollination is to spread the pollen by the flying of bees or insects, the flying process obeys Levy flying distribution, and the local pollination and the global pollination are controlled by the switching probability p E [0,1 ]. And when the conversion probability p is greater than rand, carrying out global pollination, and updating the current optimal solution according to the formula (5).

Among them, the most effective method to generate the moving step S is the Mantegna algorithm, which calculates the moving step S by using two gaussian distributions of U and V transforms:

wherein, U to N (0, sigma)_u ²),V～N(0，σ_V ²)， Due to sigma_u and σ_VMay be large or small, so the step size S of the movement of the Levy flight and the direction of flight may be randomly changed from one flower to another. This not only increases the diversity of the search space, but also improves the global optimization capability of the BFPA algorithm.

If cross pollination operation is adopted, executing the steps S1022 to S1024:

s1022, selecting n optimal individuals from the population according to the fitness value, randomly selecting an individual m and an individual k from the selected n optimal individuals, and updating the individual i in the population according to a formula (7) to obtain a new individual i:

The self-adaptive transfer factor A is used for enhancing the conversion from a linear solution to a discrete solution, and the self-adaptive transfer factor A is arranged, so that the capability of an individual jumping out of a local optimum is improved, and the convergence speed of an original flower pollination algorithm is accelerated.

S1023, calculating the fitness value of the new individual i according to a set fitness function, if the fitness value of the new individual i is larger than the fitness value of the individual i before updating, adopting the new individual i to replace the individual i before updating, and otherwise, discarding the new individual i;

the feature selection can be regarded as a multi-objective optimization problem, and an appropriate objective function (referred to as fitness function in the invention) needs to be set as an optimization objective of the algorithm. The fitness function is to achieve two contradictory goals; the minimum feature number is selected and the classification precision is improved to the maximum extent. The smaller the number of the feature subsets selected each time is, the higher the classification precision is, and the better classification effect of the provided model is proved.

Each solution is evaluated according to a proposed fitness function that depends on the search algorithm, the classifier, the number of features to obtain the optimal solution, the classification accuracy of the solution, and the number of selected features in the solution. In order to balance between the number of features (minimum) and the classification accuracy (maximum) selected in each solution, as an implementable embodiment, the fitness function set is specifically:

And S1024, repeating the steps S1022 to S1023 until all the individuals in the population are updated.

S103, searching a neighborhood of the global optimal solution by adopting a tabu search algorithm according to a set tabu table to determine a candidate solution, and updating the tabu table according to the fitness value of the candidate solution;

specifically, the present step includes the following substeps:

s1031, setting initialization parameters: initializing a tabu table with the length of tabuLength, and generating numNeighbor with the number of neighborhood solutions;

s1032, selecting an initial solution, wherein the initial solution is an optimal solution generated by local search in a pollination algorithm in the current iteration process;

s1033, if the current iteration times are judged and obtained to be equal to the maximum iteration times, ending the iteration process, and taking the current optimal solution as a final optimal solution; otherwise, go to step S1034;

s1034, randomly selecting a feature through the current solution to carry out single-point mutation so as to generate a neighborhood solution and form a candidate solution;

s1035, if the candidate solution is judged to be not in the tabu table and the fitness value of the candidate solution is larger than that of the initial solution, replacing the initial solution with the candidate solution, adding the candidate solution into the tabu table, and repeating the step S1033; if the candidate solution is judged to be in the tabu table, the step S1033 is repeated.

Tabu search is a neighborhood search algorithm that mimics the characteristics of human memory. At the heart of tabu search, with local search and tabu mechanism, at each iteration, the algorithm searches the neighborhood of the optimal solution to obtain a new solution with improved functional value.

S104, selecting the individual with the largest fitness value from the tabu table as an elite individual, and replacing the elite individual with the smallest fitness value in the population to form a new population;

specifically, the present step includes the following substeps:

s1041, sequencing all individuals in the tabu table according to the size of the fitness value;

s1042, storing the individual with the maximum fitness value into an elite population;

and S1043, updating the elite population after the current iteration process is finished, replacing the worst individual in the population by the elite individual in the elite population, and performing the next iteration.

In the improved flower pollination algorithm provided by the embodiment of the invention, because of the existence of the fitness value, a new individual is generated in the searching process, and the phenomenon keeps the population diversity on one hand, so that the algorithm has better global searching capability, but on the other hand, the convergence speed of the algorithm is also slowed down, and the calculation accuracy is reduced in limited calculation times. In order to improve the convergence speed of the algorithm, an elite strategy is introduced after each iteration is finished, in order to keep the scale of the population unchanged, elite individuals are replaced by the worst solution, and if the elite individuals are added into a new-generation population, the individuals with the minimum fitness value in the new-generation population can be eliminated.

And S105, taking the S102 to the S104 as one iteration, and repeatedly performing the S102 to the S104 until the current iteration number reaches the set iteration number.

From the above embodiments, the search process of the present invention is based on the effective mixing method of binary elite pollination algorithm combined with tabu search. The initial process of feature sorting based on the RefiefF algorithm aims to select an important feature subset, and the elite strategy improves the convergence rate of the flower pollination algorithm. In addition to the characteristic of simplifying redundancy during initialization, the characteristics that the local search capability of the flower pollination algorithm is weak and the flower pollination algorithm is easy to fall into local optimum are also considered, so that the flower pollination algorithm is improved by adopting a tabu search and adaptive Gaussian variation strategy, the diversity of the population can be increased, and the local search performance is improved.

To validate the effectiveness of the method, the selection performance of the improved flower pollination algorithm was tested using 10-fold cross validation in the following manner.

1. Data set and evaluation index

The biological data set used in this experiment is shown in table 1:

table 1: description of data sets

And evaluating the feature subsets by using a 10-fold cross validation method and combining a KNN classifier, randomly dividing the feature subsets in the process data set of the 10-fold cross validation into ten parts, taking nine parts as a training data set in turn, and taking the rest part as a test set for testing. In the experiment, all algorithms obtain the average value of ten results as the estimation of the accuracy of the algorithm.

As shown in fig. 4, feature selection was performed on the microarray dataset M × N according to the procedure shown in fig. 4, and the results of the feature selection were subjected to performance tests.

(1) Number of average feature subsets (AvgN)

Under eight biological data sets, the feature subset selection capability of different algorithms under the same data set can be judged through the number of selected feature subsets. Analysis results as shown in tables 2 and 3, from the analysis results, selecting fewer features means eliminating redundant features and reducing the search space, and the reliefff-EFPATS selection functions about 6 times less than BCROSAT, IG-GA and ISFLA. The features selected by Relieff-EFPATS are 2 times less compared to ABC.

(2) Average precision (Acc%)

Average accuracy is also an important indicator, as shown in tables 2 and 3, it can be seen that reliefff-EFPATS achieves the best average accuracy (Acc) compared to other algorithms on most data sets. For the data sets SRBCT and LungCancer, Relieff-EFPATS has Acc similar to BBHA, which achieves higher accuracy.

(3) Standard deviation (std)

In order to verify the robustness of the algorithm, the standard deviation corresponding to the average accuracy of the corresponding indexes and the selected average feature number is obtained by running for 10 times in the experiment. The standard deviation is a measure of the amplitude of a set of number changes, and it is obvious that the smaller the standard deviation is, the more stable the experimental result is proved.

(4) Average fitness value (Avgf%)

The fitness value is averaged and the two goals of maximum classification accuracy and subset optimum length of feature selection can be well balanced. As shown in FIG. 6, Relieff-EFPATS is superior to the other three algorithms in the average adaptability of ALL-AML, Colon Tumor, MLL, Lung Cancer. For the data set CNS, the average fitness of Relieff-EFPATS is slightly worse than BBHA, but significantly better than the other four algorithms.

(5) Run Time (Time)

The feature selection is to reduce the dimensionality of the original data and improve the efficiency of the search mechanism. The time consumption of feature selection for high-dimensional biological data sets is also considered here. The runtime of the algorithm depends on the convergence power of the algorithm and the size of the data set. Figure 7 shows the average calculated time comparison for all algorithms. Relieff-EFPATS converges about 3 times faster than ISFLA and IG-GA. The rapidity of Relieff-EFPATS is roughly similar to BBHA on both data sets. As can be seen from fig. 7, the proposed algorithm ReliefF-EFPATS achieves higher performance on eight baseline disease datasets in a short time. In the case of SRBCT, the execution time of Relieff-EFPATS is slightly larger than that of BBHA algorithm. The larger the sample size of the dataset, the longer the runtime. For example, Lung Cancer runs more time than other data sets. In general, the proposed Relieff-EFPATS is more time cost effective than BBHA, BCROSAT, ISFLA, IG-GA and ABC.

2. Comparison with other algorithms

(1) Comparison with other algorithms for that direction

In order to visually see the performance of the improved clone pollination algorithm, the experiment introduces different algorithms for comparison, the black Hole algorithm combines with the chi-square detection BBHA (Black Hole algorithm), the genetic algorithm GA (IG-GA), the improved Leaping algorithm ISFLA (improved shredded Frog Leaping algorithm), the Binary Artificial Bee Colony algorithm Binary Artificial Bee Colony (ABC), and the Binary coral reef algorithm Binary Coral Reefs Optimization (BCROSAT). And experiments were performed on eight biological datasets ALL-AML, ColonTumor, CNS, MLL, SRBCT, Lymphoma, Lung Cancer. The results of the experiment are shown in table 2.

Table 2: comparing Relieff-EFPATS with previous algorithms

(2) Comparison with FaefPATS and Algorithm BCFP

To further test the impact of the improvement strategy, the algorithm ReliefF-EFPATS of the invention was compared with the elite flower pollination algorithm EFPATS and the binary clonal flower pollination algorithm BCFPA. As can be seen from Table 3, the hybrid algorithm Relieff-EFPATS is for two targets: the classification accuracy and the number of selected attributes are much better than the binary elite pollination algorithm EFPATS, further proving that the Relieff algorithm is helpful for accelerating the convergence speed. In summary, ReliefF-EFPATS achieves better classification performance and has good robustness in addition to most data sets. Based on the search algorithm FPA, the new hybrid algorithm Relieff-EFPATS makes the search more efficient.

Table 3: experimental results for Relieff-EFPATS, EFPATS and BCFPA

(3) Example analysis

The effectiveness of some of the present invention in feature selection has been illustrated above by averaging the feature subset picks and the average fitness value. Next, classification accuracy and standard deviation of the Relieff-EFPATS on each data set are continuously analyzed to judge the stability. As shown in Table 2, the average accuracy of the Relieff-EFPATS is more stable with the smallest standard deviation except in the data sets SRBCT and Lung Cancer. As can be seen from the results of table 3 and fig. 8, reliefff-EFPATS have highly competitive results compared to these EFPATS and BCFPA, in addition to the lung cancer microarray dataset. The Binary Clonal Flower Pollination Algorithm (BCFPA) performed worse on both Acc and Avgf on almost all datasets.

To reveal the search process of Relieff-EFPATS, the optimal solution obtained by Relieff-EFPATS for all datasets is listed in Table 4, which further demonstrates the effectiveness of the relative advantage-based selection strategy. As can be seen from Table 4: the optimal gene can be searched by the algorithm Relieff-EFPATS. For example, leukemias ALL-AML and Lymphoma Lymphoma, gene M23197(CD33 antigen) has been identified by literature search as playing a key role in ALL-AML. In the lymphoma data set, the GENEs GENE639X and GENE1610X have a very high correlation with disease lymphomas. This further demonstrates the effectiveness of the proposed method in finding important features of high dimensional biomedical datasets.

Table 4: description of optimal solution obtained by Relieff-EFPATS

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An effective mixing feature selection method based on an elite pollination algorithm and a Relieff is characterized by comprising the following steps:

2. The method according to claim 1, wherein step 1 is specifically:

3. The method of claim 2, wherein the set initialization probability P is calculated according to equations (3) and (4):

4. The method of claim 1, wherein in step 2, said updating said population with a binary elite pollination algorithm comprises:

if cross-pollination is used, updating the individuals i in the population according to the formula (5):

wherein , andrespectively representing the positions of the individual i at the t +1 th iteration and the t th iteration; f is the current global optimal solution; gamma is a scale factor; l (λ) is the step size of the Levy flight; Γ (λ) is the standard gamma function,λ∈[1,2](ii) a And S is a moving step length.

5. The method of claim 4, wherein in step 2, said updating said population with a binary elite pollination algorithm further comprises:

6. The method according to claim 1 or 5, wherein in step 2, the set fitness function is specifically:

7. The method according to claim 1, wherein step 3 is specifically:

step 3.4, randomly selecting a feature through the current solution to carry out single-point mutation so as to generate a neighborhood solution and form a candidate solution;

8. The method according to claim 1, wherein step 4 is specifically: