CN110097169A - High-dimensional feature selection method combining ABC and CRO - Google Patents


Info

Publication number
CN110097169A
CN110097169A (application CN201910381688.5A)
Authority
CN
China
Prior art keywords
population
food source
algorithm
cro
new
Prior art date
Legal status
Pending
Application number
CN201910381688.5A
Other languages
Chinese (zh)
Inventor
阎朝坤
张戈
王建林
和婧
闫永航
Current Assignee
Henan University
Original Assignee
Henan University
Priority date
Filing date
Publication date
Application filed by Henan University
Priority to CN201910381688.5A
Publication of CN110097169A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/004 — Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 — Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]


Abstract

The present invention relates to the field of bioinformatics and discloses a high-dimensional feature selection method combining ABC and CRO, comprising: initializing a population of individuals using the artificial bee colony algorithm ABC, with the strategy of searching for the best food source, i.e., the best value of the fitness function; updating the initialized population with the chemical reaction optimization algorithm CRO, computing the fitness value of each individual with the preset fitness function, and obtaining the globally optimal solution in the population; forming an elite molecule population with an elitism strategy, updating the elite population after each iteration, and merging the elite molecules back into the population for the next iteration, until the current iteration count reaches the preset number of iterations; and verifying the performance of high-dimensional feature selection with 10-fold cross-validation and a KNN classifier. This hybrid ABC-CRO method improves the global search ability of the algorithm, enhances population diversity, and to some extent avoids falling into local optima.

Description

High-dimensional feature selection method combining ABC and CRO
Technical field
The present invention relates to the field of bioinformatics, and in particular to a high-dimensional feature selection method combining ABC and CRO.
Background art
High-dimensional data sets are common in practical applications. Although high-dimensional data can represent things more accurately, as the number of features describing the data grows and the dimensionality rises, a considerable fraction of the features may be irrelevant to the mining task or mutually redundant. Because of the high dimensionality, irrelevant features degrade the performance of data analysis tasks, and redundant features may likewise reduce their accuracy. In view of the challenges of extracting valuable information and identifying the important features of large data sets, feature selection (also called variable selection or attribute selection) has attracted interest in many fields. Feature selection is applied to a data set with known features; it attempts to identify the important features of the data and to discard irrelevant or redundant features from the original feature set. With the rapid development of information technology, traditional pattern recognition techniques can no longer cope with the large number of irrelevant features in high-dimensional, small-sample data, so improving the performance of feature selection algorithms is becoming increasingly important.
A typical feature selection process comprises the following stages: subset generation, subset evaluation, and result validation. Its goal is to remove irrelevant or redundant features and produce a small feasible subset. Common feature selection approaches fall into three classes: Filter, Wrapper, and Embedded. A Filter model uses the intrinsic distribution properties of the data as the basis for feature selection, without relying on any mining algorithm; the T-test is one example. A Wrapper model relies on a classification method to evaluate feature subsets, which gives it the highest classification accuracy among the three classes; the SA algorithm is one example. An Embedded model directly incorporates the feature selection process into a label learning algorithm, for example a decision tree. By comparison, a Wrapper model uses the performance of a machine learning algorithm as the evaluation criterion for feature selection, which makes it more flexible and more efficient when handling high-dimensional data. In recent years, Wrapper models that solve the feature selection problem with meta-heuristic methods to seek the globally optimal solution have attracted much attention.
Vieira et al. proposed a modified binary particle swarm optimization (MBPSO) for feature selection while also optimizing SVM kernel parameter settings, to predict the mortality of sepsis patients.
Subanya et al. used the binary artificial bee colony algorithm (BABC) to find the best feature subset for heart disease identification, and then evaluated the selected features with a KNN model.
Hu et al. proposed an improved shuffled frog leaping algorithm (ISFLA) for feature selection; the algorithm improves the accuracy and performance of feature selection by introducing a chaotic memory weight factor, an absolute balance group strategy, and an adaptive transfer factor.
Babatunde et al. combined a genetic algorithm (GA) with a K-nearest-neighbor (KNN) classifier for feature selection; this method achieved better results than some previous methods on several metrics such as classification accuracy.
Li et al. combined an improved grey wolf optimization algorithm (IGWO) with a kernel extreme learning machine (KELM) for feature selection. Diverse initial positions are first generated with a genetic operator, and the current positions of the population in the discrete search space are then updated with grey wolf optimization to obtain the optimal feature subset.
Although the above heuristic algorithms for feature selection each have their own advantages, no single meta-heuristic algorithm can solve all feature selection problems. We therefore need to explore new meta-heuristic search algorithms, or hybrid search algorithms, for biomedical feature selection. The present invention proposes a new meta-heuristic hybrid algorithm, AB-CRO, which combines the characteristics of the artificial bee colony algorithm (ABC) and the chemical reaction optimization algorithm (CRO) to optimize feature selection.
Summary of the invention
The present invention provides a high-dimensional feature selection method combining ABC and CRO, which can solve the above problems in the prior art.
The method comprises the following steps:
Step 1: initialize a population of individuals using the artificial bee colony algorithm ABC, with the strategy of searching for the best food source, i.e., the best value of the fitness function;
Step 2: update the initialized population using the chemical reaction optimization algorithm CRO, and compute the fitness value of each individual in the population with the preset fitness function, obtaining the globally optimal solution in the population;
Step 3: form an elite molecule population using an elitism strategy, update the elite population after each iteration, and merge the elite molecules back into the population for the next iteration;
Step 4: verify the performance of high-dimensional feature selection using the 10-fold cross-validation technique together with a KNN classifier, to assess the classification effect;
Step 5: treat steps 2 to 4 as one iteration and repeat them until the current iteration count reaches the preset number of iterations.
The step 1 specifically comprises:
Step 1.1: initialize the population using the artificial bee colony algorithm ABC, forming a new initial population and its parameters. The numbers of employed bees and onlooker bees are each equal to the number of food sources M; the colony size is NP = M; the number of employed bees is SN; NP = M food sources are generated at random; the maximum number of ABC iterations is itermax, and the maximum stagnation count is limit.
Step 1.2: initialize the population with a randomization procedure to form the initial solutions: for the j-th feature X_ij of individual i in the group, generate a random number r, r ∈ [0, 1]; if r is less than the preset initialization probability P, the feature X_ij is selected, otherwise X_ij is not selected. For each individual, selected features are set to 1 and unselected features to 0. The solutions formed by the initialized group serve as the initial solutions.
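The randomized binary initialization of step 1.2 can be sketched as follows. This is a minimal illustration; the function and parameter names are ours, not from the patent:

```python
import random

def init_population(num_sources, num_features, p_select, seed=None):
    """Randomly initialize NP = M binary food sources (step 1.2):
    bit X_ij is set to 1 (feature j selected for individual i) when a
    uniform random number r falls below the preset initialization
    probability P, and to 0 otherwise."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p_select else 0
             for _ in range(num_features)]
            for _ in range(num_sources)]

pop = init_population(num_sources=10, num_features=20, p_select=0.5, seed=1)
```

Each row of `pop` is one food source, i.e., one candidate feature subset encoded as an N-dimensional binary string.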
Step 1.3: employed bees perform a neighborhood search to generate new solutions, i.e., new food sources, and compute their fitness values. If the fitness of a new solution is greater than that of the original one, the original solution is replaced with the new solution; otherwise the original solution is kept. Better food sources are selected with a greedy rule, the selection probability of each food source is computed, a new food source is generated around each selected food source, and its fitness is evaluated; a source with higher fitness replaces one with lower fitness, updating the food sources. The best food source found so far is recorded.
If an abandoned food source exists, the corresponding employed bee becomes a scout bee, and the scout bee searches for a new food source at random.
Step 1.4: judge whether the current food source has been improved within the scheduled number of iterations. If it has not been improved, the food source must be reinitialized; otherwise return to step 1.2. This yields the optimal solution of the initial population, i.e., the optimal food source.
In step 1.1, employed bees generate SN food sources at random according to formula (1):

X_ij = X_j^min + rand(0,1) · (X_j^max − X_j^min)    (1)

where rand(0,1) is a uniform random number on (0,1), N is the dimension of the search space, i = 1, 2, ..., SN, and j = 1, 2, ..., N.
Employed bee search phase: an employed bee searches around its food source and finds a candidate new food source according to formula (2):

V_ij = X_ij + φ_ij · (X_ij − X_kj)    (2)

The food source position is then updated by the greedy rule of formula (3):

X_i = V_i, if fit(V_i) > fit(X_i); otherwise X_i is unchanged    (3)

In formula (2), φ_ij is a random number in [−1, 1], and k ∈ {1, 2, ..., SN} with X_ij ≠ X_kj. Old and new food sources are compared through the greedy rule of formula (3), i.e., by their fitness values: if the fitness of the new food source V_ij is better than that of the old food source X_ij, the position of V_ij replaces the position of X_ij; otherwise the old food source X_ij remains the current position, and the employed bee adds 1 to the stagnation count of that food source.
Onlooker bee following probability:

P_i = fit_i / Σ_{k=1}^{SN} fit_k    (4)

In formula (4), k ∈ {1, 2, ..., SN}. In this phase, an onlooker bee selects a food source by roulette-wheel selection according to the food source information gathered by the employed bees and searches around it; if it chooses to follow an employed bee, it searches for a new food source in the same way as in the employed bee search phase.
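The employed-bee neighborhood step and the onlooker roulette probabilities can be sketched as below. Formulas (2) and (4) are stated for continuous ABC; the thresholding at 0.5 used here to binarize the perturbed component is our assumption for the binary feature-selection encoding, and all names are illustrative:

```python
import random

def follow_probabilities(fit_values):
    """Onlooker-bee roulette probabilities per formula (4):
    P_i = fit_i / sum_k fit_k."""
    total = sum(fit_values)
    return [f / total for f in fit_values]

def employed_bee_search(X, i, rng):
    """One neighborhood step per formula (2), binarized:
    v = x_ij + phi * (x_ij - x_kj), then thresholded at 0.5."""
    SN, N = len(X), len(X[0])
    k = rng.choice([m for m in range(SN) if m != i])  # partner source, k != i
    j = rng.randrange(N)                              # perturbed dimension
    phi = rng.uniform(-1.0, 1.0)
    V = list(X[i])
    v = V[j] + phi * (V[j] - X[k][j])
    V[j] = 1 if v > 0.5 else 0                        # binarization (assumed)
    return V

probs = follow_probabilities([1.0, 3.0])
rng = random.Random(0)
V = employed_bee_search([[0, 1, 0], [1, 0, 1]], 0, rng)
```

The greedy rule of formula (3) would then keep `V` only if its fitness exceeds that of the original food source.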
In step 2, updating the population with the chemical reaction optimization algorithm CRO specifically comprises:
taking the initial population formed by the ABC algorithm and applying the four elementary reaction operators of the CRO algorithm, updating the population to obtain the globally optimal solution.
In step 2, the CRO population update specifically comprises:
Step 2.1: set the initialization parameters: the initial kinetic energy of a molecule InitialKE; the central energy buffer Buffer; the decomposition threshold α; the synthesis threshold β; the kinetic energy loss rate KELossRate of an on-wall ineffective collision; and the molecular kinetic energy KE and potential energy PE.
Step 2.2: apply the four CRO operators: on-wall ineffective collision; decomposition, a unimolecular reaction corresponding to a global search process; inter-molecular ineffective collision, a chemical reaction between two molecules that performs a local search to obtain new solutions; and synthesis, a chemical reaction process between two molecules.
Step 2.3: evaluate the fitness of each individual with the KNN classifier; if the fitness of a new individual is greater than the fitness before the update, replace the old individual with the new one, otherwise discard the new individual.
Step 2.4: repeat steps 2.1 to 2.3 until every individual in the population has been updated.
In step 2.3, the function used to evaluate the fitness of each individual, i.e., of each selected feature subset, with the KNN classifier is formula (5):

fitness = δ · Acc + θ · (1 − n/N)    (5)

where Acc = num_c / (num_c + num_i) is the classification accuracy over the samples, num_c is the number of correctly classified samples, num_i is the number of misclassified samples, n is the number of features selected by the individual being evaluated, N is the total number of features, δ is the weight of classification accuracy, θ is the weight of feature selection, and δ + θ = 1.
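A direct transcription of this fitness function follows. The weighting form δ·Acc + θ·(1 − n/N) is reconstructed from the variables listed for formula (5); the default δ = 0.9 is an illustrative choice, not a value from the patent:

```python
def fitness(num_correct, num_wrong, n_selected, n_total, delta=0.9):
    """Fitness per formula (5): delta * Acc + theta * (1 - n/N),
    with Acc = num_c / (num_c + num_i) and theta = 1 - delta."""
    theta = 1.0 - delta
    acc = num_correct / (num_correct + num_wrong)   # classification accuracy
    return delta * acc + theta * (1.0 - n_selected / n_total)

# 90% accuracy using 20 of 100 features, delta = 0.9
f = fitness(num_correct=90, num_wrong=10, n_selected=20, n_total=100)
```

With these weights, improving accuracy dominates, but among equally accurate subsets the smaller one scores higher.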
The step 3 specifically comprises:
selecting the 5 molecules with the smallest potential energy PE and storing them in an elite molecule population; the elite molecule population is updated after each iteration, and the elite molecules are merged back into the population for the next iteration.
Note that, because of energy conservation before and after a reaction, after the elite population is added, the energy of the elite molecules must be deducted from the central energy buffer:

buffer = buffer − PE(ω_elite) − KE(ω_elite)

where PE(ω_elite) is the potential energy and KE(ω_elite) the kinetic energy of an elite molecule.
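The elite retention of step 3, including the buffer deduction above, can be sketched as follows (names are illustrative; the per-elite deduction for all k elites is our reading of the conservation rule):

```python
def retain_elites(population, pe, ke, buffer, k=5):
    """Step 3: keep the k molecules with the smallest potential energy PE
    as elites, and deduct each elite's energy from the central buffer:
    buffer -= PE(w_elite) + KE(w_elite)."""
    order = sorted(range(len(population)), key=lambda i: pe[i])[:k]
    elites = [list(population[i]) for i in order]
    for i in order:
        buffer -= pe[i] + ke[i]
    return elites, buffer

pop = [[0, 1], [1, 1], [1, 0], [0, 0]]
pe = [4.0, 1.0, 3.0, 2.0]
ke = [0.5, 0.5, 0.5, 0.5]
elites, buf = retain_elites(pop, pe, ke, buffer=100.0, k=2)
```

Here the two lowest-PE molecules are kept and the buffer drops from 100.0 to 96.0.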
The step 4 specifically comprises:
To verify the validity of the results, the samples are validated with 10-fold cross-validation and a KNN classifier. The original sample set is randomly divided into 10 parts; each part in turn serves as the test set while the remaining parts serve as the training set, and the average of the 10 results is computed to assess the classification effect. KNN (K-Nearest Neighbor) is a statistical classification method that is particularly effective for screening the feature variables of data.
Compared with the prior art, the beneficial effects of the present invention are:
The present invention combines the convergence speed of the ABC algorithm with the global search ability of the CRO algorithm and proposes the hybrid algorithm AB-CRO. The elitism strategy improves the convergence speed of the CRO algorithm, CRO improves the global search ability of the ABC algorithm, and randomness is introduced during the molecular reactions, which prevents molecules from falling into local optima during the reaction. The optimal feature subset found by the search is fed into the classification algorithm with 10-fold cross-validation for classification verification. Experiments on eight public biomedical data sets show that the algorithm effectively reduces the number of selected features and achieves higher classification accuracy than other feature selection methods.
Brief description of the drawings
Fig. 1 is a flow diagram of the high-dimensional feature selection method combining ABC and CRO provided by the present invention.
Fig. 2 shows the initialization and molecular reactions of the AB-CRO algorithm of the present invention.
Fig. 2(a) shows the binary form of an initial solution vector X_i of the present invention.
Fig. 2(b) is a schematic diagram of the on-wall ineffective collision of the present invention.
Fig. 2(c) is a schematic diagram of the unimolecular decomposition reaction of the present invention.
Fig. 2(d) is a schematic diagram of the inter-molecular ineffective collision of the present invention.
Fig. 2(e) is a schematic diagram of the inter-molecular synthesis reaction of the present invention.
Fig. 3 compares the average fitness values of the different algorithms of the present invention.
Fig. 4 shows the ratio of the average number of selected features to the total number of features for the different algorithms of the present invention.
Fig. 5 compares the average running time of different classifiers on the given data sets.
Specific embodiment
The specific embodiments of the present invention are described in detail below with reference to Figs. 1-5. It is to be understood that the protection scope of the present invention is not limited to the specific embodiments.
As shown in Fig. 1, the present invention provides a high-dimensional feature selection method combining ABC and CRO, characterized by comprising the following steps:
Step 1: initialize a population of individuals using the artificial bee colony algorithm ABC, with the strategy of searching for the best food source, i.e., the best value of the fitness function;
Step 2: update the initialized population using the chemical reaction optimization algorithm CRO, and compute the fitness value of each individual in the population with the preset fitness function, obtaining the globally optimal solution in the population;
Step 3: form an elite molecule population using an elitism strategy, update the elite population after each iteration, and merge the elite molecules back into the population for the next iteration;
Step 4: verify the performance of high-dimensional feature selection using 10-fold cross-validation together with a KNN classifier;
Step 5: treat steps 2 to 4 as one iteration and repeat them until the current iteration count reaches the preset number of iterations.
The specific implementation process of the present invention is as follows:
S101: perform initialization with ABC to form a new initial population and its parameters; the numbers of employed bees, onlooker bees, etc. are set equal to the population size M.
Each solution is represented as an N-dimensional binary string, as shown in Fig. 2(a).
S1011: initialize the population with a randomization procedure to form the initial solutions: for the j-th feature X_ij of individual i in the group, generate a random number r, r ∈ [0, 1]; if r is less than the preset initialization probability P, the feature X_ij is selected, otherwise X_ij is not selected. For each individual, selected features are set to 1 and unselected features to 0. The solutions formed by the initialized group serve as the initial solutions.
S1012: employed bees perform a neighborhood search to generate new solutions and compute their fitness values; if the fitness of a new solution is greater than that of the original solution, the solution is updated, otherwise it is unchanged. Better food sources are selected with a greedy method, the selection probability of each food source is computed, a new food source is generated around each selected food source, and its fitness is evaluated; a source with higher fitness replaces one with lower fitness, updating the food sources. Scout bees are generated and each scout bee is assigned a feature subset. If an abandoned food source exists, the corresponding employed bee becomes a scout bee, and the scout bee searches for a new food source at random.
Employed bees generate SN food source positions at random according to formula (1):

X_ij = X_j^min + rand(0,1) · (X_j^max − X_j^min)    (1)

where rand(0,1) is a uniform random number on (0,1) and N is the dimension of the search space.
Employed bee search phase: an employed bee searches around its food source and finds a candidate new food source according to formula (2):

V_ij = X_ij + φ_ij · (X_ij − X_kj)    (2)

The food source position is then updated by the greedy rule of formula (3):

X_i = V_i, if fit(V_i) > fit(X_i); otherwise X_i is unchanged    (3)

In formula (2), φ_ij is a random number in [−1, 1], and k ∈ {1, 2, ..., SN} with X_ij ≠ X_kj. Old and new food sources are compared through the greedy rule of formula (3), i.e., by their fitness values: if the fitness of the new food source V_ij is better than that of the old food source X_ij, the position of V_ij replaces the position of X_ij; otherwise the old food source X_ij remains the current position, and the employed bee adds 1 to the stagnation count of that food source.

Onlooker bee following probability:

P_i = fit_i / Σ_{k=1}^{SN} fit_k    (4)

In formula (4), k ∈ {1, 2, ..., SN}. In this phase, an onlooker bee selects a food source by roulette-wheel selection according to the food source information gathered by the employed bees and searches around it; if it chooses to follow an employed bee, it searches for a new food source in the same way as in the employed bee search phase.
S1013: feature selection can be regarded as a multi-objective optimization problem that needs a suitable objective function (called the fitness function in the present invention) as the optimization target of the algorithm. Two conflicting objectives must be achieved: selecting the smallest number of features while maximizing classification accuracy. The fewer features a subset contains and the higher its classification accuracy, the better the classification effect of the proposed model.
Each solution is assessed by the proposed fitness function, which depends on the search algorithm and the classifier, so as to obtain both the classification accuracy of the candidate solution and the number of features it selects. To balance the number of selected features (to be minimized) against the classification accuracy (to be maximized) in each solution, we use the fitness function of formula (5):

fitness = δ · Acc + θ · (1 − n/N)    (5)

where Acc = num_c / (num_c + num_i) is the classification accuracy over the samples, num_c is the number of correctly classified samples, num_i is the number of misclassified samples, n is the number of features selected by the individual being evaluated, N is the total number of features, δ is the weight of classification accuracy, θ is the weight of feature selection, and δ + θ = 1.
S1014: if an abandoned food source (feature subset) exists, the corresponding employed bee becomes a scout bee and generates a new feature subset. Judge whether the current food source has been improved within the scheduled number of iterations; if not, the food source must be reinitialized, otherwise return to step S1012. This yields the optimal solution of the initial population, i.e., the optimal food source.
S1015: repeat steps S1012 to S1014 until all individuals in the population have been updated.
S102: take the best food sources formed by the ABC initialization as the initial population of the CRO algorithm. Update the population with the chemical reaction optimization algorithm CRO, compute the fitness of each individual in the population with the preset fitness function, and obtain the globally optimal solution in the population.
S1021: set the initialization parameters: the initial kinetic energy of a molecule InitialKE; the central energy buffer Buffer; the decomposition threshold α; the synthesis threshold β; the kinetic energy loss rate KELossRate of an on-wall ineffective collision; and the molecular kinetic energy KE and potential energy PE.
S1022: apply the four CRO operators: on-wall ineffective collision; decomposition (a unimolecular reaction, corresponding to a global search process); inter-molecular ineffective collision (a chemical reaction between two molecules, a local search performed to obtain new solutions); and synthesis (a chemical reaction process between two molecules).
While the current iteration count i is less than the maximum number of iterations itermax, a CRO search is applied to the population. A random number r is generated. If r > MoleColl, a unimolecular reaction occurs: if the decomposition condition NumHit − MinHit > α is met, decomposition is performed, otherwise an on-wall collision. If r ≤ MoleColl, an inter-molecular reaction occurs: if the synthesis condition KE < β is met, synthesis is performed, otherwise an inter-molecular ineffective collision.
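The dispatch logic described in S1022 can be sketched as a small selector. Applying the KE < β synthesis check to both colliding molecules is our reading (the text states only KE < β), and all parameter names are illustrative:

```python
def choose_reaction(r, mole_coll, num_hit, min_hit, alpha, ke1, ke2, beta):
    """Select one of the four CRO elementary reactions:
      r > MoleColl  -> unimolecular: decomposition when NumHit - MinHit > alpha,
                       otherwise an on-wall ineffective collision;
      r <= MoleColl -> inter-molecular: synthesis when both kinetic energies
                       are below beta, otherwise an ineffective collision."""
    if r > mole_coll:
        if num_hit - min_hit > alpha:
            return "decomposition"
        return "on_wall_ineffective_collision"
    if ke1 < beta and ke2 < beta:
        return "synthesis"
    return "inter_molecular_ineffective_collision"

kind_a = choose_reaction(r=0.8, mole_coll=0.5, num_hit=12, min_hit=3,
                         alpha=5, ke1=0.0, ke2=0.0, beta=10.0)
kind_b = choose_reaction(r=0.2, mole_coll=0.5, num_hit=0, min_hit=0,
                         alpha=5, ke1=20.0, ke2=1.0, beta=10.0)
```

In a full implementation, this selector would be called once per iteration before applying the chosen operator to the sampled molecule(s).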
The four elementary reactions of the CRO algorithm:
(1) On-wall ineffective collision
An on-wall ineffective collision takes a given molecular structure ω, randomly selects one position of ω, and randomly sets it to 0 or 1 to produce a new molecule ω′. If PE_ω′ ≤ PE_ω, ω′ replaces ω in the population for the subsequent chemical reaction iterations; otherwise ω′ is discarded, as shown in Fig. 2(b).
(2) Decomposition
Decomposition is a unimolecular reaction corresponding to a global search process, as shown in Fig. 2(c): one molecule ω produces two new molecular structures ω′1 and ω′2. The operator first copies ω into both ω′1 and ω′2, then changes half of the structure of ω′1 and half of the structure of ω′2 with random binary values, and decides whether to keep the newly generated molecules according to the energy difference of the molecules before and after the reaction.
The energy difference is denoted tempBuff. If tempBuff ≥ 0, or tempBuff plus the energy stored in the energy buffer satisfies tempBuff + energyBuff ≥ 0, ω′1 and ω′2 are added to the population and ω is removed; otherwise ω′1 and ω′2 are deleted. In feature selection, the purpose of decomposition is to increase molecular diversity.
(3) Inter-molecular ineffective collision
An inter-molecular ineffective collision is a chemical reaction between two molecules, a local search performed to obtain new solutions. The operator randomly selects two molecules ω1 and ω2 from the population, randomly chooses one position in each molecule, and mutates it to produce the new molecular structures ω′1 and ω′2. If the energy change before and after the reaction satisfies the collision energy condition, the original two molecules are deleted and the two new molecules are added to the population; otherwise the newly generated molecules are discarded.
Like the on-wall ineffective collision operator, the inter-molecular ineffective collision uses single-bit mutation; the generated molecular structures are shown in Fig. 2(d).
(4) Synthesis
Synthesis is a chemical reaction process between two molecules: two existing molecules ω1 and ω2 are combined to generate a new molecule ω′. Half of the structure of ω′ comes from ω1 and the other half from the corresponding structure of ω2. The difference in molecular energy before and after the reaction determines whether the synthesis actually takes place: if the synthesis energy condition is met, ω′ is added to the population and ω1 and ω2 are removed; otherwise ω′ is deleted.
The new molecule ω′ generated by the synthesis operation is shown in Fig. 2(e).
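The structural part of the four operators, acting on binary strings as in Fig. 2, can be sketched as follows. The energy bookkeeping (PE/KE comparison, tempBuff, the buffer) is omitted, and which half of each copy is randomized in decomposition is our assumption:

```python
import random

def flip(mol, j):
    """Single-bit mutation used by both ineffective-collision operators."""
    out = list(mol)
    out[j] = 1 - out[j]
    return out

def on_wall(mol, rng):
    """(1) On-wall ineffective collision: flip one random bit of w -> w'."""
    return flip(mol, rng.randrange(len(mol)))

def decompose(mol, rng):
    """(2) Decomposition: copy w into w1', w2', then randomly regenerate
    half of each copy to produce two diverse new molecules."""
    n, half = len(mol), len(mol) // 2
    w1, w2 = list(mol), list(mol)
    for j in range(half):            # first half of w1' randomized
        w1[j] = rng.randint(0, 1)
    for j in range(half, n):         # second half of w2' randomized
        w2[j] = rng.randint(0, 1)
    return w1, w2

def inter_molecular(m1, m2, rng):
    """(3) Inter-molecular ineffective collision: one-bit mutation each."""
    return (flip(m1, rng.randrange(len(m1))),
            flip(m2, rng.randrange(len(m2))))

def synthesize(m1, m2):
    """(4) Synthesis: w' takes half its structure from w1, half from w2."""
    half = len(m1) // 2
    return m1[:half] + m2[half:]

rng = random.Random(0)
w = [0, 1, 0, 1]
w_prime = on_wall(w, rng)
child = synthesize([0, 0, 0, 0], [1, 1, 1, 1])
```

In the full algorithm each operator's output would be accepted or rejected according to the corresponding energy condition before replacing molecules in the population.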
S1023: evaluate the fitness of each individual with the KNN classifier; if the fitness of a new individual is greater than the fitness before the update, replace the old individual with the new one, otherwise discard the new individual.
S103: select the 5 molecules with the smallest potential energy PE and store them in an elite molecule population; update the elite molecule population after each iteration, and merge the elite molecules back into the population for the next iteration.
Note that, because of energy conservation before and after a reaction, after the elite population is added, the energy of the elite molecules must be deducted from the central energy buffer:

buffer = buffer − PE(ω_elite) − KE(ω_elite)
In the AB-CRO algorithm provided by the embodiments of the present invention, fitness-based acceptance causes new individuals to be generated during the search. On the one hand, this phenomenon maintains population diversity and gives the algorithm better global search ability; on the other hand, it also slows the convergence of the algorithm and reduces the accuracy attainable within a limited number of evaluations. To improve convergence speed, an elitism strategy is introduced at the end of each iteration: to keep the group size constant, elite individuals replace the worst solutions, and when an elite individual is added to the new generation, the individual with the smallest fitness value in the new generation is eliminated.
S104: use the 10-fold cross-validation technique together with a KNN classifier to verify the search performance of the algorithm and assess the classification effect. The original sample set is randomly divided into 10 parts; each part in turn serves as the test set while the remaining parts serve as the training set, and the average accuracy of the 10 classifiers is computed to assess the classification effect. KNN (K-Nearest Neighbor) is a statistical classifier that is particularly effective for screening the feature variables of data. First, the distances between the feature vector to be classified and the training feature vectors are computed and sorted, and the K nearest training samples are taken. The class of the new sample is then determined from the classes of these K nearest training samples: if they all belong to one class, the new sample also belongs to that class; otherwise, each candidate class is scored, and the class of the new sample is determined according to a given rule.
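The validation procedure of S104 can be sketched from scratch: a plain KNN with majority vote and a 10-fold split that averages per-fold accuracy. This is a minimal illustration, not the patent's implementation:

```python
import random
from collections import Counter

def knn_predict(train, labels, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (squared Euclidean distance)."""
    order = sorted(range(len(train)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cross_validate(X, y, folds=10, k=3, seed=0):
    """Split the sample set into `folds` parts; each part serves once as
    the test set while the rest train the KNN; return the mean accuracy."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::folds] for i in range(folds)]
    accs = []
    for p in parts:
        test = set(p)
        tr_X = [X[i] for i in idx if i not in test]
        tr_y = [y[i] for i in idx if i not in test]
        correct = sum(knn_predict(tr_X, tr_y, X[i], k) == y[i] for i in p)
        accs.append(correct / len(p))
    return sum(accs) / len(accs)

# two well-separated clusters: KNN should classify them perfectly
X = ([[0.1 * i, 0.0] for i in range(20)]
     + [[5.0 + 0.1 * i, 5.0] for i in range(20)])
y = [0] * 20 + [1] * 20
acc = cross_validate(X, y, folds=10, k=3)
```

In the patent's setting, `X` would be restricted to the columns selected by a candidate feature subset before scoring it with formula (5).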
S105: Taking S101 to S104 as one iteration, repeat S101 to S104 until the current iteration count reaches the set number of iterations.
As the above embodiment shows, the search process of the present invention is an effective hybrid of the artificial bee colony algorithm (ABC) and the chemical reaction optimization algorithm (CRO). The initialization procedure, based on ABC feature ranking, is intended to select an important feature subset, while the elitism strategy improves the convergence speed of CRO. Besides pruning redundant features during initialization, the method also accounts for the weak local search ability of ABC and its tendency to fall into local optima by carrying out the search with the CRO algorithm, which increases population diversity and improves global search performance.
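The overall flow of the embodiment can be summarized by the following skeleton, in which the ABC initializer, the CRO update and the elitism step are passed in as functions (the stubs used below are illustrative placeholders, not the patent's exact operators):

```python
# Skeleton of the hybrid AB-CRO search loop: ABC builds the initial
# population, CRO updates it each iteration, and elitism re-injects the
# best solutions before the next iteration.
def ab_cro(init_abc, update_cro, fitness, elitism, max_iters):
    population = init_abc()              # step 1: ABC-based initialization
    best = max(population, key=fitness)  # current global best
    elites = []
    for _ in range(max_iters):
        population = update_cro(population)      # step 2: CRO operators
        population = elitism(population, elites)  # step 3: elitism
        candidate = max(population, key=fitness)  # step 4: evaluation
        if fitness(candidate) > fitness(best):
            best = candidate
    return best
```

With deterministic toy stubs (a fixed initial population, a unit "search move" per iteration, and a no-op elitism step), the loop tracks the best solution seen so far across iterations.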
IV. Experimental setup and analysis of results
All data sets used in the experiments of the present invention are two-class problems; each sample is mapped into the set of positive samples or negative samples. TP (True Positive), FP (False Positive), TN (True Negative) and FN (False Negative) counts are used to assess the performance of the model, and a KNN classifier is used as the classifier.
(1) Data description
The algorithm was tested on eight publicly available biomedical data sets, obtained from http://csse.szu.edu.cn/staff/zhuzx/Datasets.htm and http://leo.ugr.es/elvira/DBCRepository/. A basic description of the data sets is given in Table 1.
Table 1. Description of the data sets
Parameter setting
We constructed an orthogonal experiment to choose the optimal parameter combination. The orthogonal array of this 4-level, 5-factor experimental design is L16(4^5): the 5 factors represent the 5 selected parameters, each parameter takes values from 4 candidate settings, and there are 16 runs in total. Table 2 records the parameter values of all compared algorithms.
Table 2. Parameter settings
(2) Evaluation method
Ten-fold cross-validation
To verify the validity of the method, 10-fold cross-validation (10-fold Cross Validation Technique) combined with a KNN classifier was used in this experiment to verify the search performance of the algorithm. The original sample set is randomly divided into 10 parts; each part is used in turn as the test set while the remaining parts form the training set, the average accuracy of the 10 classifiers is computed, and the classification effect is assessed. For the fairness of the experiment, the experiment for every algorithm was likewise repeated 10 times, and the averages of all indices are reported as the final results in Tables 3 and 4.
In medical diagnosis, a higher sensitivity for a disease indicates a relatively lower missed-diagnosis rate, while a higher specificity indicates a relatively lower misdiagnosis rate.
(3) Evaluation indices
The experiments are assessed by mean accuracy (Acc%), sensitivity (Sensitivity), specificity (Specificity), average feature-subset size (AvgN), standard deviation (std), average fitness value (Avgf%) and running time (Time).
Accuracy (Acc) is the number of correctly classified samples divided by the total number of samples; it measures the quality of the classification and is described as follows:

Acc = (TP + TN) / (TP + FP + TN + FN)
Sensitivity is the proportion of positive samples that are correctly classified, Sensitivity = TP / (TP + FN); it measures the classifier's ability to recognize positive samples.
Specificity is the proportion of negative samples that are correctly classified, Specificity = TN / (TN + FP); it measures the classifier's ability to recognize negative samples.
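The three indices follow directly from the TP/FP/TN/FN counts defined above; a minimal sketch:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (true-positive rate) and specificity
    (true-negative rate) from the confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)  # correctly classified / all samples
    sensitivity = tp / (tp + fn)           # recognized positives / all positives
    specificity = tn / (tn + fp)           # recognized negatives / all negatives
    return acc, sensitivity, specificity
```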
(4) interpretation of result
To verify the validity of the algorithm, the inventive algorithm AB-CRO is compared with the modified particle swarm optimization algorithm (Modified Particle Swarm Algorithm), the improved shuffled frog leaping algorithm ISFLA (Improved Shuffled Frog Leaping Algorithm) and the genetic algorithm GA (Genetic Algorithm).
Average feature-subset size (AvgN)
On the eight biological data sets, the feature-subset selection ability of the different algorithms on the same data set can be judged from the number of selected features. The analysis results are shown in Tables 3 and 4. Selecting fewer features means that redundant features have been eliminated and the search space reduced. In terms of average feature count, the feature subset chosen by the AB-CRO algorithm is the smallest on every data set except NervousSystem and DLBCL-Stanford.
Mean accuracy (Acc%)
Mean accuracy is another important index. As shown in Tables 3 and 4, compared with the other algorithms, the AB-CRO algorithm achieves the best mean accuracy (Acc) on most data sets.
Standard deviation (std)
To verify the robustness of the algorithm, each experiment was run 10 times and the standard deviations of the mean accuracy and of the average number of selected features were computed. The standard deviation measures the spread of a group of numbers: evidently, the smaller it is, the more stable the experimental results. On the data sets ALL-AML_train, ColonTumor, DLBCLOutcome, lungCancer_train and DLBCL-NIH-train, the accuracy of the AB-CRO algorithm has a smaller standard deviation than the other two algorithms, which also shows that the inventive algorithm is more stable.
Average fitness value (Avgf%)
The average fitness value balances well the two objectives of feature selection: maximum classification accuracy and optimal subset length. Fig. 3 shows the fitness values of AB-CRO compared with all the other algorithms. It can clearly be seen that for the data sets DLBCL-Stanford, lungCancer_train and LungCancer-Ontario the fitness values of AB-CRO and CRO are essentially the same; on the other five data sets the fitness value of AB-CRO is slightly higher than that of the other algorithms. As for the number of selected features, Fig. 4 reports the percentage of the total feature count occupied by the finally chosen optimal feature subset; the proposed AB-CRO shows the sharpest feature reduction on most data sets (except DLBCL-Stanford and NervousSystem). Although in Fig. 3 the fitness value of AB-CRO is roughly similar to that of CRO on most data sets, Fig. 4 shows that the optimal feature subset selected by AB-CRO is significantly smaller than that of CRO. Together these two indices further demonstrate the superiority of the inventive algorithm for feature selection on high-dimensional biomedical data.
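The fitness that trades classification accuracy against subset length can be sketched as below, using the weighted form given in claim 6 (fitness = δ·Acc + θ·(N − n)/N with δ + θ = 1); the default weight here is only an illustrative assumption, as the patent sets the weights experimentally:

```python
def fitness(acc, n_selected, n_total, delta=0.9):
    """Weighted fitness balancing classification accuracy against the
    fraction of features kept; delta weights accuracy, theta = 1 - delta
    weights subset compactness."""
    theta = 1.0 - delta
    return delta * acc + theta * (n_total - n_selected) / n_total
```

For fixed accuracy, a smaller feature subset yields a strictly higher fitness, which is exactly the pressure toward compact subsets described above.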
Running time (Time)
Feature selection reduces the dimensionality of the raw data and improves the efficiency of the search mechanism. Here the time cost of feature selection on high-dimensional biological data sets is considered. The running time of an algorithm depends on its convergence ability and on the scale of the data set. Fig. 5 compares the running times of ABC, CRO and the inventive AB-CRO on the 8 data sets. As shown in Fig. 5, the inventive AB-CRO algorithm took 846 seconds on the DLBCL-NIH-train data set while the ABC algorithm needed 864 seconds, so on this large data set (DLBCL-NIH-train: 160 instances, 7400 features) its running time is better than that of ABC. Although the running time on the other 7 data sets is slightly higher than that of the other two algorithms, it can be kept within 100 seconds, so the fused AB-CRO algorithm can satisfy real-time requirements.
Table 3 clearly shows that the mean accuracy (Acc), Sensitivity and Specificity of the AB-CRO algorithm are higher than those of the other meta-heuristic algorithms. For the sensitivity and specificity indices, the data sets ALL-AML_train, DLBCL-Stanford and lungCancer_train reach 97% or more, and the missed-diagnosis and misdiagnosis rates on the other disease data sets are also relatively low.
Table 3. Experimental results compared with other algorithms
We also compared the inventive algorithm AB-CRO with the original ABC and CRO algorithms, as shown in Table 4.
On most data sets, the AB-CRO algorithm not only achieves higher accuracy than the original algorithms but also outperforms them in the number of selected features, which further shows that initialization with the ABC algorithm is better than random initialization. Although AB-CRO selects fewer features, the stability of the feature count is still insufficient on some data sets, such as ColonTumor and NervousSystem. For the Sensitivity and Specificity indices, the proposed algorithm is clearly higher than the original algorithms on most data sets, and the standard deviations of these two indices are also lower than those of ABC and CRO (except on ColonTumor, lungCancer_train and DLBCL-NIH-train). In summary, the search performance of AB-CRO is higher than that of the original ABC and CRO.
Table 4. Experimental results compared with the original algorithms
(5) Influence of different classifiers on the algorithm
Through comparison with the different algorithms, the AB-CRO algorithm achieves good classification performance in disease data analysis. Besides the KNN classifier used to assess AB-CRO, two other popular classifiers, SVM and NB, were also used to assess the performance of the algorithm; the experimental results are shown in Table 5.
Table 5 clearly shows that, with the classifiers KNN and SVM, the accuracy index and the average selected-feature count of the AB-CRO algorithm are essentially consistent on the six data sets ALL-AML_train, ColonTumor, NervousSystem, DLBCLOutcome, lungCancer_train and LungCancer-Ontario, which illustrates the validity of the inventive algorithm. Of the two classifiers, KNN is the more stable: the accuracy standard deviation of the KNN classifier is the smallest on all data sets. KNN is also more advantageous on the sensitivity and specificity indices on most data sets. On all data sets the NB classifier has the worst experimental results, and its results are less stable than those of the KNN and SVM classifiers; choosing a suitable classifier will therefore improve the classification performance of the algorithm.
Table 5. Influence of different classifiers on the algorithm
The advantages of the technical solution provided by the present invention are:
1. The global search ability of the algorithm is improved.
2. Population diversity is enhanced.
3. Falling into local optima is avoided to a certain degree.
Disclosed above are only several specific embodiments of the present invention; however, the embodiments of the present invention are not limited thereto, and any variation that those skilled in the art can conceive shall fall within the protection scope of the present invention.

Claims (8)

1. A high-dimensional feature selection method mixing ABC and CRO, characterized by comprising the following steps:
Step 1: initialize the population of individuals with the artificial bee colony algorithm (ABC), using the strategy of finding the best food source, i.e. the fitness function;
Step 2: update the initialized population with the chemical reaction optimization algorithm (CRO), and compute the fitness value of each individual in the population with the set fitness function to obtain the global optimal solution of the population;
Step 3: form an elite molecule population with the elite retention strategy, update the elite population after each iteration, and merge the obtained elite molecules into the population for the next iteration;
Step 4: verify the performance of the high-dimensional feature selection using 10-fold cross-validation combined with a KNN classifier;
Step 5: taking steps 2 to 4 as one iteration, repeat steps 2 to 4 until the current iteration count reaches the set number of iterations.
2. The high-dimensional feature selection method mixing ABC and CRO according to claim 1, characterized in that step 1 specifically comprises:
Step 1.1: initialize the population with the artificial bee colony algorithm (ABC) to form the new initial population, and set the parameters: the numbers of employed bees and onlooker bees are equal to the number of food sources M, the colony size is NP = M, the number of employed bees is SN, NP = M food sources are generated at random, the maximum number of ABC iterations is itermax, and the maximum stagnation count is limit;
Step 1.2: initialize the population by a randomization process to form the initial solutions: for the j-th feature X_ij of individual i in the population, generate a random number r, r ∈ [0, 1]; if r is less than the set initialization probability P, feature X_ij is selected, otherwise X_ij is not selected; for each individual, set selected features to 1 and unselected features to 0; the solutions formed by the initialized population serve as the initial solutions;
Step 1.3: let the employed bees perform a neighborhood search to generate new food sources as new solutions and compute the fitness values of the new solutions; if the fitness value of a new solution is greater than that of the original initialized solution, replace the original solution with the new one, otherwise keep the original solution unchanged; select the better food sources with a greedy algorithm, compute the probability of each food source being selected, generate a new food source around the selected food source, compute the fitness value of the new food source, replace the lower-fitness source with the higher-fitness one, and update the food sources; record the best food source found so far;
judge whether any food source has been abandoned; if so, the employed bee turns into a scout bee, and the scout bee searches for a new food source at random;
Step 1.4: judge whether the current food source has been improved within the predetermined number of iterations; if it has not been improved, the food source needs to be re-initialized, otherwise return to step 1.2 to obtain the optimal solution of the initial population, i.e. the optimal food source.
3. The high-dimensional feature selection method mixing ABC and CRO according to claim 2, characterized in that: in step 1.1 the employed bees generate the SN food sources at random according to formula (1):
X_ij = X_j^min + rand(0,1) × (X_j^max - X_j^min)  (1)
In formula (1), rand(0,1) denotes a uniform random number in (0,1) and N is the dimension of the optimization space;
Employed-bee search phase: an employed bee searches around a food source and finds a candidate new food source according to formula (2):
V_ij = X_ij + φ_ij × (X_ij - X_kj)  (2)
The position of the new food source is updated by the greedy rule of formula (3):
X_ij = V_ij if fit(V_ij) > fit(X_ij), otherwise X_ij is kept unchanged  (3)
In formula (2), φ_ij is a random number in [-1, 1], k ∈ {1, 2, …, SN} and X_ij ≠ X_kj. The new and old food sources are selected through the greedy rule of formula (3), i.e. by comparing their fitness values: if the fitness value of the new food source V_ij is better than that of the old food source X_ij, the position of the old food source X_ij is replaced by that of the new food source V_ij; otherwise the old food source X_ij is unchanged and remains the current food source position, and the employed bee adds 1 to the stagnation count of that food source;
The follow probability of the onlooker bees is given by formula (4):
p_i = fit_i / Σ_{k=1}^{SN} fit_k  (4)
In formula (4), k ∈ {1, 2, …, SN}. In this phase an onlooker bee selects a food source by roulette-wheel selection according to the food-source information obtained from the employed bees; if it chooses to follow an employed bee, it searches for a new food source by the method of the employed-bee search phase.
4. The high-dimensional feature selection method mixing ABC and CRO according to claim 1, characterized in that in step 2, updating the population with the chemical reaction optimization algorithm (CRO) specifically comprises:
reacting the initial population formed by the ABC algorithm through the four basic operators of the CRO algorithm to obtain the updated global optimal solution.
5. The high-dimensional feature selection method mixing ABC and CRO according to claim 4, characterized in that in step 2, updating the population with the CRO algorithm specifically comprises:
Step 2.1: set the initialization parameters: the initial molecular kinetic energy InitialKE; the central energy buffer Buffer; the decomposition threshold α; the synthesis threshold β; the kinetic-energy loss rate KELossRate of an on-wall ineffective collision; and the molecular kinetic energy KE and potential energy PE;
Step 2.2: apply the four CRO operators: on-wall ineffective collision; decomposition, a unimolecular reaction corresponding to the global search process; inter-molecular ineffective collision, a chemical reaction between two molecules that performs a local search to obtain new solutions; and synthesis, an inter-molecular chemical reaction process;
Step 2.3: evaluate the fitness value of each individual with the KNN classifier; if the fitness value of the new individual is greater than the fitness value before the update, replace the pre-update individual with the new individual, otherwise discard the new individual;
Step 2.4: repeat steps 2.1 to 2.3 until all individuals in the population have been updated.
6. The high-dimensional feature selection method mixing ABC and CRO according to claim 5, characterized in that in step 2.3, the function used to evaluate the fitness value of each individual, i.e. of its selected feature subset, with the KNN classifier is formula (5):
fitness = δ × Acc + θ × (N - n)/N  (5)
where Acc = num_c / (num_c + num_i) denotes the classification accuracy of the samples, num_c is the number of correctly classified samples, num_i is the number of misclassified samples, n is the number of features selected by the individual whose fitness is being computed, N is the total number of features of that individual, δ is the weight of classification accuracy, θ is the weight of feature selection, and δ + θ = 1.
7. The high-dimensional feature selection method mixing ABC and CRO according to claim 1, characterized in that step 3 specifically comprises:
selecting the 5 molecules with the smallest PE and storing them in the elite molecule population, updating the elite molecule population after each iteration, and merging the obtained elite molecules into the population for the next iteration;
at this point it should be noted that, owing to the principle of energy conservation before and after a reaction, after the elite population is added, the energy of the elite molecules must be deducted from the energy buffer Buffer:
buffer = buffer - PE(ω_elite) - KE(ω_elite)
where PE(ω_elite) is the potential energy of the elite molecules and KE(ω_elite) is their kinetic energy.
8. The high-dimensional feature selection method mixing ABC and CRO according to claim 7, characterized in that step 4 specifically comprises:
to verify the validity of the results, the samples are effectively verified using 10-fold cross-validation and a KNN classifier: the original sample set is randomly divided into 10 parts, each part is used in turn as the test set while the remaining parts form the training set, the average of the 10 results is computed, and the classification effect is assessed; KNN is a statistical classifier that is particularly effective for screening the feature variables of the data.
CN201910381688.5A 2019-05-08 2019-05-08 A kind of high dimensional feature selection method mixing ABC and CRO Pending CN110097169A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910381688.5A CN110097169A (en) 2019-05-08 2019-05-08 A kind of high dimensional feature selection method mixing ABC and CRO


Publications (1)

Publication Number Publication Date
CN110097169A true CN110097169A (en) 2019-08-06

Family

ID=67447425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910381688.5A Pending CN110097169A (en) 2019-05-08 2019-05-08 A kind of high dimensional feature selection method mixing ABC and CRO

Country Status (1)

Country Link
CN (1) CN110097169A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837884A (en) * 2019-10-30 2020-02-25 河南大学 Efficient mixed feature selection method based on improved binary krill swarm algorithm and information gain algorithm
CN110930772A (en) * 2019-12-05 2020-03-27 中国航空工业集团公司沈阳飞机设计研究所 Multi-aircraft collaborative route planning method
CN110956641A (en) * 2019-11-20 2020-04-03 南京拓控信息科技股份有限公司 Train wheel tread image segmentation method based on chemical reaction optimization
CN112085712A (en) * 2020-08-25 2020-12-15 山东科技大学 Analysis processing method of mammary gland tumor needle aspiration image
CN112908416A (en) * 2021-04-13 2021-06-04 湖北工业大学 Biomedical data feature selection method and device, computing equipment and storage medium
CN113780334A (en) * 2021-07-09 2021-12-10 浙江理工大学 High-dimensional data classification method based on two-stage mixed feature selection
CN113987945A (en) * 2021-11-01 2022-01-28 河北工业大学 Novel degraded product health index selection method
CN116646568A (en) * 2023-06-02 2023-08-25 陕西旭氢时代科技有限公司 Fuel cell stack parameter optimizing method based on meta heuristic
CN117911799A (en) * 2024-03-19 2024-04-19 贵州师范大学 Feature classification method for improving shrimp algorithm based on multiple strategies
CN118331290A (en) * 2024-03-15 2024-07-12 国网甘肃省电力公司陇南供电公司 Multi-machine collaborative inspection path planning method, terminal and readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650917A (en) * 2017-01-03 2017-05-10 华南理工大学 Mechanical arm inverse kinematics solving method based on chaotic and parallelized artificial bee colony algorithm


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, Ge et al., "A high-dimensional feature selection method based on hybrid ABC and CRO", Computer Engineering and Applications *


Similar Documents

Publication Publication Date Title
CN110097169A (en) A kind of high dimensional feature selection method mixing ABC and CRO
Karaboga et al. Fuzzy clustering with artificial bee colony algorithm
Hong et al. Efficient huge-scale feature selection with speciated genetic algorithm
Zeng et al. Accurately clustering single-cell RNA-seq data by capturing structural relations between cells through graph convolutional network
Hassanien et al. Computational intelligence techniques in bioinformatics
Nguyen et al. Learning graph representation via frequent subgraphs
JP2018181290A (en) Filter type feature selection algorithm based on improved information measurement and ga
Yan et al. A hybrid algorithm based on binary chemical reaction optimization and tabu search for feature selection of high-dimensional biomedical data
Du et al. Improving the performance of feature selection and data clustering with novel global search and elite-guided artificial bee colony algorithm
Duan et al. Gradient-based elephant herding optimization for cluster analysis
Yang et al. Feature selection using memetic algorithms
Evans Population-based ensemble learning with tree structures for classification
Bouaguel A new approach for wrapper feature selection using genetic algorithm for big data
Vignolo et al. Evolutionary local improvement on genetic algorithms for feature selection
CN114334168A (en) Feature selection algorithm of particle swarm hybrid optimization combined with collaborative learning strategy
Amaratunga et al. Ensemble classifiers
Alzubaidi et al. A multivariate feature selection framework for high dimensional biomedical data classification
Hengpraprohm et al. A genetic programming ensemble approach to cancer microarray data classification
Anaraki et al. A Fuzzy-Rough Feature Selection Based on Binary Shuffled Frog Leaping Algorithm
Vinmalar¹ et al. Prediction of lung cancer using data mining techniques
CN111414935A (en) Effective mixed feature selection method based on chi-square detection algorithm and improved fruit fly optimization algorithm
Liu et al. SeqMIA: Membership Inference Attacks Against Machine Learning Classifiers Using Sequential Information
Koziarski Imbalanced data preprocessing techniques utilizing local data characteristics
Anand et al. Building an intelligent integrated method of gene selection for facioscapulohumeral muscular dystrophy diagnosis
Masera Multi-target Prediction Methods for Bioinformatics: Approaches for Protein Function Prediction and Candidate Discovery for Gene Regulatory Network Expansion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190806