CN110070916A - A kind of Cancerous disease gene expression characteristics selection method based on historical data - Google Patents
- Publication number
- CN110070916A (application CN201910355711.3A)
- Authority
- CN
- China
- Prior art keywords
- node
- cur
- feature
- population
- scheme
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2111—Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a history-data-based feature selection method for cancer gene expression data, comprising the following steps. A: divide the cancer gene data into a training set and a test set. B: compute the overall average error rate on the training set with all features selected. C: generate an initial population and construct a fitness function. D: record all feature selection schemes in a feature tree, adjust the distribution of the schemes, take the scheme with the smallest fitness value as the optimal feature selection scheme, and return the result to the genetic operator and the guided search operator. E: guide the evolutionary direction of the feature population. F: check the termination condition; if it is not met, repeat steps D~F, and if it is met, output the optimal solution. The invention has the advantages of effectively reducing data dimensionality and improving prediction accuracy; by combining the feature tree with a genetic algorithm, genes associated with cancer and other diseases are screened, providing support for diagnosis and treatment.
Description
Technical field
The present invention relates to the field of disease-causing gene screening, and in particular to a history-data-based feature selection method for cancer gene expression data.
Background art
Cancer, the most common malignant tumor in humans, seriously affects people's physical and mental health and has drawn great attention from experts and scholars in many fields. Cancer patients generate large amounts of clinical data during examination, treatment, and medication, and these data are of great significance for predicting the occurrence and development of malignant tumors. However, they are often high-dimensional, heterogeneous, corrupted, and small-sample in character.
With the rapid development of computer technology, it is natural to process such complex data by computer. By building suitable machine learning prediction models, the occurrence and development of cancer can be predicted, providing technical guidance and advisory opinions for patients' rehabilitation and postoperative care. Because cancer gene data are high-dimensional and redundant and prone to abnormal or missing values, the data usually need to be cleaned before processing to remove irrelevant and redundant entries.
Simple data cleaning, however, cannot effectively solve the high dimensionality of cancer gene data. For high-dimensional clinical data, specific engineering methods and targeted algorithms are needed to reduce dimensionality and improve prediction accuracy. Such methods should be designed with two aspects in mind. First, the model should be practical: the prediction model is embedded in an assessment system, and the doctor enters the patient's condition into the prediction platform; if prediction requires a large amount of sample data, it increases the doctor's burden and loses practical value. Second, prediction accuracy must be considered: an accurate prediction model can assist the doctor's diagnosis and provide suggestions for patient rehabilitation and for reducing the risk of relapse.
Accordingly, a large number of methods have been proposed for cancer gene data that select informative feature data, reduce the dimensionality of the data, and improve the accuracy of the prediction model, mainly including single-factor analysis, recursive feature elimination, and feature importance analysis. However, existing methods often suffer from weak dimensionality reduction, too many redundant features, excessive time complexity, and limited practicality. For the feature selection problem on high-dimensional data, an engineering method must reduce the time complexity of the method while guaranteeing the accuracy of the prediction model.
Summary of the invention
The technical problem to be solved by the present invention is to provide a cancer gene feature selection method that effectively reduces data dimensionality and improves the accuracy of the prediction model.
The present invention solves the above technical problem through the following technical scheme:
A history-data-based feature selection method for cancer gene expression data, comprising the following steps:
Step A: divide the cancer gene data into a training data set and a test data set;
Step B: compute, by five-fold cross-validation on the training data set, the overall average error rate with all features selected;
Step C: randomly generate an initial feature population and construct a fitness function for evaluating feature selection schemes;
Step D: record the feature selection schemes of the feature population in a feature tree one by one, adjust the distribution of the schemes to obtain the adjusted feature population, and take the scheme with the smallest fitness value as the optimal feature selection scheme;
Step E: guide the evolutionary direction of the feature population with the genetic operator and the guided search operator: use the feature population optimized by the feature tree as the parent population and apply the genetic operator to generate the offspring population; use the position of the optimal feature selection scheme in the feature tree as the search direction and apply the guided search operator to strengthen the local search;
Step F: check the termination condition; if it is not met, repeat steps D~F; if it is met, output the optimal feature selection scheme and verify its classification error rate on the test set.
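As a rough illustration of how steps C~F fit together, the following sketch runs a minimal genetic loop over binary selection schemes. Everything here is a simplified stand-in: `mock_error` replaces the cross-validated error rate of step B, the feature tree of step D is omitted, the fitness weighting is an assumed form, and the elitist selection is reduced to a sort; none of the names come from the patent.

```python
import random

random.seed(0)
N, n, MAX_EVAL = 10, 8, 200    # population size, gene count, evaluation budget

def mock_error(x):
    # Stand-in for the cross-validated error rate of a scheme: pretend
    # genes 0-3 are informative and genes 4-7 are noise.
    return (4 - sum(x[:4]) + sum(x[4:])) / 8

def fitness(x, alpha=0.2):
    # Smaller is better: weighted mix of error rate and selected-feature ratio.
    return (1 - alpha) * mock_error(x) + alpha * sum(x) / n

pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(N)]   # step C
best = min(pop, key=fitness)
evals = N
while evals < MAX_EVAL:                                             # steps D-F
    children = []
    for _ in range(N):
        a, b = random.sample(pop, 2)
        c = [a[j] if random.random() < 0.5 else b[j] for j in range(n)]  # uniform crossover
        c = [g ^ (random.random() < 1 / n) for g in c]                   # bitwise mutation
        children.append(c)
        evals += 1
    pop = sorted(pop + children, key=fitness)[:N]    # elitist-style selection
    best = min([best, pop[0]], key=fitness)
```

With the mock error above, the loop tends to keep the four informative genes and drop the noisy ones, which is the qualitative behaviour the method aims for.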
Preferably, the method of computing, in step B, the average classification error rate on the training data set with all features selected by five-fold cross-validation is as follows:
divide the training data set into five parts; take each part in turn as the test fold and the other four parts as the training fold, and obtain the error rate erro_k = CE_k / N, where CE_k is the number of misclassified samples when the k-th part serves as the test fold and N is the total number of samples in that fold; the five error rates so obtained are denoted Erro = {erro_1, erro_2, erro_3, erro_4, erro_5}, and the overall average error rate is computed as their mean, err = (1/5) * (erro_1 + erro_2 + erro_3 + erro_4 + erro_5).
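The five-fold procedure above can be sketched as follows. The majority-class `predict` is a placeholder for whichever classifier is trained with all features selected (the patent does not fix one here), and the labels are synthetic.

```python
import random

random.seed(1)
labels = [random.randint(0, 1) for _ in range(50)]   # synthetic sample labels

def predict(train):
    # Placeholder classifier: predict the majority class of the training fold.
    return int(sum(train) * 2 >= len(train))

folds = [labels[k::5] for k in range(5)]             # five roughly equal parts
erro = []
for k in range(5):
    test = folds[k]                                   # k-th part as test fold
    train = [y for j in range(5) if j != k for y in folds[j]]
    p = predict(train)
    ce_k = sum(1 for y in test if y != p)             # CE_k: misclassified samples
    erro.append(ce_k / len(test))                     # erro_k = CE_k / N
avg_err = sum(erro) / 5                               # overall average error rate
```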
Preferably, the initial feature population is a two-dimensional matrix, expressed as P = {X_1, X_2, ..., X_N}, where N is the population size; X_i = {x_i1, x_i2, ..., x_in} is a binary string representing one feature selection scheme; x_ij = 1 indicates that the j-th gene feature of the i-th scheme is selected, and x_ij = 0 indicates that the corresponding feature is not selected; n is the total number of gene features.
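A minimal sketch of this encoding, with sizes chosen arbitrarily:

```python
import random

random.seed(2)
N, n = 5, 6          # population size and total number of gene features
# P is an N x n binary matrix; row i is scheme X_i, and P[i][j] == 1
# means gene feature j is selected in scheme i.
P = [[random.randint(0, 1) for _ in range(n)] for _ in range(N)]
selected = [j for j, bit in enumerate(P[0]) if bit == 1]   # genes chosen by X_1
```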
Preferably, the fitness function f constructed in step C is a weighted combination of the average error rate of the scheme and the proportion of features it selects: for the current scheme X_i, f depends on N_p, the number of features selected in X_i, on err_{X_i}, the average error rate of X_i, and on α, a variable coefficient that adjusts the weight between the error rate and the number of selected features; α is computed from the preset parameter α_t ∈ [0.15, 0.4], the current evaluation count t, and the maximum function evaluation count MaxEval.
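The exact expression of f is not legible in the source, so the weighted-sum combination below is an interpretation consistent with the description (error rate traded against the fraction of selected features, smaller being better):

```python
def fitness(x, err_x, alpha):
    """Assumed weighted-sum fitness; smaller is better.

    x      -- binary feature selection scheme
    err_x  -- average error rate of the scheme
    alpha  -- weight trading the error rate against the feature count
    """
    n_p = sum(x)                    # N_p: number of selected features
    n = len(x)                      # total number of gene features
    return (1 - alpha) * err_x + alpha * n_p / n

value = fitness([1, 0, 1, 1], err_x=0.25, alpha=0.2)   # 0.8*0.25 + 0.2*0.75
```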
Preferably, the feature tree in step D is a binary tree, and the method of recording all feature selection schemes of the initial population in the feature tree comprises the following steps:
Step i: inspect the data in the feature tree. If the tree is empty, initialize the function evaluation count t = 0 and i = 1, input the setting of MaxEval, empty the optimized population P' used to store the optimized feature selection schemes, initialize the pointer node Cur_node (which denotes the feature selection scheme stored at the tree node currently pointed to) to empty, insert scheme X_i at the root node, set the optimal feature selection scheme X_best = X_i, store X_i into the optimized population P', and go to step ii. If the tree is not empty, go directly to step ii.
Step ii: let i = i + 1, point Cur_node at the root node, and initialize the repeated-access flag flag = 0. If Cur_node is a leaf node, go to step iii. If Cur_node has two child nodes, then if X_i(depth(Cur_node)) = 1 point Cur_node at the left child, and if X_i(depth(Cur_node)) = 0 point it at the right child, where depth(Cur_node) denotes the depth of Cur_node; repeat this step until Cur_node is a leaf node.
Step iii: if the scheme X_i being inserted is identical to Cur_node, set flag = 1, flip the bit of X_i at position depth(Cur_node), and go to step iv. If X_i(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), go to step iv. If X_i(depth(Cur_node)) = Cur_node(depth(Cur_node)), compute the Hamming distance d1 = hamming(Cur_node, X_best) between Cur_node and the current optimal scheme X_best, and the Hamming distance d2 = hamming(X_i, X_best) between the inserted scheme X_i and X_best.
If d1 < d2, flip the bit of Cur_node at position depth(Cur_node) and re-evaluate its fitness; if f(Cur_node) < f(X_best), set X_best = Cur_node, otherwise leave X_best unchanged; update the evaluation count t = t + 1 and go to step iv. If d1 > d2, flip the bit of X_i at position depth(Cur_node) and go to step iv.
Step iv: if X_i(depth(Cur_node)) = 1, insert X_i into the feature tree as the left child of Cur_node and insert Cur_node as its own right child; if X_i(depth(Cur_node)) = 0, insert X_i as the right child of Cur_node and insert Cur_node as its own left child.
Step v: store X_i into the optimized population P'; if f(X_i) < f(X_best), set X_best = X_i, otherwise leave it unchanged, and update t = t + 1 after the evaluation. If t ≥ MaxEval or mod(i/N) = 0, feed the optimized population P', the optimal scheme X_best, and its depth osr in the feature tree back to the genetic operator and the guided search operator respectively, where mod(i/N) denotes the remainder of i divided by N; otherwise, go to step i.
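The routing-and-split behaviour of steps i~iv can be condensed into the sketch below. It keeps only the structural part: bit 1 routes left and bit 0 routes right, and when a leaf is reached the old and new schemes become its two children. The Hamming-distance arbitration against X_best and the fitness bookkeeping of steps iii and v are reduced here to always flipping the incoming scheme's bit, so this illustrates the tree shape, not the full operator.

```python
class Node:
    def __init__(self, scheme):
        self.scheme = scheme        # selection scheme stored at this node
        self.left = None            # child taken when the routing bit is 1
        self.right = None           # child taken when the routing bit is 0

def insert(root, x):
    if root is None:
        return Node(list(x))
    cur, depth = root, 0
    while cur.left is not None:     # walk down to a leaf (bit 1 -> left)
        cur = cur.left if x[depth] == 1 else cur.right
        depth += 1
    if x[depth] == cur.scheme[depth]:
        # duplicate or clashing routing bit: flip the newcomer's bit at
        # the leaf depth (the patent arbitrates via Hamming distances)
        x = list(x)
        x[depth] ^= 1
    new, old = Node(list(x)), Node(cur.scheme)
    if x[depth] == 1:               # new scheme left, old leaf right
        cur.left, cur.right = new, old
    else:
        cur.left, cur.right = old, new
    return root

def leaves(node):
    if node.left is None:
        return [node.scheme]
    return leaves(node.left) + leaves(node.right)

population = [[1, 1, 0, 0, 1, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 0],   # identical to the previous input scheme
              [0, 1, 0, 1, 1, 1]]
root = None
for scheme in population:
    root = insert(root, scheme)     # each insertion adds one new leaf
```

Each insertion turns one leaf into an internal node with two leaves, so after inserting four schemes the tree holds four leaves.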
Preferably, the method of step E is as follows: use the optimized population P' as the parent population of the genetic operator and perform uniform crossover and bitwise mutation to obtain the offspring population P''; select part of the feature selection schemes from P'' with the guided search probability and perform the guided search, replacing the first osr values of each selected scheme with the first osr values of the optimal scheme X_best, to obtain the optimal population P'''; merge the initial population P with the optimal population P''' and apply the elitist environmental selection strategy to select the next-generation population `P. The crossover probability is p_c ∈ [0.5, 0.8], the mutation probability is p_r = 1/n, and the guided search probability is 0.05~0.1.
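The three operators of step E can be sketched independently. The parameter values follow the ranges given above, and `osr` here is just an example depth:

```python
import random

random.seed(4)
n = 8
pc, pr, pg = 0.6, 1 / n, 0.05   # crossover, mutation, guided-search probabilities

def uniform_crossover(a, b):
    # with probability pc, take each bit from either parent with equal chance
    if random.random() > pc:
        return list(a)
    return [a[j] if random.random() < 0.5 else b[j] for j in range(n)]

def bit_mutation(x):
    # flip each bit independently with probability p_r = 1/n
    return [g ^ (random.random() < pr) for g in x]

def guided_search(x, x_best, osr):
    # replace the first osr positions with those of the best scheme
    return list(x_best[:osr]) + list(x[osr:])

x_best = [1, 1, 1, 0, 0, 0, 0, 0]
child = bit_mutation(uniform_crossover([1] * n, [0] * n))
if random.random() < pg:
    child = guided_search(child, x_best, osr=3)
```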
Preferably, if t < MaxEval, set P = `P, empty the optimized population P', and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme X_best and verify its classification error rate on the test set.
The history-data-based cancer gene feature selection method provided by the invention has the following advantages: it effectively reduces data dimensionality and improves prediction accuracy, and by combining the feature tree with a genetic algorithm it screens genes associated with cancer and other diseases, providing support for diagnosis and treatment.
Detailed description of the invention
Fig. 1 is the flow chart of the history-data-based cancer gene feature selection method provided by the embodiment of the present invention;
Fig. 2 is the flow chart of the algorithm for optimizing the population through the feature tree provided by the embodiment of the present invention;
Fig. 3 is the first schematic diagram of the method of optimizing the population through the feature tree provided by the embodiment of the present invention;
Fig. 4 is the second schematic diagram of that method;
Fig. 5 is the third schematic diagram of that method;
Fig. 6 is the fourth schematic diagram of that method;
Fig. 7 is the fifth schematic diagram of that method.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
As shown in Fig. 1, the present embodiment provides a history-data-based feature selection method for cancer gene expression data, comprising the following steps:
Step A: divide the cancer gene data into a training data set and a test data set; specifically, divide the cancer gene data into ten parts, of which seven serve as the training data set and three as the test data set.
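A minimal sketch of this 7:3 split, on synthetic sample indices:

```python
import random

random.seed(5)
samples = list(range(100))          # indices of hypothetical patient samples
random.shuffle(samples)
parts = [samples[k::10] for k in range(10)]       # ten roughly equal parts
train = [s for part in parts[:7] for s in part]   # seven parts for training
test = [s for part in parts[7:] for s in part]    # three parts for testing
```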
Step B: compute, by five-fold cross-validation on the training data set, the overall average error rate with all features selected. Specifically, divide the training data set into five parts; take each part in turn as the test fold and the other four parts as the training fold, and obtain the error rate erro_k = CE_k / N, where CE_k is the number of misclassified samples when the k-th part serves as the test fold and N is the total number of samples in that fold; the five error rates so obtained are expressed as Erro = {erro_1, erro_2, erro_3, erro_4, erro_5}, and the overall average error rate is computed as their mean. These steps follow the prior art used in this field for processing sample data with genetic algorithms and are not discussed in detail here.
Step C: randomly generate the initial feature population, a two-dimensional matrix expressed as P = {X_1, X_2, ..., X_N}, where N is the population size; X_i = {x_i1, x_i2, ..., x_in} is a binary string representing one feature selection scheme; x_ij = 1 indicates that the j-th gene feature of the i-th scheme is selected, and x_ij = 0 indicates that the corresponding feature is not selected; n is the total number of gene features.
Construct the fitness function f for evaluating feature selection schemes as a weighted combination of the average error rate of the scheme and the proportion of features it selects: for the current scheme X_i, f depends on N_p, the number of features selected in X_i, on err_{X_i}, the average error rate of X_i, and on α, a variable coefficient adjusting the weight between the error rate and the number of selected features; α is computed from the preset parameter α_t ∈ [0.15, 0.4], the current evaluation count t, and the maximum function evaluation count MaxEval.
Step D: record the feature selection schemes of the feature population in a feature tree one by one, adjust the distribution of the schemes to obtain the adjusted feature population, and take the scheme with the smallest fitness value as the optimal feature selection scheme. The feature tree is a binary tree, and the method is specifically:
Step i: inspect the data in the feature tree. If the tree is empty, initialize the function evaluation count t = 0 and i = 1, input the setting of MaxEval, empty the optimized population P' used to store the optimized feature selection schemes, initialize the pointer node Cur_node (which denotes the feature selection scheme stored at the tree node currently pointed to) to empty, insert scheme X_i at the root node, set the optimal feature selection scheme X_best = X_i, store X_i into the optimized population P', and go to step ii. If the tree is not empty, go directly to step ii.
Step ii: let i = i + 1, point Cur_node at the root node, and initialize the repeated-access flag flag = 0. If Cur_node is a leaf node, go to step iii. If Cur_node has two child nodes, then if X_i(depth(Cur_node)) = 1 point Cur_node at the left child, and if X_i(depth(Cur_node)) = 0 point it at the right child, where depth(Cur_node) denotes the depth of Cur_node; repeat this step until Cur_node is a leaf node.
Step iii: if the scheme X_i being inserted is identical to Cur_node, set flag = 1, flip the bit of X_i at position depth(Cur_node), and go to step iv. If X_i(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), go to step iv. If X_i(depth(Cur_node)) = Cur_node(depth(Cur_node)), compute the Hamming distance d1 = hamming(Cur_node, X_best) between Cur_node and the current optimal scheme X_best, and the Hamming distance d2 = hamming(X_i, X_best) between the inserted scheme X_i and X_best.
If d1 < d2, flip the bit of Cur_node at position depth(Cur_node) and re-evaluate its fitness; if f(Cur_node) < f(X_best), set X_best = Cur_node, otherwise leave X_best unchanged; update the evaluation count t = t + 1 and go to step iv. If d1 > d2, flip the bit of X_i at position depth(Cur_node) and go to step iv.
Step iv: if X_i(depth(Cur_node)) = 1, insert X_i into the feature tree as the left child of Cur_node and insert Cur_node as its own right child; if X_i(depth(Cur_node)) = 0, insert X_i as the right child of Cur_node and insert Cur_node as its own left child.
Step v: store X_i into the optimized population P'; if f(X_i) < f(X_best), set X_best = X_i, otherwise leave it unchanged, and update t = t + 1 after the evaluation. If t ≥ MaxEval or mod(i/N) = 0, feed the optimized population P', the optimal scheme X_best, and its depth osr in the feature tree back to the genetic operator and the guided search operator respectively, where mod(i/N) denotes the remainder of i divided by N; if t < MaxEval, go to step i.
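The Hamming distance used in step iii to arbitrate between the pointer node and the incoming scheme simply counts differing positions; the bit strings below are illustrative values, not data from the patent:

```python
def hamming(a, b):
    # number of positions at which two equal-length bit strings differ
    return sum(x != y for x, y in zip(a, b))

x_best = [1, 1, 0, 0, 1, 0]
cur_node = [1, 0, 1, 1, 0, 0]
x_i = [1, 1, 0, 0, 1, 1]
d1 = hamming(cur_node, x_best)   # distance of the pointer node to X_best
d2 = hamming(x_i, x_best)        # distance of the inserted scheme to X_best
```

Here d1 = 4 > d2 = 1, so under step iii the inserted scheme's bit would be flipped.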
Step E: guide the evolutionary direction of the feature population with the genetic operator and the guided search operator: use the optimized population P' as the parent population of the genetic operator and perform uniform crossover and bitwise mutation to obtain the offspring population P''; select part of the feature selection schemes from P'' with the guided search probability and perform the guided search, replacing the first osr values of each selected scheme with the first osr values of the optimal scheme X_best, to obtain the optimal population P'''; merge the initial population P with the optimal population P''' and apply the elitist environmental selection strategy to select the next-generation population `P. The crossover probability is p_c ∈ [0.5, 0.8], the mutation probability is p_r = 1/n, and the guided search probability is 0.05~0.1.
Step F: check the termination condition. If t < MaxEval, set P = `P, empty the optimized population P', and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme X_best and verify its classification error rate on the test set.
The adjustment of the arrangement of feature selection schemes in the feature tree is now explained concretely, using the data of Table 1 as the initial population P:
Table 1: initial population
The parameters are configured first. From Table 1, N = 5 and n = 6, so the mutation probability is p_r = 1/6; the guided search probability is 0.05, and we let p_c = 0.5 and α_t = 0.2. The feature tree is empty at the start, so the function evaluation count is initialized to t = 0, with i = 1 and MaxEval = 10. With reference to Fig. 3, X_1 is inserted at the root node, the optimal scheme is set to X_best = X_1, at which point f(X_best) = 5, and X_1 is stored into the empty optimized population P'.
Inserting feature selection scheme X_2:
In step ii, i = 2 and the pointer node Cur_node points at the root node, i.e. Cur_node = X_1, with the repeated-access flag flag = 0. Cur_node is a leaf node, so we go to step iii. The scheme being inserted is X_2, which differs from Cur_node, so flag = 0; here depth(Cur_node) = 1 and X_2(depth(Cur_node)) = 1, which equals Cur_node(depth(Cur_node)) = 1, so the Hamming distances of Cur_node and X_2 to X_best are compared. As shown in Fig. 4, d1 = 0 and d2 = 4; since d1 < d2, the bit of Cur_node at position depth(Cur_node) is flipped, i.e. x_11 = 0. Because the pointer node Cur_node has changed, its fitness must be re-evaluated: f(Cur_node) = 2, and since Cur_node is X_best itself (the modified X_1), f(X_best) = 2 as well and X_best is unchanged; the evaluation count becomes t = 1 and we go to step iv. Since X_2(depth(Cur_node)) = 1, X_2 is inserted into the feature tree as the left child of Cur_node and the modified Cur_node as its right child. X_2 is then stored into the optimized population P'; since f(X_best) = 2 < f(X_2) = 4, X_best is unchanged and the evaluation count becomes t = 2. At this point t < MaxEval and mod(i/N) ≠ 0, so we go to step i and continue the insertion.
Inserting feature selection scheme X_3:
With reference to Fig. 5, the feature tree is not empty, so step ii is executed: i = 3 and the pointer node Cur_node points at the root node, which now has two children. Since X_3(depth(Cur_node)) = x_31 = 1, the pointer moves to the left child, i.e. Cur_node = X_2, which is a leaf node, so step iii is executed. X_3 differs from Cur_node, i.e. flag = 0; here x_32 = 0 and Cur_node(depth(Cur_node)) = x_22 = 0, so the Hamming distances of Cur_node and X_3 to X_best must be computed, giving d1 = 4 > d2 = 2; the bit of X_3 at position depth(Cur_node) is therefore flipped, i.e. x_32 = 1, and step iv is entered: X_3 is inserted as the left child of Cur_node and Cur_node as its right child. X_3 is stored into the population P'; X_best has not changed, so its fitness need not be re-evaluated and f(X_best) = 2; evaluating the fitness of the inserted scheme gives f(X_best) = 2 < f(X_3) = 5, so X_best is unchanged and the evaluation count becomes t = 3. At this point t < MaxEval and mod(i/N) ≠ 0, so we go to step i and continue the insertion.
Inserting feature selection scheme X_4:
With reference to Fig. 6, the feature tree is not empty, so step ii is executed: i = 4 and the pointer node Cur_node points at the root node, which has two children. Since X_4(depth(Cur_node)) = x_41 = 1, the pointer moves to the left child, i.e. Cur_node = X_2, which still has two children, so step ii continues: now X_4(depth(Cur_node)) = x_42 = 0, so the pointer moves to the right child, i.e. Cur_node = X_2, which is a leaf node, and step iii is executed. From Table 1 we know X_4 = Cur_node, i.e. flag = 1, so the bit of X_4 at position depth(Cur_node) is flipped, i.e. x_43 = 1, and we go to step iv: the updated X_4 is inserted as the left child of Cur_node and Cur_node as its right child. X_4 is stored into the population P'; since f(X_best) = 2 > f(X_4) = 1, we set X_best = X_4 and the evaluation count becomes t = 4. At this point t < MaxEval and mod(i/N) ≠ 0, so we go to step i and continue the insertion.
Inserting feature selection scheme X_5:
With reference to Fig. 7, the feature tree is not empty, so step ii is executed: i = 5 and the pointer node Cur_node points at the root node, which has two children. Since X_5(depth(Cur_node)) = x_51 = 0, the pointer moves to the right child, i.e. Cur_node = X_1, which is a leaf node, so step iii is executed. Cur_node differs from X_5, so flag = 0; here Cur_node(depth(Cur_node)) = x_12 = 0 while the scheme being inserted has X_5(depth(Cur_node)) = x_52 = 1, so step iv is executed: X_5 is inserted as the left child of Cur_node and Cur_node as its right child. X_5 is stored into the population P'; since f(X_best) = 1 < f(X_5) = 6, X_best is unchanged and the evaluation count becomes t = 5. At this point mod(i/N) = 0, so the optimized population P', the optimal scheme X_best, and its depth osr in the feature tree are fed back to the genetic operator and the guided search operator respectively. The optimized population P' is processed with the uniform crossover and bitwise mutation operations of the existing genetic algorithm to guide the population evolution, finally yielding the population `P. Since t < MaxEval, we set P = `P, empty the optimized population P', and return to step D to continue the insertion until the termination condition is met.
The present embodiment verifies the validity of the proposed algorithm on data sets from the scikit-feature open-source feature selection library project of Arizona State University. As shown in Table 2, a colon cancer data set Colon, a lung cancer data set Lung, and a glioma data set Glioma were chosen for verification, with the error rate and the number of selected features as evaluation indices: the lower the error rate and the smaller the number of features, the better the algorithm's performance.
Table 2: cancer gene data sets
The classical genetic algorithm GA and the particle swarm optimization algorithm PSO were chosen for simulation comparison with the algorithm provided by this embodiment, which is denoted BSPGA. The results are shown in Table 3, from which it can be seen that the performance of BSPGA is clearly better than that of the GA and PSO algorithms.
Table 3: simulation comparison
The application uses cancer data to illustrate the validity and usage scenarios of the algorithm, but the algorithm is equally effective for other diseases related to genetic variation.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made by those of ordinary skill in the art without departing from the spirit and principles of the present invention shall fall within the protection scope determined by the claims of the present invention.
Claims (7)
1. A history-data-based feature selection method for cancer gene expression data, characterized by comprising the following steps:
Step A: dividing the cancer gene data into a training data set and a test data set;
Step B: computing, by five-fold cross-validation on the training data set, the overall average error rate with all features selected;
Step C: randomly generating an initial feature population and constructing a fitness function for evaluating feature selection schemes;
Step D: recording the feature selection schemes of the feature population in a feature tree one by one, adjusting the distribution of the schemes to obtain the adjusted feature population, and taking the scheme with the smallest fitness value as the optimal feature selection scheme;
Step E: guiding the evolutionary direction of the feature population with the genetic operator and the guided search operator: using the feature population optimized by the feature tree as the parent population and applying the genetic operator to generate the offspring population; using the position of the optimal feature selection scheme in the feature tree as the search direction and applying the guided search operator to strengthen the local search;
Step F: checking the termination condition; if it is not met, repeating steps D~F; if it is met, outputting the optimal feature selection scheme and verifying its classification error rate on the test set.
2. The cancer disease gene feature selection method based on historical data according to claim 1, characterized in that the method of calculating, in step B, the average classification error rate on the training data set when all features are selected, by five-fold cross-validation, is as follows:
divide the training data set into five parts; take one part in turn as the test set and the other four parts as the training set, and obtain the error rate errok = CEk / N, where CEk denotes the number of misclassified samples of the feature selection scheme when the k-th part serves as the test set and N denotes the total number of samples in the test set; the five resulting error rates are denoted Erro = {erro1, erro2, erro3, erro4, erro5}, and the average error rate is calculated as the mean of the five values, (erro1 + erro2 + erro3 + erro4 + erro5) / 5.
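The five-fold procedure of claim 2 can be sketched as below. This is a minimal illustration only: the `classify` callable and the majority-vote stand-in classifier are assumptions, since the claim does not fix a particular classifier.

```python
import random

def five_fold_error_rates(samples, labels, classify, seed=0):
    """Split the data into five parts; each part in turn serves as the
    test fold while the other four form the training fold. Returns the
    five fold error rates and their mean, as in claim 2:
        erro_k = CE_k / N,  mean = (erro_1 + ... + erro_5) / 5."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::5] for k in range(5)]   # five roughly equal parts
    rates = []
    for k in range(5):
        test = folds[k]
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        predict = classify([samples[i] for i in train],
                           [labels[i] for i in train])
        ce = sum(predict(samples[i]) != labels[i] for i in test)  # CE_k
        rates.append(ce / len(test))                              # erro_k
    return rates, sum(rates) / 5.0

# Trivial stand-in classifier: always predicts the majority training label.
def majority_classifier(train_x, train_y):
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority
```

In practice the classifier would be whatever model evaluates a candidate feature subset; only the fold bookkeeping above follows the claim.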
3. The cancer disease gene feature selection method based on historical data according to claim 2, characterized in that the initial feature population is a two-dimensional matrix expressed as P = {X1, X2, ..., XN}, where N is the population size; Xi = {xi1, xi2, ..., xin} is a binary string representing one feature selection scheme; xij = 1 indicates that the j-th feature gene of the i-th feature selection scheme is selected, and xij = 0 indicates that the corresponding feature is not selected; n denotes the total number of feature genes.
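The population encoding of claim 3 amounts to a random N×n binary matrix; a minimal sketch, with the function name chosen here for illustration:

```python
import random

def init_population(N, n, seed=None):
    """Randomly generate the initial feature population P = {X1..XN}.
    Each Xi is a binary string of length n; xij = 1 means the j-th
    feature gene is selected in the i-th scheme (claim 3)."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(N)]
```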
4. The cancer disease gene feature selection method based on historical data according to claim 3, characterized in that the fitness function f constructed in step C combines the classification error rate and the number of selected features, wherein Np denotes the number of selected features in the current feature selection scheme Xi, and the average error rate of the feature selection scheme Xi is used; α is a variable coefficient used to adjust the weights of the error rate and the selected feature count in the function, wherein αt ∈ [0.15, 0.4] is a preset parameter, t denotes the current evaluation count, and MaxEval is the maximum number of function evaluations.
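The precise expressions for f and for α are rendered as images in the source text and are not recoverable here, so the weighted-sum form and the α schedule below are assumptions made purely for illustration, consistent with the stated ingredients (average error rate, selected-feature count Np, preset αt, and the ratio t/MaxEval):

```python
def alpha(t, max_eval, alpha_t=0.25):
    """Variable coefficient. alpha_t in [0.15, 0.4] is the preset
    parameter, t the current evaluation count, max_eval the budget.
    The linear growth with t/max_eval is an assumed schedule."""
    return alpha_t * (1.0 + t / float(max_eval))

def fitness(scheme, mean_error, t, max_eval, alpha_t=0.25):
    """Assumed weighted sum: lower error and fewer selected features
    both lower f (smaller f is better, as in step D of claim 1)."""
    n_selected = sum(scheme)            # Np: number of selected features
    a = alpha(t, max_eval, alpha_t)
    return (1.0 - a) * mean_error + a * n_selected / len(scheme)
```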
5. The cancer disease gene feature selection method based on historical data according to claim 4, characterized in that the feature tree described in step D is a binary tree, and the method of recording all feature selection schemes of the initial population into the feature tree comprises the following steps:
Step i: check the data in the feature tree; if the feature tree is empty, initialize the function evaluation count t = 0 and i = 1, input the preset value of MaxEval, empty the optimized population P' used to store the optimized feature selection schemes, and initialize the pointer node Cur_node to empty, where the pointer node denotes the feature selection scheme stored in the feature-tree node it points to; insert the feature selection scheme Xi into the root node, set the optimal feature selection scheme Xbest = Xi, store Xi into the optimized population P', and go to step ii; if the feature tree is non-empty, go directly to step ii;
Step ii: set i = i + 1, point the pointer node Cur_node at the root node, and initialize the repeated-access flag flag = 0; if Cur_node is a leaf node, go to step iii; if the pointer node Cur_node has two child nodes and the current feature selection scheme satisfies Xi(depth(Cur_node)) = 1, point Cur_node at its left child node; if Xi(depth(Cur_node)) = 0, point Cur_node at its right child node, where depth(Cur_node) denotes the depth of the pointer node Cur_node; repeat the current step until Cur_node is a leaf node;
Step iii: if the feature selection scheme Xi currently being inserted is identical to the pointer node Cur_node, set flag = 1, perform the corresponding update operation, and go to step iv; if Xi(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), go to step iv; if Xi(depth(Cur_node)) = Cur_node(depth(Cur_node)), calculate respectively the Hamming distance d1 = hamming(Cur_node, Xbest) between the pointer node Cur_node and the current optimal feature selection scheme Xbest, and the Hamming distance d2 = hamming(Xi, Xbest) between the inserted feature selection scheme Xi and Xbest;
if d1 < d2, perform the corresponding update operation and judge whether f(Cur_node) < f(Xbest): if so, set Xbest = Cur_node, otherwise leave it unchanged; update the function evaluation count t = t + 1 and go to step iv; if d1 > d2, perform the corresponding update operation and go to step iv;
Step iv: if Xi(depth(Cur_node)) = 1, insert Xi into the feature tree as the left child node of Cur_node and insert Cur_node into the feature tree as the right child node of Cur_node; if Xi(depth(Cur_node)) = 0, insert Xi into the feature tree as the right child node of Cur_node and insert Cur_node into the feature tree as the left child node of Cur_node;
Step v: store Xi into the optimized population P'; if f(Xi) < f(Xbest), set Xbest = Xi, otherwise leave it unchanged; after this operation, update the function evaluation count by t = t + 1; if t ≥ MaxEval or mod(i, N) = 0, feed the optimized population P', the optimal feature selection scheme Xbest, and its depth osr in the feature tree back to the genetic operator and the guided-search operator respectively, where mod(i, N) denotes the remainder of i divided by N; otherwise, go to step i.
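Steps i–iv describe a binary tree in which the inserted scheme's bit at the current node's depth selects the left branch (bit 1) or the right branch (bit 0). The sketch below is a simplified illustration of that routing and of the step-iv split; the flag/Hamming-distance exchange of step iii is omitted because its update formulas appear only as images in the source.

```python
class FeatureTreeNode:
    """One node of the feature tree: stores one feature selection
    scheme; children are chosen by the bit at this node's depth."""
    def __init__(self, scheme, depth=0):
        self.scheme = scheme
        self.depth = depth
        self.left = None    # followed when the routed bit is 1
        self.right = None   # followed when the routed bit is 0

def insert(root, scheme):
    """Insert a scheme: descend while the current node has children,
    branching on scheme[depth]; at a leaf, push the leaf's old scheme
    down next to the new one (step iv). Duplicates are left in place
    (step iii's exchange operation is omitted in this sketch)."""
    if root is None:
        return FeatureTreeNode(scheme)
    cur = root
    while cur.left is not None and cur.right is not None:
        cur = cur.left if scheme[cur.depth] == 1 else cur.right
    if scheme == cur.scheme:
        return root
    d = cur.depth
    if scheme[d] == 1:
        cur.left = FeatureTreeNode(scheme, d + 1)
        cur.right = FeatureTreeNode(cur.scheme, d + 1)
    else:
        cur.right = FeatureTreeNode(scheme, d + 1)
        cur.left = FeatureTreeNode(cur.scheme, d + 1)
    return root
```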
6. The cancer disease gene feature selection method based on historical data according to claim 5, characterized in that the method described in step E is as follows: take the optimized population P' as the parent population of the genetic operator, and perform the uniform crossover operation and the standard mutation operation to obtain the offspring population P''; select part of the feature selection schemes from the offspring population P'' according to the guided-search probability and perform guided search on them: replace the first osr values of each selected feature selection scheme with the first osr values of the optimal feature selection scheme Xbest, obtaining the optimal population P'''; merge the initial population P with the optimal population P''' and, using the elite environmental selection strategy, select the next-generation population `P; wherein the crossover probability pc ∈ [0.5, 0.8], the mutation probability pr = 1/n, and the guided-search probability is 0.05 to 0.1.
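The operators of claim 6 (uniform crossover, standard bit-flip mutation with pr = 1/n, and the guided search that copies the first osr bits of Xbest) can be sketched as follows. Names and defaults are illustrative, osr is assumed given, and the elite environmental selection step is omitted:

```python
import random

def uniform_crossover(a, b, pc=0.6, rng=random):
    """Uniform crossover: with probability pc (pc in [0.5, 0.8]),
    each gene position is swapped between the parents with equal
    chance; otherwise the parents are copied unchanged."""
    if rng.random() >= pc:
        return a[:], b[:]
    c1, c2 = a[:], b[:]
    for j in range(len(a)):
        if rng.random() < 0.5:
            c1[j], c2[j] = c2[j], c1[j]
    return c1, c2

def mutate(x, rng=random):
    """Standard bit-flip mutation with probability pr = 1/n per gene."""
    pr = 1.0 / len(x)
    return [bit ^ 1 if rng.random() < pr else bit for bit in x]

def guided_search(x, x_best, osr):
    """Guided search of step E: replace the first osr values of a
    selected scheme with the first osr values of Xbest."""
    return x_best[:osr] + x[osr:]
```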
7. The cancer disease gene feature selection method based on historical data according to claim 6, characterized in that: if t < MaxEval, set P = `P, empty the optimized population P', and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme Xbest and verify the classification error rate of the optimal feature selection scheme Xbest on the test set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910355711.3A CN110070916B (en) | 2019-04-29 | 2019-04-29 | Historical data-based cancer disease gene characteristic selection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110070916A true CN110070916A (en) | 2019-07-30 |
CN110070916B CN110070916B (en) | 2023-04-18 |
Family
ID=67369518
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910355711.3A Active CN110070916B (en) | 2019-04-29 | 2019-04-29 | Historical data-based cancer disease gene characteristic selection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110070916B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009067655A2 (en) * | 2007-11-21 | 2009-05-28 | University Of Florida Research Foundation, Inc. | Methods of feature selection through local learning; breast and prostate cancer prognostic markers |
CN109242100A (en) * | 2018-09-07 | 2019-01-18 | 浙江财经大学 | A kind of Niche Genetic method on multiple populations for feature selecting |
Non-Patent Citations (1)
Title |
---|
范方云; 孙俊: "Cancer feature gene selection and classification based on the BQPSO algorithm" (基于BQPSO算法的癌症特征基因选择与分类) * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111243662A (en) * | 2020-01-15 | 2020-06-05 | 云南大学 | Pan-cancer gene pathway prediction method, system and storage medium based on improved XGboost |
CN111243662B (en) * | 2020-01-15 | 2023-04-21 | 云南大学 | Method, system and storage medium for predicting genetic pathway of pan-cancer based on improved XGBoost |
CN112580606A (en) * | 2020-12-31 | 2021-03-30 | 安徽大学 | Large-scale human body behavior identification method based on clustering grouping |
CN112580606B (en) * | 2020-12-31 | 2022-11-08 | 安徽大学 | Large-scale human body behavior identification method based on clustering grouping |
Also Published As
Publication number | Publication date |
---|---|
CN110070916B (en) | 2023-04-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||