CN110070916B

CN110070916B - Historical data-based cancer disease gene characteristic selection method

Info

Publication number: CN110070916B
Application number: CN201910355711.3A
Authority: CN
Inventors: 邱剑锋; 郭能; 张兴义; 苏延森
Original assignee: Anhui University
Current assignee: Anhui University
Priority date: 2019-04-29
Filing date: 2019-04-29
Publication date: 2023-04-18
Anticipated expiration: 2039-04-29
Also published as: CN110070916A

Abstract

The invention discloses a cancer disease gene characteristic selection method based on historical data, which comprises the following steps: a: dividing the cancer disease gene data into a training set and a testing set; b: calculating the total average error rate after all the characteristics on the training set are selected; c: generating an initial population and constructing a fitness function; d: recording all the feature selection schemes into a feature tree, adjusting the distribution of the feature selection schemes, taking the feature selection scheme with the minimum fitness value as an optimal feature selection scheme, and returning results to a genetic operator and a guide search operator; e: guiding the evolution direction of the characteristic population; f: and D, judging a termination condition, if the termination condition is not met, repeating the steps D-F, and if the termination condition is met, outputting an optimal solution. The invention has the advantages that: the data dimensionality can be effectively reduced, the prediction accuracy is improved, and related genes of diseases such as cancer and the like are screened by combining a genetic algorithm through a feature tree, so that assistance is provided for diagnosis and treatment.

Description

Historical data-based cancer disease gene characteristic selection method

Technical Field

The invention relates to the technical field of disease pathogenic gene screening, in particular to a cancer disease gene characteristic selection method based on historical data.

Background

Cancer, the most common malignant tumor in human, seriously affects physical and mental health of people, and arouses high attention of experts and scholars in different fields. Patients with cancer develop a large amount of clinical data during the examination, treatment and administration. These data are very important for predicting the occurrence of malignant tumor and the development of disease. But these data tend to have the characteristics of high dimensionality, diversity, data confusion, and small samples.

With the rapid development of computer technology, it is thought to process these complex data by using a computer. By establishing a relevant machine learning prediction model, the occurrence and development of cancer are predicted, and relevant technical guidance and reference opinions are provided for the rehabilitation and postoperative medication of patients. Cancer disease gene data is characterized by high dimension, redundancy, abnormal deletion of data and the like, so before processing the data, cleaning work is usually required to be carried out on the data to remove irrelevant and redundant data.

Simple cleaning of data does not effectively solve the problem of high dimension of cancer disease gene data. Therefore, specific engineering methods and specific algorithms need to be developed for high-dimensional clinical data to achieve the purpose of reducing the dimension and improving the accuracy of the prediction model. The relevant methods should be designed and developed from the following two aspects, firstly, the model should have practicability, the prediction model needs a doctor to input the condition of a patient into a prediction platform when being embedded into an evaluation system, and if a large amount of sample data is needed for prediction, the burden of the doctor is increased, and the practical effect is not achieved. Secondly, the accuracy of model prediction needs to be considered, and an accurate prediction model can assist the diagnosis of doctors and provide suggestions for rehabilitation treatment of patients and reduction of recurrence risk of diseases.

Therefore, for the gene data of cancer diseases, a large number of methods are proposed to select data with important characteristics, reduce the dimensionality of the data and improve the accuracy of a prediction model. These methods mainly include a single factor analysis method, a recursive feature elimination method, a feature importance analysis method, and the like. However, the existing method often has the problems of unobvious dimensionality reduction effect, excessive redundant features, overhigh time complexity, low practicability and the like. Aiming at the problem of feature selection of high-dimensional data, the developed engineering method needs to reduce the time complexity of the method on the premise of ensuring the accuracy of a prediction model.

Disclosure of Invention

The invention aims to provide a cancer gene characteristic selection method which can effectively reduce data dimensionality and improve the accuracy of a prediction model.

The invention solves the technical problems through the following technical scheme:

a method for selecting cancer disease gene characteristics based on historical data comprising the steps of:

step A: dividing the cancer disease gene data into a training data set and a testing data set;

and B: calculating the total average error rate after all the characteristics on the training data set are selected by utilizing the five-fold intersection;

step C: randomly generating an initialization feature population, and constructing a fitness function for evaluating the feature selection scheme;

step D: sequentially recording the feature selection schemes in the feature population into the feature tree, adjusting the distribution of the feature selection schemes to obtain the adjusted feature population, and taking the feature selection scheme with the minimum fitness value as an optimal feature selection scheme;

step E: guiding the evolution direction of the characteristic population according to the genetic operator and the guide search operator; taking the feature population optimized by the feature tree as a parent population, and generating an offspring population by using a genetic operator; taking the position information of the feature tree where the optimal feature selection scheme is located as a search direction, and utilizing a guide search operator to strengthen local search;

step F: and D, judging a termination condition, if the termination condition is not reached, repeating the steps D-F, if the termination condition is reached, outputting an optimal feature selection scheme, and verifying the classification error rate of the optimal feature selection scheme on the test set.

Preferably, the method for calculating the average classification error rate after all the features on the training data set are selected by using the five-fold intersection method in step B is as follows:

equally dividing the training data set into five parts, sequentially selecting one part as a test set and the other four parts as the training set to obtain the error rate

Wherein CE _k Representing the number of samples of classification errors in the feature selection scheme when the kth sample is taken as a test set, and N represents the total number of samples of the test set; five error rates are thus obtained, respectively denoted as Erro = { Erro = ₁ ,erro ₂ ,erro ₃ ,erro ₄ ,erro ₅ }, and calculates an average error rate @>

Preferably, the initialization feature population is a two-dimensional matrix, denoted as P = { X = { ₁ ，X ₂ ,…,X _N N is the population size; x _i ＝{x _i1 ,x _i2 ,…,x _in The symbol is a binary string, which represents a feature selection scheme; when x is _ij When =1, the j-th signature gene representing the i-th signature selection scheme is selected when x _ij If =0, it means that the corresponding feature is not selected; n represents the total number of characteristic genes.

Preferably, the fitness function f constructed in step C is:

wherein, N _p Representing the current feature selection scheme X _i Of the number of features selected in the image data,

selecting scheme X for a feature _i Average error rate of; alpha is a variable coefficient and is used for adjusting the error rate and the weight of the selected feature number in the function;

wherein alpha is _t ∈[0.15,0.4]And t is a preset parameter and represents the current evaluation times, and MaxEval is the maximum function evaluation times.

Preferably, the feature tree in step D is a binary tree, and the method for recording all feature selection schemes in the initial population into the feature tree includes the following steps;

step i: detecting data in the feature tree, if the feature tree is empty, initializing function evaluation times t =0, i =1, inputting a setting value of MaxEval, emptying an optimized population P' for storing an optimized feature selection scheme, initializing a pointer node Cur _ node to be empty, representing the feature selection scheme stored by the feature tree node pointed by the pointer node, and selecting a feature selection scheme X _i Inserting root node, optimal feature selection scheme X _best ＝X _i Selecting the feature to be scheme X _i Storing the data into the optimized population P', and jumping to the stepIi; if the feature tree is not empty, directly jumping to the step ii;

step ii: let i = i +1, the pointer node Cur _ node points to the root node, and the initialization repeat access flag =0; if the Cur _ node is a leaf node, skipping to the step iii; if the pointer node Cur _ node has two child nodes, and the current feature selection scheme X _i X of (a) _{i(depth(Cur_node))} ＝X _i (depth (Cur _ node)) =1, then pointer node Cur _ node points to the left child node if X _i (depth (Cur _ node)) =0, then Cur _ node points to the right child node; where depth (Cur _ node) represents the depth of the pointer node Cur _ node; repeating the current step until the pointer node Cur _ node is a leaf node;

step iii: selecting scheme X if current feature to be inserted _i The same as the pointer node Cur _ node, flag =1, and order

Skipping to step iv; judgment of X _i (depth (Cur _ node)) ≠ Cur _ node), jumping to step iv; when X is present _i (depth (Cur _ node)) = Cur _ node (depth (Cur _ node)), the pointer node Cur _ node and the current optimal feature selection scheme X are calculated separately _best Hamming distance d of ₁ ＝hamming(Cur_node,X _best ) And to-be-inserted feature selection scheme X _i And X _best Hamming distance d of ₂ ＝hamming(X _i ,X _best )；

If d is ₁ <d ₂ Let us order

Judgment f (Cur _ node)<f(X _best ) Then X _best If not, updating the function evaluation time t = t +1, and skipping to the step iv; if d is ₁ >d ₂ Then make it asserted>

Skipping to step iv;

step iv: if X _i (depth (Cur _ node)) =1, then X will be added _i Inserting a left child node serving as a Cur _ node into the feature tree, and inserting the Cur _ node serving as a right child node of the Cur _ node into the feature tree; if X _i (depth (Cur _ node)) =0, then X will be added _i Inserting a right child node serving as a Cur _ node into the feature tree, and inserting the Cur _ node serving as a left child node of the Cur _ node into the feature tree;

step v: mixing X _i Storing into optimized population P', if f (X) _i )<f(X _best ) Let X _best ＝X _i Otherwise, the evaluation times are unchanged, and the function is updated through t = t +1 after operation; if t is more than or equal to MaxEval or mod (i/N) =0, optimizing the population P' and the optimal feature selection scheme X _best And its depth osr in the feature tree are fed back to the genetic operator and the guided search operator, respectively, where mod (i/N) represents the i/N remainder; otherwise, jumping to the step i.

Preferably, the method in step E is: taking the optimized population P ' as a parent population of a genetic operator, executing uniform crossing operation and standard mutation operation to obtain a sub-population P ', selecting a part of feature selection schemes from the sub-population P ' through guide search probability to execute guide search; and replacing the first osr bits of the selected partial feature selection scheme with the optimal feature selection scheme X _best The value of the first osr bit to obtain the optimal population P'; combining the initial population P with the optimal population P 'and selecting the population' P entering the next generation by adopting an elite environment selection strategy; wherein the cross probability p _c ∈[0.5,0.8]Probability of variation p _r And the guide search probability is 0.05-0.1.

Preferably, if t<Enabling P = 'P by MaxEval, emptying the optimized population P', and returning to the step D; if t is more than or equal to MaxEval, outputting an optimal feature selection scheme X _best And verifying the optimal feature selection scheme X on the test set _best The classification error rate of (2).

The cancer gene characteristic selection method based on historical data has the advantages that: the data dimensionality can be effectively reduced, the prediction accuracy is improved, and related genes of diseases such as cancer and the like are screened by combining a characteristic tree and a genetic algorithm, so that assistance is provided for diagnosis and treatment.

Drawings

FIG. 1 is a flow chart of a method for selecting a cancer disease gene signature based on historical data as provided by an embodiment of the invention;

FIG. 2 is a flow chart of an algorithm for optimizing populations through a feature tree according to an embodiment of the present invention;

FIG. 3 is a first schematic diagram of a method for optimizing a population through a feature tree according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a second method for optimizing a population through a feature tree according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a method for optimizing a population through a feature tree according to an embodiment of the present invention;

fig. 6 is a schematic diagram of a method for optimizing a population through a feature tree according to an embodiment of the present invention.

Detailed Description

In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.

As shown in fig. 1, the present embodiment provides a method for selecting cancer disease gene characteristics based on historical data, comprising the steps of:

step A: dividing the cancer disease gene data into a training data set and a testing data set; the cancer disease gene data are divided into ten parts, 7 parts are used as a training data set, and 3 parts are used as a testing data set.

And B: and calculating the total average error rate after all the features on the training data set are selected by utilizing the five-fold cross, wherein the total average error rate specifically comprises the following steps: equally dividing the training data set into five parts, sequentially selecting one part as a test set and the other four parts as the training set to obtain the error rate

Wherein CE _k The number of samples in the classification error in the feature selection scheme when the kth copy is used as the test set is shown,n represents the total number of samples in the test set; five error rates were obtained by calculation, respectively expressed as Erro = { Erro = } ₁ ,erro ₂ ,erro ₃ ,erro ₄ ,erro ₅ And is calculated to get an average error rate->

The above steps are all the prior art in the field of processing sample data by using genetic algorithm, and are not described in detail here.

Step C: randomly generating an initialization feature population, wherein the initialization feature population is a two-dimensional matrix and is represented by P = { X = ₁ ，X ₂ ,…,X _N N is the population size; x _i ＝{x _i1 ,x _i2 ,…,x _in Is a binary string, representing a feature selection scheme; when x is _ij When =1, the j-th signature gene representing the i-th signature selection scheme is selected when x _ij =0, this means that the corresponding feature is not selected; n represents the total number of characteristic genes;

constructing a fitness function f for evaluating the feature selection scheme:

wherein, N _p Indicating the current feature selection scheme X _i Of the number of features selected in (1),

selecting scheme X for a feature _i Average error rate of (2); alpha is a variable coefficient and is used for adjusting the error rate and the weight of the selected feature number in the function;

Step D: sequentially recording the feature selection schemes in the feature population into the feature tree, adjusting the distribution of the feature selection schemes to obtain the adjusted feature population, and taking the feature selection scheme with the minimum fitness value as an optimal feature selection scheme; the feature tree is a binary tree, and the method specifically comprises the following steps:

step i: detecting data in the feature tree, if the feature tree is empty, initializing function evaluation times t =0, i =1, inputting a setting value of MaxEval, emptying an optimized population P' for storing an optimized feature selection scheme, initializing a pointer node Cur _ node to be empty, representing the feature selection scheme stored by the feature tree node pointed by the pointer node, and selecting a feature selection scheme X _i Inserting root node, optimal feature selection scheme X _best ＝X _i Selecting the feature to be scheme X _i Storing the optimized population P' and skipping to the step ii; if the feature tree is not empty, directly jumping to the step ii;

step ii: let i = i +1, the pointer node Cur _ node points to the root node, and the initial repeat access flag =0; if the Cur _ node is a leaf node, skipping to the step iii; if the pointer node Cur _ node has two child nodes, and the current feature selection scheme X _i X of _{i(depth(Cur_node))} ＝X _i (depth (Cur _ node)) =1, then pointer node Cur _ node points to the left child node if X _i (depth (Cur _ node)) =0, then Cur _ node points to the right child node; where depth (Cur _ node) represents the depth of the pointer node Cur _ node; repeating the current step until the pointer node Cur _ node is a leaf node;

step iii: selecting scheme X if current to-be-inserted feature _i The same as the pointer node Cur _ node, flag =1, and order

Skipping to step iv; judgment of X _i (depth (Cur _ node)) ≠ Cur _ node), jumping to step iv; when X is present _i (depth (Cur _ node)) = Cur _ node (depth (Cur _ node)), the pointer node Cur _ node and the current optimal feature selection scheme X are calculated separately _best Hamming distance d of ₁ ＝hamming(Cur_node,X _best ) And to-be-inserted feature selection scheme X _i And X _best Sea ofClear distance d ₂ ＝hamming(X _i ,X _best )；

If d is ₁ <d ₂ Let us order

Skipping to step iv;

step iv: if X _i (depth (Cur _ node)) =1, X will be added _i Inserting a left child node serving as a Cur _ node into the feature tree, and inserting the Cur _ node serving as a right child node of the Cur _ node into the feature tree; if X _i (depth (Cur _ node)) =0, X will be _i Inserting a right child node serving as a Cur _ node into the feature tree, and inserting the Cur _ node serving as a left child node of the Cur _ node into the feature tree;

step v: mixing X _i Storing into optimized population P', if f (X) _i )<f(X _best ) Let X _best ＝X _i Otherwise, the evaluation times are unchanged, and the function is updated through t = t +1 after operation; if t is more than or equal to MaxEval or mod (i/N) =0, optimizing the population P' and the optimal feature selection scheme X _best And its depth osr in the feature tree are fed back to the genetic operator and the guided search operator, respectively, where mod (i/N) represents the i/N remainder; if t is<And MaxEval, jumping to the step i.

And E, step E: guiding the evolution direction of the characteristic population according to a genetic operator and a guide search operator: taking the optimized population P ' as a parent population of a genetic operator, executing uniform crossing operation and standard mutation operation to obtain a sub-population P ', selecting a part of feature selection schemes from the sub-population P ' through guide search probability to execute guide search; and the first osr bits of the selected partial feature selection scheme are replaced to be the most significantPreferred feature selection scheme X _best The value of the first osr bit to obtain the optimal population P'; combining the initial population P with the optimal population P 'and selecting the population' P entering the next generation by adopting an elite environment selection strategy; wherein the cross probability p _c ∈[0.5,0.8]Probability of variation p _r And the guide search probability is 0.05-0.1 according to the ratio of = 1/n.

Step F: judging the termination condition if t<Enabling P = 'P by MaxEval, emptying the optimized population P', and returning to the step D; if t is more than or equal to MaxEval, outputting an optimal feature selection scheme X _best And verifying the optimal feature selection scheme X on the test set _best The classification error rate of (2).

The data in table 1 is used as the initial population P to specifically explain the arrangement of the adjustment feature selection scheme on the feature tree:

table 1: initial population

First, parameters are set, and as can be seen from table 1, when N =5, the mutation probability p is _r =1/6, guide search probability 0.05, let p _c ＝0.5、α _t =0.2; selecting an empty feature tree, initializing the function evaluation times t =0, i =1, and maxeval =10, referring to fig. 3, and calculating X _i Inserting a root node of a null feature tree to make the optimal feature selection scheme X _best ＝X _i At this time, f (X) _best ) =5, mixing X _i And storing the obtained product into the empty optimization population P'.

Insertion feature selection scheme X ₂ ：

In step ii, i =2, the pointer node Cur _ node points to the root node, i.e. Cur _ node = X ₁ The repeat access flag =0; at this time, the Cur _ node is a leaf node, the step iii is skipped, and the current feature selection scheme to be inserted is X ₂ Unlike Cur _ node, flag =0, in this case depth (Cur _ node) =1, and X is known ₂ (depth (Cur _ node)) =1, and when the pointer node Cur _ node (depth (Cur _ node)) =1 is the same, it is determined that Cur _ node and X are each determined ₂ And X _best Hamming distance of (d) as shown in FIG. 4 ₁ ＝0，d ₂ =4; thus, d ₁ <d ₂ Execute by

Fast order x ₁₁ =0, the fitness needs to be re-evaluated because the pointer node Cur _ node changes, and f (X) is performed at this time _{Cur_node} )＝f(X _best )＝2，X _best Updating the function evaluation times t =1 without changing, and skipping to the step iv; at this time X ₂ (depth (Cur _ node)) =1, so will X ₂ Inserting the left child node as the pointer node Cur _ node and the modified pointer node Cur _ node as the right child node of the pointer node Cur _ node into the feature tree, and inserting X into the feature tree ₂ Store into optimized population P', at which time f (X) _best )＝2<f(X ₂ )＝4，X _best Updating the evaluation times of the function to t =2 without changing the evaluation times, wherein t is the time of the evaluation times<MaxEval with mod (i/N) ≠ 0, jumping to step i to continue the insert operation.

Insertion feature selection scheme X ₃ ：

Referring to fig. 5, when the feature tree is not empty, step ii is performed, i =3, the pointer node Cur _ node points to the root node, when there are two child nodes, and X of the current feature selection scheme is calculated ₃ (depth(Cur_node))＝x ₃₁ =1, pointer node points to left child node, i.e. Cur _ node = X ₂ When the pointer node Cur _ node is a leaf node, executing the step iii; since X ₃ Unlike Cur _ node, i.e., flag =0, in which case x ₃₂ =0, and depth (Cur _ node) of the pointer node = x ₂₂ =0, therefore the pointer nodes Cur _ node and X need to be calculated ₃ Are each independently of X _best Obtaining d from the Hamming distance of ₁ ＝4>d ₂ =2, order

I.e. x ₃₂ =1, go to step iv, add X ₃ Inserting the Cur _ node serving as a right child node of the Cur _ node into the feature tree; mixing X ₃ Storing into population PAt this time, X _best No change, no need to re-evaluate fitness, so f (X) _best ) =2, for insertion X _i To obtain f (X) _best )＝2<f(X ₃ ) =5, therefore X _best And unchanged, updating the evaluation times of the function to t =3, wherein t is the time<MaxEval and mod (i/N) ≠ 0, jumping to step i to continue the insert operation.

Insertion feature selection scheme X ₄ ：

Referring to fig. 6, when the feature tree is not empty, step ii is performed, i =4, the pointer node Cur _ node points to the root node, when there are two child nodes, and X of the current feature selection scheme is calculated ₄ (depth(Cur_node))＝x ₄₁ =1, pointer node points to left child node, i.e. Cur _ node = X ₂ At this time, the pointer node Cur _ node still has two child nodes, and step ii continues to be executed, at this time, X ₄ (depth(Cur_node))＝x ₄₂ =0, pointer node points to the right child node, i.e. Cur _ node = X ₂ The pointer node Cur _ node is a leaf node, and step iii is executed; from Table 1, it can be seen that X is present at this time ₄ = Cur _ node, i.e. flag =1, let

I.e. x ₄₃ =1, jump to step iv, updated X ₄ As a left child node of the Cur _ node, inserting the Cur _ node as a right child node of the Cur _ node into the feature tree; mixing X ₄ Store into population P', at which time f (X) _best )＝2>f(X ₄ ) =1, let X _best ＝X ₄ Updating the function evaluation times to t =3, at which time t<MaxEval with mod (i/N) ≠ 0, jumping to step i to continue the insert operation.

Insertion feature selection scheme X ₅ ：

Referring to fig. 7, when the feature tree is not empty, step ii is performed, i =5, the pointer node Cur _ node is pointed to the root node, there are two child nodes, and X of the current feature selection scheme is calculated ₅ (depth(Cur_node))＝x ₅₁ =0, pointer node points to the right child node, i.e. Cur _ node = X ₁ At this time, the pointer node Cur _ nodeStep iii is performed for leaf nodes because Cur _ node and X ₅ In contrast, flag =0, at this time, cur _ node (depth) of the pointer node = x ₁₂ =0, and X to be inserted into the eigen scheme ₅ (depth(Cur_node))＝x ₅₂ =1, perform step iv, select feature to insert scheme X ₅ Inserting the Cur _ node serving as a right child node of the Cur _ node into the feature tree; mixing X ₄ Store into population P', at which time f (X) _best )＝1<f(X ₅ )＝6，X _best And (4) updating the evaluation times of the function to t =4, wherein mod (i/N) =0, optimizing the population P', and selecting the optimal feature selection scheme X _best And the depth osr in the feature tree are respectively fed back to the genetic operator and the guide search operator, the population P 'is optimized by using the uniform crossing operation and the standard mutation operation in the existing genetic algorithm, the population evolution is guided to finally obtain the population' P, and at the moment, t<And MaxEval, enabling P = 'P, emptying the optimized population P', returning to the step D, and continuing to execute the insertion operation until a termination condition is met.

This example verifies the validity of the algorithm provided by this example by using the data set in the scimit-feature source feature selection library project at the university of arizona state. As shown in table 2, a Colon cancer data set Colon, a Lung cancer data set Lung, and a Glioma data set Glioma are selected for verification, and the error rate and the feature quantity are selected as evaluation indexes, specifically, the lower the error rate is, the smaller the feature quantity is, and the better the algorithm performance is.

Data set	Number of features	Number of samples	Number of classification
				Colon	2000	62	2
Lung	3312	203	5
				Glioma	4434	50	4

Table 2: cancer disease gene data

The classical genetic algorithm GA and the particle swarm optimization algorithm PSO are selected to be compared with the algorithm provided by the embodiment for a simulation experiment, the result is shown in table 3, the algorithm provided by the embodiment is defined as BSPGA, and the experimental result shows that the performance of the algorithm provided by the embodiment is obviously superior to that of the GA algorithm and the PSO algorithm.

Table 3: comparison of simulation experiments

The effectiveness and use scenario of the algorithm is described in the application starting with cancer data, but the algorithm is also effective for other diseases related to genetic variation.

The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only examples of the present invention, and are not intended to limit the present invention, and any modifications, equivalent substitutions, improvements and the like made by those skilled in the art without departing from the spirit and principles of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims

1. A method for selecting a gene signature for a cancer disease based on historical data, comprising the steps of:

and B: calculating the total average error rate of all selected cancer disease gene characteristics on the training data set by utilizing five-fold intersection;

and C: randomly generating an initialized cancer disease gene characteristic population, and constructing a fitness function for evaluating a cancer disease gene characteristic selection scheme;

step D: sequentially recording the characteristic selection schemes in the cancer disease gene characteristic population into a cancer disease gene characteristic tree, adjusting the distribution of the characteristic selection schemes to obtain an adjusted cancer disease gene characteristic population, and taking the characteristic selection scheme with the minimum fitness value as an optimal characteristic selection scheme;

step E: guiding the evolution direction of the cancer disease gene characteristic population according to the genetic operator and the oriented search operator; taking the cancer disease gene characteristic population optimized by the cancer disease gene characteristic tree as a parent population, and generating an offspring population by using a genetic operator; taking the position information of the feature tree where the optimal cancer disease gene feature selection scheme is as a search direction, and reinforcing local search by utilizing a guide search operator;

step F: judging a termination condition, if the termination condition is not met, repeating the steps D-F, if the termination condition is met, outputting an optimal feature selection scheme, and verifying the classification error rate of the optimal feature selection scheme on a test set;

in step D, said cancer disease gene signature tree is a binary tree, and said method of recording all cancer disease gene signature selection schemes in the initial cancer disease gene signature population into the signature tree comprises the steps of:

step i: detecting data in the feature tree, if the feature tree is empty, initializing function evaluation times t =0, i =1, inputting a setting value of MaxEval, emptying an optimized population P 'for storing an optimized feature selection scheme, initializing a pointer node Cur _ node to be empty, indicating the feature selection scheme stored by the feature tree node pointed by the pointer node, inserting the feature selection scheme Xi into a root node, storing the feature selection scheme Xi into the optimized population P', and jumping to the step ii; if the feature tree is not empty, directly jumping to the step ii;

step ii: let i = i +1, pointer node Cur _ node point to root node, and initialize repeat access flag =0; if the Cur _ node is a leaf node, skipping to the step iii; if the pointer node Cur _ node has two child nodes, and Xi (depth (Cur _ node)) =1 for the current feature selection scheme Xi, then the pointer node Cur _ node points to the left child node, and if Xi (depth (Cur _ node)) =0, then Cur _ node points to the right child node; where depth (Cur _ node) represents the depth of the pointer node Cur _ node; repeating the current step until the pointer node Cur _ node is a leaf node;

step iii: if the current feature selection scheme Xi to be inserted is the same as the pointer node Cur _ node, flag =1, and the instruction is made

Go to step iv; when judging Xi (depth (Cur _ node)) ≠ Cur _ node (depth (Cur _ node)), jumping to step iv; when Xi (depth (Cur _ node)) = Cur _ node (depth (Cur _ node)), hamming distances d1= hamming (Cur _ node, xtest) of the pointer node Cur _ node and the current optimal feature selection scheme xtest and hamming distances d2= hamming (Xi, xtest) of the feature selection scheme Xi to be inserted and xtest are calculated, respectively; if d1 is less than d2, let

Judging that f (Cur _ node) < f (Xtest), if Xtest = Cur _ node, otherwise, updating the function evaluation times t = t +1, and skipping to the step iv; if d1 > d2, let->

Skipping to step iv; />

Step iv: if Xi (depth (Cur _ node)) =1, inserting Xi as the left child node of the Cur _ node into the feature tree, and inserting the Cur _ node as the right child node of the Cur _ node into the feature tree; if Xi (depth (Cur _ node)) =0, inserting Xi as the right child node of the Cur _ node into the feature tree, and inserting Cur _ node as the left child node of the Cur _ node into the feature tree;

and v: storing Xi into an optimized population P', if f (Xi) < f (Xtest), making Xtest = Xi, otherwise, keeping the Xtest = Xi, and updating the evaluation times of the function through t = t +1 after operation; if t is larger than or equal to MaxEval or mod (i/N) =0, feeding back the optimized population P', the optimal feature selection scheme Xtest and the depth osr thereof in the feature tree to a genetic operator and a guiding search operator respectively, wherein i/N remainder is represented by mod (i/N); otherwise, jumping to the step i.

2. A method of selecting gene signature of cancer disease based on historical data as claimed in claim 1 wherein: the method for calculating the average classification error rate after all the cancer disease gene characteristics on the training data set are selected by using the five-fold cross method in the step B comprises the following steps:

CEk represents the number of samples with classification errors in the feature selection scheme when the kth sample is used as the test set, and N represents the total number of samples in the test set; five error rates are thus obtained, which are respectively designated Erro = { Erro1, erro2, erro3, erro4, erro5}, and the average error rate = based on the values of the error rates are calculated>

3. A method of selecting cancer disease gene signatures based on historical data as claimed in claim 2 wherein: the initialized cancer disease gene signature population is a two-dimensional matrix represented as P = { X1, X2, …, XN }, where N is the population size; xi = { Xi1, xi2, …, xin } is a binary string, representing a feature selection scheme; when xij =1, a j-th feature gene representing an i-th feature selection scheme is selected, and when xij =0, it represents that a corresponding feature is not selected; n represents the total number of characteristic genes.

4. A method of selecting gene signature of cancer disease based on historical data as claimed in claim 3 wherein: the fitness function f constructed in the step C is as follows:

where Np represents the number of features selected in the current feature selection scheme Xi,

selecting an average error rate for the scheme Xi for the feature; alpha is a variable coefficient and is used for adjusting the error rate and the weight of the selected feature number in the function;

wherein alpha t is epsilon [0.15,0.4]And t is a preset parameter and represents the current evaluation times, and MaxEval is the maximum function evaluation times.

5. A method of selecting cancer disease gene signatures based on historical data as claimed in claim 1 wherein: the method in the step E comprises the following steps: taking the optimized population P ' as a parent population of a genetic operator, executing uniform crossing operation and standard mutation operation to obtain a sub-population P ', selecting a part of feature selection schemes from the sub-population P ' through guide search probability to execute guide search; replacing the front osr bits of the selected partial feature selection scheme with the front osr bits of the optimal feature selection scheme Xtest to obtain an optimal population P'; combining the initial population P with the optimal population P 'and selecting the population' P entering the next generation by adopting an elite environment selection strategy; wherein the cross probability pc belongs to [0.5,0.8], the mutation probability pr =1/n, and the guide search probability is 0.05-0.1.

6. A method according to claim 5, wherein the selection of the genetic profile of the cancer disease is based on historical data, wherein: if t is less than MaxEval, making P = 'P, emptying the optimized population P', and returning to the step D; and if t is larger than or equal to MaxEval, outputting an optimal feature selection scheme Xtest, and verifying the classification error rate of the optimal feature selection scheme Xtest on the test set.