CN110070916B - Historical data-based cancer disease gene characteristic selection method - Google Patents
- Publication number: CN110070916B (application CN201910355711.3A)
- Authority: CN (China)
- Prior art keywords: node, cur, feature, population, selection scheme
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2111—Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a cancer disease gene characteristic selection method based on historical data, which comprises the following steps. A: dividing the cancer disease gene data into a training set and a testing set; B: calculating the total average error rate after all the characteristics on the training set are selected; C: generating an initial population and constructing a fitness function; D: recording all the feature selection schemes into a feature tree, adjusting the distribution of the feature selection schemes, taking the feature selection scheme with the minimum fitness value as the optimal feature selection scheme, and returning the results to the genetic operator and the guided search operator; E: guiding the evolution direction of the feature population; F: judging the termination condition; if the termination condition is not met, repeating steps D-F, and if it is met, outputting the optimal solution. The invention has the advantages that data dimensionality can be effectively reduced and prediction accuracy improved; genes related to diseases such as cancer are screened by combining a feature tree with a genetic algorithm, thereby assisting diagnosis and treatment.
Description
Technical Field
The invention relates to the technical field of disease pathogenic gene screening, in particular to a cancer disease gene characteristic selection method based on historical data.
Background
Cancer, one of the most common malignant tumors in humans, seriously affects people's physical and mental health and has drawn close attention from experts and scholars in many fields. Cancer patients generate large amounts of clinical data during examination, treatment, and medication. These data are very important for predicting the occurrence of malignant tumors and the progression of the disease, but they tend to be high-dimensional, heterogeneous, disordered, and small-sample.
With the rapid development of computer technology, it is natural to process these complex data with a computer. By establishing a suitable machine learning prediction model, the occurrence and development of cancer can be predicted, providing technical guidance and reference opinions for the rehabilitation and postoperative medication of patients. Cancer disease gene data are characterized by high dimensionality, redundancy, and missing or abnormal values, so before processing the data it is usually necessary to clean them to remove irrelevant and redundant entries.
Simple cleaning of the data does not effectively solve the high-dimensionality problem of cancer disease gene data. Therefore, dedicated engineering methods and algorithms need to be developed for high-dimensional clinical data in order to reduce the dimensionality and improve the accuracy of the prediction model. Such methods should be designed with two aspects in mind. First, the model should be practical: when the prediction model is embedded into an evaluation system, a doctor must enter the patient's condition into the prediction platform, and if a large amount of sample data is required for each prediction, the doctor's burden increases and no practical benefit is achieved. Second, the accuracy of the prediction must be considered: an accurate prediction model can assist doctors in diagnosis and provide suggestions for rehabilitation treatment and for reducing the risk of disease recurrence.
Accordingly, for cancer disease gene data, a large number of methods have been proposed to select the important features, reduce the dimensionality of the data, and improve the accuracy of the prediction model. These methods mainly include single-factor analysis, recursive feature elimination, feature importance analysis, and the like. However, existing methods often suffer from an insignificant dimensionality-reduction effect, excessive redundant features, excessively high time complexity, and low practicability. For the feature selection problem on high-dimensional data, the engineering method to be developed needs to reduce time complexity while guaranteeing the accuracy of the prediction model.
Disclosure of Invention
The invention aims to provide a cancer gene characteristic selection method which can effectively reduce data dimensionality and improve the accuracy of a prediction model.
The invention solves the technical problems through the following technical scheme:
A method for selecting cancer disease gene characteristics based on historical data, comprising the following steps:
Step A: dividing the cancer disease gene data into a training data set and a testing data set;
Step B: calculating, by five-fold cross-validation, the total average error rate after all the features on the training data set are selected;
Step C: randomly generating an initialization feature population, and constructing a fitness function for evaluating the feature selection schemes;
Step D: sequentially recording the feature selection schemes of the feature population into the feature tree, adjusting the distribution of the feature selection schemes to obtain the adjusted feature population, and taking the feature selection scheme with the minimum fitness value as the optimal feature selection scheme;
Step E: guiding the evolution direction of the feature population according to the genetic operator and the guided search operator: taking the feature population optimized by the feature tree as the parent population and generating an offspring population with the genetic operator; taking the position information of the optimal feature selection scheme in the feature tree as the search direction and strengthening local search with the guided search operator;
Step F: judging the termination condition; if it is not met, repeating steps D-F; if it is met, outputting the optimal feature selection scheme and verifying its classification error rate on the test set.
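The train/test division of step A can be sketched in Python; the 7:3 ratio follows the detailed description of the embodiment, and the helper name `split_train_test` is illustrative only:

```python
import random

def split_train_test(samples, labels, train_frac=0.7, seed=0):
    """Sketch of step A: shuffle the cancer gene samples and split
    them into a training set and a testing set.  The 7:3 ratio is
    an assumption taken from the detailed description below."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_frac)
    train = [(samples[i], labels[i]) for i in idx[:cut]]
    test = [(samples[i], labels[i]) for i in idx[cut:]]
    return train, test

# toy data: 10 samples with 4 gene-expression values each
X = [[random.random() for _ in range(4)] for _ in range(10)]
y = [i % 2 for i in range(10)]
train, test = split_train_test(X, y)
```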
Preferably, the method for calculating, in step B, the average classification error rate after all the features on the training data set are selected by five-fold cross-validation is as follows:
equally divide the training data set into five parts; in turn take one part as the test set and the other four parts as the training set to obtain the error rate erro_k = CE_k / N, where CE_k represents the number of misclassified samples under the feature selection scheme when the k-th part is taken as the test set, and N represents the total number of samples in the test set. Five error rates are thus obtained, denoted Erro = {erro_1, erro_2, erro_3, erro_4, erro_5}, and the average error rate is calculated as the mean of the five, \bar{erro} = (1/5)·Σ_{k=1..5} erro_k.
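A minimal sketch of the five-fold procedure above; `train_and_count_errors` is a hypothetical callback that fits any classifier on the four training folds and returns the misclassification count CE_k on the held-out fold:

```python
def five_fold_error(samples, labels, train_and_count_errors):
    """Sketch of step B: split the training data into five equal
    folds, use each fold once as the test set, and average the five
    error rates erro_k = CE_k / N.  `train_and_count_errors` is a
    hypothetical callback returning the number of misclassified
    held-out samples CE_k."""
    n = len(samples)
    fold = n // 5
    errors = []
    for k in range(5):
        held = set(range(k * fold, (k + 1) * fold))
        test_s = [samples[i] for i in held]
        test_l = [labels[i] for i in held]
        train_s = [samples[i] for i in range(n) if i not in held]
        train_l = [labels[i] for i in range(n) if i not in held]
        ce_k = train_and_count_errors(train_s, train_l, test_s, test_l)
        errors.append(ce_k / len(test_s))      # erro_k = CE_k / N
    return sum(errors) / 5.0                   # average error rate
```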
Preferably, the initialization feature population is a two-dimensional matrix, denoted P = {X_1, X_2, …, X_N}, where N is the population size; X_i = {x_i1, x_i2, …, x_in} is a binary string representing a feature selection scheme. When x_ij = 1, the j-th characteristic gene of the i-th feature selection scheme is selected; when x_ij = 0, the corresponding feature is not selected; n represents the total number of characteristic genes.
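The binary encoding above can be sketched as follows; the helper name `init_population` is illustrative only:

```python
import random

def init_population(N, n, seed=0):
    """Sketch of the encoding in step C: the feature population is an
    N-by-n binary matrix; row X_i is one feature selection scheme and
    x_ij = 1 means the j-th characteristic gene is selected."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(N)]

P = init_population(N=5, n=8)
# indices of the genes selected by the first scheme X_1
selected = [j for j, bit in enumerate(P[0]) if bit == 1]
```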
Preferably, the fitness function f constructed in step C evaluates a feature selection scheme X_i from its average error rate \bar{erro}(X_i) and the number of selected features N_p, where α is a variable coefficient used to adjust the weights of the error rate and the selected-feature count in the function; α_t ∈ [0.15, 0.4] is a preset parameter, t represents the current number of function evaluations, and MaxEval is the maximum number of function evaluations.
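The exact fitness formula appears only as a figure in the patent text; a weighted-sum form consistent with the surrounding description (α trading the average error rate off against the fraction of selected features) is one plausible sketch, not the patent's literal definition:

```python
def fitness(avg_error, n_selected, n_total, alpha):
    """Hypothetical fitness sketch: alpha weights the average
    classification error rate against the fraction N_p / n of
    selected features, so smaller values are better.  This is an
    assumed form; the patent's literal formula is not reproduced
    in the extracted text."""
    return alpha * avg_error + (1 - alpha) * (n_selected / n_total)
```

Under such a form, a scheme with a lower error rate or fewer selected genes receives a smaller fitness value, matching step D's choice of the minimum-fitness scheme as optimal.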
Preferably, the feature tree in step D is a binary tree, and the method for recording all feature selection schemes of the initial population into the feature tree comprises the following steps:
step i: detect the data in the feature tree. If the feature tree is empty, initialize the function evaluation count t = 0 and i = 1, input the preset value of MaxEval, empty the optimized population P' used to store optimized feature selection schemes, and initialize the pointer node Cur_node to empty, where Cur_node denotes the feature selection scheme stored at the feature-tree node it points to; insert the feature selection scheme X_i as the root node, set the optimal feature selection scheme X_best = X_i, store X_i into the optimized population P', and jump to step ii. If the feature tree is not empty, jump directly to step ii;
step ii: let i = i + 1, point Cur_node to the root node, and initialize the repeat-access flag flag = 0. If Cur_node is a leaf node, jump to step iii. If Cur_node has two child nodes, then when X_i(depth(Cur_node)) = 1, i.e. the bit of the current feature selection scheme X_i at position depth(Cur_node) is 1, Cur_node points to its left child; when X_i(depth(Cur_node)) = 0, Cur_node points to its right child, where depth(Cur_node) denotes the depth of the pointer node Cur_node. Repeat the current step until Cur_node is a leaf node;
step iii: if the feature selection scheme X_i to be inserted is identical to Cur_node, set flag = 1, flip the bit X_i(depth(Cur_node)), i.e. let X_i(depth(Cur_node)) = 1 − X_i(depth(Cur_node)), and jump to step iv. If X_i(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), jump to step iv. When X_i(depth(Cur_node)) = Cur_node(depth(Cur_node)), compute the Hamming distance d_1 = hamming(Cur_node, X_best) between the pointer node Cur_node and the current optimal feature selection scheme X_best, and the Hamming distance d_2 = hamming(X_i, X_best) between the scheme X_i to be inserted and X_best;
if d_1 < d_2, flip the bit Cur_node(depth(Cur_node)); because Cur_node has changed, re-evaluate its fitness: if f(Cur_node) < f(X_best), update X_best, otherwise leave X_best unchanged; update the function evaluation count t = t + 1 and jump to step iv. If d_1 > d_2, flip the bit X_i(depth(Cur_node)) and jump to step iv;
step iv: if X_i(depth(Cur_node)) = 1, insert X_i into the feature tree as the left child of Cur_node and insert Cur_node as the right child of Cur_node; if X_i(depth(Cur_node)) = 0, insert X_i as the right child of Cur_node and Cur_node as the left child of Cur_node;
step v: store X_i into the optimized population P'. If f(X_i) < f(X_best), let X_best = X_i; otherwise X_best is unchanged. After this operation, update the function evaluation count t = t + 1. If t ≥ MaxEval or mod(i/N) = 0, feed the optimized population P' to the genetic operator and the optimal feature selection scheme X_best together with its depth osr in the feature tree to the guided search operator, where mod(i/N) denotes the remainder of i divided by N; otherwise, jump to step i.
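Steps i-iv can be sketched as a bit-routed binary-tree insertion; for brevity this sketch omits the duplicate and Hamming-distance adjustments of step iii, and the class and function names are illustrative only:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    scheme: List[int]                 # binary feature selection scheme
    left: Optional["Node"] = None     # followed when the routed bit is 1
    right: Optional["Node"] = None    # followed when the routed bit is 0

def hamming(a, b):
    """Hamming distance between two schemes, as used in step iii."""
    return sum(x != y for x, y in zip(a, b))

def insert(root, scheme):
    """Sketch of steps i-iv: descend by the scheme's bits (1 -> left,
    0 -> right, indexed by the current depth), then split the reached
    leaf so the new scheme and the old leaf scheme become siblings.
    The step-iii bit adjustments are omitted in this sketch."""
    if root is None:
        return Node(scheme)           # step i: empty tree -> root
    cur, depth = root, 0
    while cur.left is not None and cur.right is not None:
        cur = cur.left if scheme[depth] == 1 else cur.right
        depth += 1                    # step ii: route down to a leaf
    if scheme[depth] == 1:            # step iv: split the leaf
        cur.left, cur.right = Node(scheme), Node(cur.scheme)
    else:
        cur.right, cur.left = Node(scheme), Node(cur.scheme)
    return root
```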
Preferably, the method in step E is: take the optimized population P' as the parent population of the genetic operator and execute uniform crossover and standard mutation to obtain the offspring population P''; select, with the guided search probability, a part of the feature selection schemes in P'' to undergo guided search, replacing the first osr bits of each selected scheme with the values of the first osr bits of the optimal feature selection scheme X_best, thereby obtaining the population P'''. Combine the initial population P with P''' and select the next-generation population using an elite environment-selection strategy. The crossover probability is p_c ∈ [0.5, 0.8], the mutation probability is p_r = 1/n, and the guided search probability is 0.05 to 0.1.
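A sketch of the step E operators: uniform crossover and bit-flip mutation are the standard GA operators the text names, and `guided_search` reflects the first-osr-bits replacement described above (function names are illustrative):

```python
import random

rng = random.Random(1)

def uniform_crossover(a, b, p_c=0.5):
    """Uniform crossover: each bit position is swapped between the
    two parents with probability p_c (the patent uses p_c in
    [0.5, 0.8])."""
    c1, c2 = a[:], b[:]
    for j in range(len(a)):
        if rng.random() < p_c:
            c1[j], c2[j] = c2[j], c1[j]
    return c1, c2

def mutate(x, p_r):
    """Standard bit-flip mutation; the patent sets p_r = 1/n."""
    return [1 - bit if rng.random() < p_r else bit for bit in x]

def guided_search(x, x_best, osr):
    """Replace the first osr bits of a selected scheme with the first
    osr bits of the optimal scheme X_best, osr being the depth of
    X_best in the feature tree."""
    return x_best[:osr] + x[osr:]
```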
Preferably, if t < MaxEval, let P take the value of the next-generation population, empty the optimized population P', and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme X_best and verify the classification error rate of X_best on the test set.
The cancer gene characteristic selection method based on historical data has the advantages that: the data dimensionality can be effectively reduced, the prediction accuracy is improved, and related genes of diseases such as cancer and the like are screened by combining a characteristic tree and a genetic algorithm, so that assistance is provided for diagnosis and treatment.
Drawings
FIG. 1 is a flow chart of a method for selecting a cancer disease gene signature based on historical data as provided by an embodiment of the invention;
FIG. 2 is a flow chart of an algorithm for optimizing populations through a feature tree according to an embodiment of the present invention;
FIG. 3 is a first schematic diagram of a method for optimizing a population through a feature tree according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a second method for optimizing a population through a feature tree according to an embodiment of the present invention;
FIG. 5 is a third schematic diagram of a method for optimizing a population through a feature tree according to an embodiment of the present invention;
fig. 6 is a fourth schematic diagram of a method for optimizing a population through a feature tree according to an embodiment of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
As shown in fig. 1, the present embodiment provides a method for selecting cancer disease gene characteristics based on historical data, comprising the steps of:
Step A: divide the cancer disease gene data into a training data set and a testing data set; specifically, the data are divided into ten equal parts, seven of which are used as the training data set and three as the testing data set.
Step B: calculate, by five-fold cross-validation, the total average error rate after all the features on the training data set are selected. Specifically: equally divide the training data set into five parts; in turn take one part as the test set and the other four parts as the training set to obtain the error rate erro_k = CE_k / N, where CE_k represents the number of misclassified samples under the feature selection scheme when the k-th part is taken as the test set, and N represents the total number of samples in the test set. Five error rates are thus obtained, denoted Erro = {erro_1, erro_2, erro_3, erro_4, erro_5}, and the average error rate \bar{erro} = (1/5)·Σ_{k=1..5} erro_k is calculated. The above steps are all prior art in the field of processing sample data with genetic algorithms and are not described in detail here.
Step C: randomly generate an initialization feature population, which is a two-dimensional matrix denoted P = {X_1, X_2, …, X_N}, where N is the population size; X_i = {x_i1, x_i2, …, x_in} is a binary string representing a feature selection scheme. When x_ij = 1, the j-th characteristic gene of the i-th feature selection scheme is selected; when x_ij = 0, the corresponding feature is not selected; n represents the total number of characteristic genes;
construct the fitness function f for evaluating a feature selection scheme: f combines the average error rate \bar{erro}(X_i) of the current feature selection scheme X_i with the number of selected features N_p, where α is a variable coefficient used to adjust the weights of the error rate and the selected-feature count in the function; α_t ∈ [0.15, 0.4] is a preset parameter, t represents the current number of function evaluations, and MaxEval is the maximum number of function evaluations.
Step D: sequentially record the feature selection schemes of the feature population into the feature tree, adjust the distribution of the feature selection schemes to obtain the adjusted feature population, and take the feature selection scheme with the minimum fitness value as the optimal feature selection scheme. The feature tree is a binary tree, and the method specifically comprises the following steps:
step i: detect the data in the feature tree. If the feature tree is empty, initialize the function evaluation count t = 0 and i = 1, input the preset value of MaxEval, empty the optimized population P' used to store optimized feature selection schemes, and initialize the pointer node Cur_node to empty, where Cur_node denotes the feature selection scheme stored at the feature-tree node it points to; insert the feature selection scheme X_i as the root node, set the optimal feature selection scheme X_best = X_i, store X_i into the optimized population P', and jump to step ii. If the feature tree is not empty, jump directly to step ii;
step ii: let i = i + 1, point Cur_node to the root node, and initialize the repeat-access flag flag = 0. If Cur_node is a leaf node, jump to step iii. If Cur_node has two child nodes, then when X_i(depth(Cur_node)) = 1, Cur_node points to its left child; when X_i(depth(Cur_node)) = 0, Cur_node points to its right child, where depth(Cur_node) denotes the depth of the pointer node Cur_node. Repeat the current step until Cur_node is a leaf node;
step iii: if the feature selection scheme X_i to be inserted is identical to Cur_node, set flag = 1, flip the bit X_i(depth(Cur_node)), i.e. let X_i(depth(Cur_node)) = 1 − X_i(depth(Cur_node)), and jump to step iv. If X_i(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), jump to step iv. When X_i(depth(Cur_node)) = Cur_node(depth(Cur_node)), compute the Hamming distance d_1 = hamming(Cur_node, X_best) between the pointer node Cur_node and the current optimal feature selection scheme X_best, and the Hamming distance d_2 = hamming(X_i, X_best) between the scheme X_i to be inserted and X_best;
if d_1 < d_2, flip the bit Cur_node(depth(Cur_node)); because Cur_node has changed, re-evaluate its fitness: if f(Cur_node) < f(X_best), update X_best, otherwise leave X_best unchanged; update the function evaluation count t = t + 1 and jump to step iv. If d_1 > d_2, flip the bit X_i(depth(Cur_node)) and jump to step iv;
step iv: if X_i(depth(Cur_node)) = 1, insert X_i into the feature tree as the left child of Cur_node and insert Cur_node as the right child of Cur_node; if X_i(depth(Cur_node)) = 0, insert X_i as the right child of Cur_node and Cur_node as the left child of Cur_node;
step v: store X_i into the optimized population P'. If f(X_i) < f(X_best), let X_best = X_i; otherwise X_best is unchanged. After this operation, update the function evaluation count t = t + 1. If t ≥ MaxEval or mod(i/N) = 0, feed the optimized population P' to the genetic operator and the optimal feature selection scheme X_best together with its depth osr in the feature tree to the guided search operator, where mod(i/N) denotes the remainder of i divided by N; if t < MaxEval, jump to step i.
Step E: guide the evolution direction of the feature population according to the genetic operator and the guided search operator: take the optimized population P' as the parent population of the genetic operator and execute uniform crossover and standard mutation to obtain the offspring population P''; select, with the guided search probability, a part of the feature selection schemes in P'' to undergo guided search, replacing the first osr bits of each selected scheme with the values of the first osr bits of the optimal feature selection scheme X_best, thereby obtaining the population P'''. Combine the initial population P with P''' and select the next-generation population using an elite environment-selection strategy. The crossover probability is p_c ∈ [0.5, 0.8], the mutation probability is p_r = 1/n, and the guided search probability is 0.05 to 0.1.
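The elite environment-selection step can be sketched as keeping the N best schemes from the merged populations; `fitness` is a caller-supplied evaluation function and the name `elite_selection` is illustrative:

```python
def elite_selection(parents, offspring, fitness, N):
    """Sketch of the elite environment-selection strategy in step E:
    merge the initial population P with the guided offspring
    population and keep the N schemes with the smallest fitness
    values (smaller is better)."""
    merged = list(parents) + list(offspring)
    merged.sort(key=fitness)      # Python's sort is stable
    return merged[:N]
```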
Step F: judge the termination condition. If t < MaxEval, let P take the value of the next-generation population, empty the optimized population P', and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme X_best and verify the classification error rate of X_best on the test set.
The data in Table 1 are used as the initial population P to explain in detail how the feature selection schemes are adjusted and arranged on the feature tree:
table 1: initial population
First, the parameters are set. As can be seen from Table 1, N = 5, so the mutation probability is p_r = 1/6 and the guided search probability is 0.05; let p_c = 0.5 and α_t = 0.2. An empty feature tree is selected, and the function evaluation count is initialized with t = 0, i = 1, and MaxEval = 10. Referring to FIG. 3, X_1 is inserted as the root node of the empty feature tree, and the optimal feature selection scheme is set to X_best = X_1, with f(X_best) = 5 at this point; X_1 is stored into the empty optimized population P'.
Inserting feature selection scheme X_2:
In step ii, i = 2 and the pointer node Cur_node points to the root node, i.e. Cur_node = X_1, with the repeat-access flag flag = 0. Cur_node is a leaf node, so jump to step iii. The scheme X_2 to be inserted differs from Cur_node, so flag = 0. At this point depth(Cur_node) = 1, and X_2(depth(Cur_node)) = 1 is the same as the bit Cur_node(depth(Cur_node)) = 1, so the Hamming distances of Cur_node and X_2 to X_best are computed; as shown in FIG. 4, d_1 = 0 and d_2 = 4. Since d_1 < d_2, the bit x_11 of Cur_node is flipped to 0. Because the pointer node Cur_node has changed, its fitness must be re-evaluated; at this time f(Cur_node) = f(X_best) = 2 (X_best refers to the same node), X_best is unchanged, and the function evaluation count is updated to t = 1. Jumping to step iv: since X_2(depth(Cur_node)) = 1, X_2 is inserted into the feature tree as the left child of the pointer node Cur_node and the modified Cur_node as its right child. X_2 is stored into the optimized population P'; since f(X_best) = 2 < f(X_2) = 4, X_best is unchanged and the evaluation count is updated to t = 2. As t < MaxEval and mod(i/N) ≠ 0, jump to step i to continue the insertion.
Inserting feature selection scheme X_3:
Referring to FIG. 5, the feature tree is not empty, so step ii is executed with i = 3. The pointer node Cur_node points to the root node, which has two children; X_3(depth(Cur_node)) = x_31 = 1, so the pointer moves to the left child, i.e. Cur_node = X_2, which is a leaf node, and step iii is executed. X_3 differs from Cur_node, so flag = 0. At this point x_32 = 0 and the corresponding bit of the pointer node is Cur_node(depth(Cur_node)) = x_22 = 0, so the Hamming distances of Cur_node and X_3 to X_best must be computed, giving d_1 = 4 > d_2 = 2; therefore the bit of X_3 is flipped, i.e. x_32 = 1, and step iv is executed: X_3 is inserted into the feature tree as the left child of Cur_node and Cur_node as its right child. X_3 is stored into the optimized population P'. X_best has not changed, so its fitness need not be re-evaluated and f(X_best) = 2; for the inserted scheme we obtain f(X_best) = 2 < f(X_3) = 5, so X_best is unchanged and the function evaluation count is updated to t = 3. As t < MaxEval and mod(i/N) ≠ 0, jump to step i to continue the insertion.
Inserting feature selection scheme X_4:
Referring to FIG. 6, the feature tree is not empty, so step ii is executed with i = 4. The pointer node Cur_node points to the root node, which has two children; X_4(depth(Cur_node)) = x_41 = 1, so the pointer moves to the left child. This node still has two children, so step ii continues: X_4(depth(Cur_node)) = x_42 = 0, so the pointer moves to the right child, i.e. Cur_node = X_2, which is a leaf node, and step iii is executed. From Table 1 it can be seen that X_4 = Cur_node, i.e. flag = 1, so the bit of X_4 is flipped, i.e. x_43 = 1, and the method jumps to step iv: the updated X_4 is inserted into the feature tree as the left child of Cur_node and Cur_node as its right child. X_4 is stored into the population P'; since f(X_best) = 2 > f(X_4) = 1, let X_best = X_4, and the function evaluation count is updated to t = 3. As t < MaxEval and mod(i/N) ≠ 0, jump to step i to continue the insertion.
Inserting feature selection scheme X_5:
Referring to FIG. 7, the feature tree is not empty, so step ii is executed with i = 5. The pointer node Cur_node points to the root node, which has two children; X_5(depth(Cur_node)) = x_51 = 0, so the pointer moves to the right child, i.e. Cur_node = X_1, which is a leaf node, and step iii is executed. Cur_node differs from X_5, so flag = 0. At this point the bit of the pointer node is Cur_node(depth(Cur_node)) = x_12 = 0, while the bit of the scheme to be inserted is X_5(depth(Cur_node)) = x_52 = 1, so step iv is executed directly: X_5 is inserted into the feature tree as the left child of Cur_node and Cur_node as its right child. X_5 is stored into the population P'; since f(X_best) = 1 < f(X_5) = 6, X_best is unchanged, and the function evaluation count is updated to t = 4. Now mod(i/N) = 0, so the optimized population P' is fed back to the genetic operator, and the optimal feature selection scheme X_best together with its depth osr in the feature tree is fed back to the guided search operator; the population P' is evolved with the uniform crossover and standard mutation operations of the existing genetic algorithm, guiding the population evolution to finally obtain the next-generation population. Since t < MaxEval, let P be the next-generation population, empty the optimized population P', return to step D, and continue the insertion operations until the termination condition is met.
This embodiment verifies the effectiveness of the proposed algorithm using data sets from the scikit-feature open-source feature selection library project of Arizona State University. As shown in Table 2, the Colon cancer data set Colon, the Lung cancer data set Lung, and the Glioma data set Glioma are selected for verification, with the error rate and the number of selected features as evaluation indices; specifically, the lower the error rate and the smaller the number of features, the better the algorithm performance.
Table 2: cancer disease gene data

| Data set | Number of features | Number of samples | Number of classes |
| --- | --- | --- | --- |
| Colon | 2000 | 62 | 2 |
| Lung | 3312 | 203 | 5 |
| Glioma | 4434 | 50 | 4 |
The classical genetic algorithm (GA) and the particle swarm optimization algorithm (PSO) are selected for comparison with the algorithm provided by this embodiment (denoted BSPGA) in simulation experiments. The results, shown in Table 3, indicate that the performance of the proposed algorithm is clearly superior to that of both GA and PSO.
Table 3: comparison of simulation experiments
The effectiveness and usage scenario of the algorithm are described in this application starting from cancer data, but the algorithm is also effective for other diseases related to genetic variation.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only examples of the present invention, and are not intended to limit the present invention, and any modifications, equivalent substitutions, improvements and the like made by those skilled in the art without departing from the spirit and principles of the present invention should fall within the protection scope defined by the claims of the present invention.
Claims (6)
1. A method for selecting a gene signature for a cancer disease based on historical data, comprising the steps of:
step A: dividing the cancer disease gene data into a training data set and a testing data set;
step B: calculating the average classification error rate of the selected cancer disease gene features on the training data set by using five-fold cross-validation;
step C: randomly generating an initialized cancer disease gene feature population, and constructing a fitness function for evaluating cancer disease gene feature selection schemes;
step D: sequentially recording the characteristic selection schemes in the cancer disease gene characteristic population into a cancer disease gene characteristic tree, adjusting the distribution of the characteristic selection schemes to obtain an adjusted cancer disease gene characteristic population, and taking the characteristic selection scheme with the minimum fitness value as an optimal characteristic selection scheme;
step E: guiding the evolution direction of the cancer disease gene characteristic population according to the genetic operator and the oriented search operator; taking the cancer disease gene characteristic population optimized by the cancer disease gene characteristic tree as a parent population, and generating an offspring population by using a genetic operator; taking the position information of the feature tree where the optimal cancer disease gene feature selection scheme is as a search direction, and reinforcing local search by utilizing a guide search operator;
step F: judging a termination condition, if the termination condition is not met, repeating the steps D-F, if the termination condition is met, outputting an optimal feature selection scheme, and verifying the classification error rate of the optimal feature selection scheme on a test set;
in step D, said cancer disease gene signature tree is a binary tree, and said method of recording all cancer disease gene signature selection schemes in the initial cancer disease gene signature population into the signature tree comprises the steps of:
step i: detect the data in the feature tree; if the feature tree is empty, initialize the function evaluation count t = 0 and i = 1, input the set value of MaxEval, empty the optimized population P' used to store optimized feature selection schemes, and initialize the pointer node Cur_node to empty, where the pointer node indicates the feature selection scheme stored in the feature tree node it points to; insert the feature selection scheme Xi as the root node, store Xi into the optimized population P', and jump to step ii; if the feature tree is not empty, jump directly to step ii;
step ii: let i = i + 1, point the pointer node Cur_node to the root node, and initialize the repeat-access flag flag = 0; if Cur_node is a leaf node, jump to step iii; if the pointer node Cur_node has two child nodes and Xi(depth(Cur_node)) = 1 for the current feature selection scheme Xi, point Cur_node to the left child node; if Xi(depth(Cur_node)) = 0, point Cur_node to the right child node, where depth(Cur_node) represents the depth of the pointer node Cur_node; repeat the current step until the pointer node Cur_node is a leaf node;
step iii: if the feature selection scheme Xi to be inserted is identical to the pointer node Cur_node, let flag = 1 and jump to step iv; if Xi(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), jump to step iv; if Xi(depth(Cur_node)) = Cur_node(depth(Cur_node)), calculate the Hamming distance d1 = hamming(Cur_node, Xtest) between the pointer node Cur_node and the current optimal feature selection scheme Xtest, and the Hamming distance d2 = hamming(Xi, Xtest) between the feature selection scheme Xi to be inserted and Xtest; if d1 < d2, evaluate f(Cur_node): if f(Cur_node) < f(Xtest), let Xtest = Cur_node, otherwise keep Xtest unchanged; then update the function evaluation count t = t + 1 and jump to step iv; if d1 > d2, jump directly to step iv;
step iv: if Xi(depth(Cur_node)) = 1, insert Xi into the feature tree as the left child node and insert Cur_node into the feature tree as the right child node; if Xi(depth(Cur_node)) = 0, insert Xi into the feature tree as the right child node and insert Cur_node into the feature tree as the left child node;
step v: store Xi into the optimized population P'; if f(Xi) < f(Xtest), let Xtest = Xi, otherwise keep Xtest unchanged; after this operation, update the function evaluation count t = t + 1; if t ≥ MaxEval or mod(i/N) = 0, where mod(i/N) denotes the remainder of i divided by N, feed the optimized population P', the optimal feature selection scheme Xtest, and its depth osr in the feature tree back to the genetic operator and the guided search operator respectively; otherwise, jump to step i.
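The tree-recording procedure of steps i–iv behaves like a binary trie: the bits of a scheme route the walk (1 to the left child, 0 to the right), and a reached leaf is split so that the stored scheme and the new one become siblings. A minimal sketch in Python, with the Hamming-distance tie-break of step iii and the population bookkeeping of step v omitted; all names are hypothetical, not from the patent:

```python
class Node:
    """Feature-tree node: leaves store a scheme, internal nodes store None."""
    __slots__ = ("scheme", "left", "right")

    def __init__(self, scheme=None):
        self.scheme, self.left, self.right = scheme, None, None


def insert(node, xi, depth=0):
    """Insert bit-list xi into the feature trie (steps i-iv, simplified).

    A 1-bit at the current depth routes to the left child, a 0-bit to the
    right (step ii).  A leaf holding a different scheme is split so both
    schemes become leaves one level deeper (step iv); an identical scheme
    sets the duplicate flag (step iii).  Returns (subtree, was_duplicate).
    """
    if node is None:
        return Node(list(xi)), False          # empty slot: xi stored here
    if node.scheme is not None:               # reached a leaf
        if node.scheme == list(xi):
            return node, True                 # duplicate scheme: flag = 1
        old = node.scheme
        node = Node()                         # split the leaf (step iv)
        node, _ = insert(node, old, depth)    # re-hang the old scheme
        return insert(node, xi, depth)        # then place the new one
    if xi[depth] == 1:                        # step ii: route by bit
        node.left, dup = insert(node.left, xi, depth + 1)
    else:
        node.right, dup = insert(node.right, xi, depth + 1)
    return node, dup
```

Schemes sharing a prefix are pushed deeper until a differing bit separates them, so every leaf path spells out a prefix of its scheme's bit string.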
2. A method for selecting cancer disease gene features based on historical data as claimed in claim 1, wherein the method for calculating the average classification error rate of the selected cancer disease gene features on the training data set by using five-fold cross-validation in step B comprises the following steps:
equally divide the training data set into five parts, and in turn select one part as the test set and the other four parts as the training set, obtaining the error rate Errok = CEk/N, where CEk represents the number of misclassified samples of the feature selection scheme when the kth part serves as the test set and N represents the total number of samples in that test set; the five error rates so obtained are denoted Erro = {Erro1, Erro2, Erro3, Erro4, Erro5}, and the average error rate is calculated as Erro_avg = (Erro1 + Erro2 + Erro3 + Erro4 + Erro5)/5.
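The five-fold procedure of this claim can be sketched as follows; `train_and_predict` is a hypothetical classifier callback, since the patent does not fix a specific classifier:

```python
def five_fold_error(samples, labels, train_and_predict):
    """Average five-fold classification error rate.

    train_and_predict(train_X, train_y, test_X) -> predicted labels
    (hypothetical callback standing in for whichever classifier is used).
    """
    n = len(samples)
    fold = n // 5
    errors = []
    for k in range(5):
        lo = k * fold
        hi = (k + 1) * fold if k < 4 else n   # last fold takes the remainder
        test_X, test_y = samples[lo:hi], labels[lo:hi]
        train_X = samples[:lo] + samples[hi:]
        train_y = labels[:lo] + labels[hi:]
        pred = train_and_predict(train_X, train_y, test_X)
        ce = sum(p != t for p, t in zip(pred, test_y))  # CE_k: misclassified
        errors.append(ce / len(test_y))                 # Erro_k = CE_k / N
    return sum(errors) / 5                              # mean of five rates
```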
3. A method for selecting cancer disease gene features based on historical data as claimed in claim 2, wherein: the initialized cancer disease gene feature population is a two-dimensional matrix represented as P = {X1, X2, …, XN}, where N is the population size; Xi = {xi1, xi2, …, xin} is a binary string representing a feature selection scheme; xij = 1 means that the j-th feature gene of the i-th feature selection scheme is selected, and xij = 0 means that the corresponding feature is not selected; n represents the total number of feature genes.
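The random initialization of this binary population can be sketched as below (function name hypothetical):

```python
import random


def init_population(N, n, seed=None):
    """Generate P = {X1..XN}: N binary feature-selection schemes over n
    feature genes, where xij = 1 selects gene j in scheme i."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(N)]
```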
4. A method for selecting cancer disease gene features based on historical data as claimed in claim 3, wherein the fitness function f constructed in step C is as follows:
where Np represents the number of features selected in the current feature selection scheme Xi, and Erro_avg(Xi) is the average error rate of the feature selection scheme Xi; α is a variable coefficient used to adjust the weights of the error rate and of the number of selected features in the function; αt ∈ [0.15, 0.4] is a preset parameter, t represents the current number of evaluations, and MaxEval is the maximum number of function evaluations.
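The claim's formula images are not reproduced in the text, so the sketch below assumes a common weighted combination, f(Xi) = α · Erro_avg(Xi) + (1 − α) · Np/n, with an assumed linear schedule for α over the evaluation budget; it illustrates the error-rate/feature-count trade-off only and is not the patented formula:

```python
def fitness(err_avg, n_selected, n_total, t, max_eval, alpha_t=0.3):
    """Assumed fitness sketch (the claim's formula image is not in the text):
    lower average error and fewer selected features both lower f."""
    # Assumed schedule: alpha grows from alpha_t toward 1 as evaluations
    # are spent, shifting weight toward the error-rate term.
    alpha = alpha_t + (1.0 - alpha_t) * t / max_eval
    return alpha * err_avg + (1.0 - alpha) * n_selected / n_total
```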
5. A method for selecting cancer disease gene features based on historical data as claimed in claim 1, wherein the method in step E comprises the following steps: take the optimized population P' as the parent population of the genetic operator and execute uniform crossover and standard mutation operations to obtain an offspring population P'; select a part of the feature selection schemes from the offspring population P' according to the guided search probability to execute the guided search, replacing the first osr bits of each selected feature selection scheme with the first osr bits of the optimal feature selection scheme Xtest to obtain the optimal population P'; combine the initial population P with the optimal population P' and select the next-generation population 'P by an elitist environmental selection strategy; the crossover probability pc ∈ [0.5, 0.8], the mutation probability pr = 1/n, and the guided search probability is 0.05-0.1.
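One generation of step E might be sketched as follows: uniform crossover, bit-flip mutation at rate 1/n, and a guided search that copies the first osr bits of the best scheme into a randomly chosen subset of offspring. All function names are hypothetical, and environmental selection is left out:

```python
import random


def evolve(parents, xbest, osr, pc=0.65, pr=None, pg=0.08, rng=random):
    """One-generation sketch of step E (pc in [0.5, 0.8], pr = 1/n,
    guided-search probability pg in 0.05-0.1)."""
    n = len(parents[0])
    pr = pr if pr is not None else 1.0 / n
    offspring = []
    shuffled = parents[:]
    rng.shuffle(shuffled)
    for a, b in zip(shuffled[::2], shuffled[1::2]):
        c1, c2 = a[:], b[:]
        if rng.random() < pc:                   # uniform crossover
            for j in range(n):
                if rng.random() < 0.5:
                    c1[j], c2[j] = c2[j], c1[j]
        offspring += [c1, c2]
    for child in offspring:                     # standard bit-flip mutation
        for j in range(n):
            if rng.random() < pr:
                child[j] ^= 1
    for child in offspring:                     # guided search toward xbest
        if rng.random() < pg:
            child[:osr] = xbest[:osr]
    return offspring
```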
6. A method for selecting cancer disease gene features based on historical data as claimed in claim 5, wherein: if t < MaxEval, let P = 'P, empty the optimized population P', and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme Xtest and verify its classification error rate on the test set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910355711.3A CN110070916B (en) | 2019-04-29 | 2019-04-29 | Historical data-based cancer disease gene characteristic selection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110070916A CN110070916A (en) | 2019-07-30 |
CN110070916B true CN110070916B (en) | 2023-04-18 |
Family
ID=67369518
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910355711.3A Active CN110070916B (en) | 2019-04-29 | 2019-04-29 | Historical data-based cancer disease gene characteristic selection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110070916B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111243662B (en) * | 2020-01-15 | 2023-04-21 | 云南大学 | Method, system and storage medium for predicting genetic pathway of pan-cancer based on improved XGBoost |
CN112580606B (en) * | 2020-12-31 | 2022-11-08 | 安徽大学 | Large-scale human body behavior identification method based on clustering grouping |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009067655A2 (en) * | 2007-11-21 | 2009-05-28 | University Of Florida Research Foundation, Inc. | Methods of feature selection through local learning; breast and prostate cancer prognostic markers |
CN109242100A (en) * | 2018-09-07 | 2019-01-18 | 浙江财经大学 | A kind of Niche Genetic method on multiple populations for feature selecting |
Non-Patent Citations (1)
Title |
---|
Fan Fangyun; Sun Jun. Cancer feature gene selection and classification based on the BQPSO algorithm. Journal of Jiangnan University (Natural Science Edition). 2015, (01), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN110070916A (en) | 2019-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ching et al. | Cox-nnet: an artificial neural network method for prognosis prediction of high-throughput omics data | |
Tong et al. | Deep learning based feature-level integration of multi-omics data for breast cancer patients survival analysis | |
Boulesteix et al. | IPF-LASSO: integrative L1-penalized regression with penalty factors for prediction based on multi-omics data | |
AU2016209478B2 (en) | Systems and methods for response prediction to chemotherapy in high grade bladder cancer | |
KR20210018333A (en) | Method and apparatus for multimodal prediction using a trained statistical model | |
Zhang et al. | DeepHE: Accurately predicting human essential genes based on deep learning | |
JP2020501240A (en) | Methods and systems for predicting DNA accessibility in pan-cancer genomes | |
Lu et al. | Predicting human lncRNA-disease associations based on geometric matrix completion | |
CN110070916B (en) | Historical data-based cancer disease gene characteristic selection method | |
Liang et al. | A deep learning framework to predict tumor tissue-of-origin based on copy number alteration | |
Wang et al. | Multiple surrogates and offspring-assisted differential evolution for high-dimensional expensive problems | |
Fu et al. | An improved multi-objective marine predator algorithm for gene selection in classification of cancer microarray data | |
Zhang et al. | MaLAdapt reveals novel targets of adaptive introgression from Neanderthals and Denisovans in worldwide human populations | |
Han et al. | Inverse-weighted survival games | |
CN110009128A (en) | Industry public opinion index prediction technique, device, computer equipment and storage medium | |
Li et al. | Assisted gene expression‐based clustering with AWNCut | |
Hem et al. | Robust modeling of additive and nonadditive variation with intuitive inclusion of expert knowledge | |
Vidyasagar | Probabilistic methods in cancer biology | |
JP7490576B2 (en) | Method and apparatus for multimodal prediction using trained statistical models - Patents.com | |
Kasianov et al. | Interspecific comparison of gene expression profiles using machine learning | |
Wu et al. | A practical algorithm based on particle swarm optimization for haplotype reconstruction | |
Mapelli | Multi-outcome feature selection via anomaly detection autoencoders: an application to radiogenomics in breast cancer patients | |
Ng | High-dimensional varying-coefficient models for genomic studies | |
Qiu | Imputation and Predictive Modeling with Biomedical Multi-Scale Data | |
Strauch | Improving diagnosis of genetic disease through computational investigation of splicing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||