CN110070916B - Historical data-based cancer disease gene characteristic selection method - Google Patents


Info

Publication number
CN110070916B
CN110070916B (application CN201910355711.3A)
Authority
CN
China
Prior art keywords: node, cur, feature, population, selection scheme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910355711.3A
Other languages
Chinese (zh)
Other versions
CN110070916A (en)
Inventor
邱剑锋
郭能
张兴义
苏延森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN201910355711.3A priority Critical patent/CN110070916B/en
Publication of CN110070916A publication Critical patent/CN110070916A/en
Application granted granted Critical
Publication of CN110070916B publication Critical patent/CN110070916B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2111Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Physiology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a historical-data-based method for selecting cancer disease gene features, which comprises the following steps. A: divide the cancer disease gene data into a training set and a test set. B: compute the overall average error rate on the training set when all features are selected. C: generate an initial population and construct a fitness function. D: record all feature selection schemes into a feature tree, adjust the distribution of the schemes, take the scheme with the smallest fitness value as the optimal feature selection scheme, and return the results to the genetic operator and the guided-search operator. E: guide the evolution direction of the feature population. F: test the termination condition; if it is not met, repeat steps D-F, and if it is met, output the optimal solution. The advantages of the invention are that it effectively reduces data dimensionality and improves prediction accuracy, and that, by combining a feature tree with a genetic algorithm, it screens genes related to diseases such as cancer, thereby assisting diagnosis and treatment.

Description

Historical data-based cancer disease gene characteristic selection method
Technical Field
The invention relates to the technical field of disease pathogenic gene screening, in particular to a cancer disease gene characteristic selection method based on historical data.
Background
Cancer, the most common class of malignant tumors in humans, seriously affects people's physical and mental health and has attracted close attention from experts and scholars in many fields. Cancer patients generate a large amount of clinical data during examination, treatment, and medication. These data are very important for predicting the occurrence of malignant tumors and the progression of the disease, but they tend to be high-dimensional, heterogeneous, noisy, and small-sample.
With the rapid development of computer technology, it is natural to process such complex data by computer. By building machine learning prediction models, the occurrence and development of cancer can be predicted, providing technical guidance and reference opinions for patient rehabilitation and postoperative medication. Cancer disease gene data are high-dimensional and redundant and contain abnormal or missing values, so before processing, the data usually need to be cleaned to remove irrelevant and redundant entries.
Simply cleaning the data, however, does not solve the high dimensionality of cancer disease gene data. Specific engineering methods and algorithms therefore need to be developed for high-dimensional clinical data in order to reduce the dimensionality and improve the accuracy of the prediction model. Such methods should be designed with two considerations. First, the model should be practical: when a prediction model is embedded in an evaluation system, a doctor must enter the patient's condition into the prediction platform, and if a large amount of sample data is needed for each prediction, the doctor's burden increases and the model loses practical value. Second, the accuracy of the prediction must be considered: an accurate prediction model can assist doctors' diagnoses and provide suggestions for rehabilitation and for reducing the risk of disease recurrence.
For cancer disease gene data, a large number of methods have therefore been proposed to select the important features, reduce the dimensionality of the data, and improve the accuracy of the prediction model, mainly including single-factor analysis, recursive feature elimination, and feature importance analysis. However, existing methods often suffer from a weak dimensionality-reduction effect, too many redundant features, excessive time complexity, and low practicality. For the feature selection problem on high-dimensional data, an engineering method needs to reduce time complexity while preserving the accuracy of the prediction model.
Disclosure of Invention
The invention aims to provide a cancer gene feature selection method that effectively reduces data dimensionality and improves the accuracy of the prediction model.
The invention solves the technical problems through the following technical scheme:
A method for selecting cancer disease gene features based on historical data comprises the following steps:
step A: divide the cancer disease gene data into a training data set and a test data set;
step B: compute, by five-fold cross-validation, the overall average error rate on the training data set when all features are selected;
step C: randomly generate an initial feature population and construct a fitness function for evaluating feature selection schemes;
step D: record the feature selection schemes in the feature population into the feature tree in turn, adjust the distribution of the schemes to obtain the adjusted feature population, and take the scheme with the smallest fitness value as the optimal feature selection scheme;
step E: guide the evolution direction of the feature population with the genetic operator and the guided-search operator: take the feature population optimized by the feature tree as the parent population and generate the offspring population with the genetic operator; take the position of the optimal feature selection scheme in the feature tree as the search direction and strengthen the local search with the guided-search operator;
step F: test the termination condition; if it is not reached, repeat steps D-F; if it is reached, output the optimal feature selection scheme and verify its classification error rate on the test set.
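For orientation, the control flow of steps A-F can be summarized in a short Python sketch. All names here (bspga, tree_pass, evolve) are hypothetical labels for the routines detailed with the later steps, and the 7:3 split ratio is taken from the embodiment below; this is a skeleton under stated assumptions, not the patented implementation itself.

```python
import numpy as np

def bspga(X, y, pop_size, max_eval, evaluate, tree_pass, evolve, rng):
    """Skeleton of steps A-F; `tree_pass` (step D) and `evolve` (step E)
    are injected callables so this outline stays independent of the
    sketches given with the later steps."""
    n = X.shape[1]
    idx = rng.permutation(len(y))                # step A: 7:3 split of the data
    cut = int(0.7 * len(y))
    train, test = idx[:cut], idx[cut:]
    P = (rng.random((pop_size, n)) < 0.5).astype(np.int8)   # step C: random binary population
    best, t = P[0].copy(), 0
    while t < max_eval:                          # step F: termination test
        P_opt, best, osr, t = tree_pass(P, best, t)         # step D: feature-tree pass
        P = evolve(P, P_opt, best, osr, rng, evaluate)      # step E: genetic + guided search
    return best, test                            # best scheme, verified on the test split
```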
Preferably, the average classification error rate on the training data set when all features are selected is computed in step B by five-fold cross-validation as follows: divide the training data set into five equal parts, and in turn take one part as the test fold and the remaining four as the training folds, obtaining the error rate

erro_k = CE_k / N,

where CE_k is the number of misclassified samples under the feature selection scheme when the k-th part is used as the test fold and N is the total number of samples in that fold. Five error rates are thus obtained, denoted Erro = {erro_1, erro_2, erro_3, erro_4, erro_5}, and the average error rate is

erro_avg = (1/5) · (erro_1 + erro_2 + erro_3 + erro_4 + erro_5).
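As a concrete reading of this step, the following sketch computes the five-fold average error rate for a given feature subset; the 1-nearest-neighbour classifier is an assumption, since the patent does not name the classifier used.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def five_fold_error(X, y, mask):
    """Average classification error of the feature subset `mask` over a
    five-fold split of the training data (step B).  The 1-NN classifier
    is an assumed stand-in; the patent does not fix one."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 1.0  # no feature selected: treat as total error (assumption)
    errs = []
    for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        clf = KNeighborsClassifier(n_neighbors=1).fit(X[tr][:, cols], y[tr])
        ce = np.sum(clf.predict(X[te][:, cols]) != y[te])   # CE_k
        errs.append(ce / len(te))                           # erro_k = CE_k / N
    return float(np.mean(errs))                             # mean of erro_1..erro_5
```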
Preferably, the initial feature population is a two-dimensional matrix, denoted P = {X_1, X_2, …, X_N}, where N is the population size; X_i = {x_i1, x_i2, …, x_in} is a binary string representing one feature selection scheme: x_ij = 1 means that the j-th feature gene of the i-th scheme is selected, and x_ij = 0 means that the corresponding feature is not selected; n is the total number of feature genes.
Preferably, the fitness function f constructed in step C is

f(X_i) = α · erro(X_i) + (1 − α) · N_p / n,

where N_p is the number of features selected in the current feature selection scheme X_i, erro(X_i) is the average error rate of X_i, and α is a variable coefficient that adjusts the weights of the error rate and of the number of selected features in the function. α is updated as a function of the evaluation progress t/MaxEval, where α_t ∈ [0.15, 0.4] is a preset parameter, t is the current number of function evaluations, and MaxEval is the maximum number of function evaluations.
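A direct transcription of this fitness function follows, with the caveat that the original update rule for α is an image not reproduced in the text; the linear ramp between α_t and 1 used below is only one plausible reading.

```python
def alpha_schedule(t, max_eval, alpha_t=0.2):
    # Assumed linear ramp of the weight alpha over t/MaxEval; the exact
    # formula in the patent's image is not recoverable from the text.
    return alpha_t + (1.0 - alpha_t) * t / max_eval

def fitness(mask, err, alpha):
    """f(X_i) = alpha * erro(X_i) + (1 - alpha) * N_p / n: a weighted sum
    of the five-fold error rate and the fraction of selected genes."""
    return alpha * err + (1.0 - alpha) * mask.sum() / mask.size
```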
Preferably, the feature tree in step D is a binary tree, and the method for recording all feature selection schemes of the initial population into the feature tree comprises the following steps:

step i: check the data in the feature tree. If the tree is empty, initialize the function evaluation counter t = 0 and i = 1, input the setting of MaxEval, empty the optimized population P' used to store the optimized feature selection schemes, initialize the pointer node Cur_node (which denotes the feature selection scheme stored at the tree node it points to) to empty, insert the feature selection scheme X_i at the root node, set the optimal feature selection scheme X_best = X_i, store X_i into the optimized population P', and jump to step ii. If the tree is not empty, jump directly to step ii;

step ii: let i = i + 1, point Cur_node at the root node, and initialize the repeat-access flag flag = 0. If Cur_node is a leaf node, jump to step iii. If Cur_node has two child nodes and the current scheme satisfies X_i(depth(Cur_node)) = 1, Cur_node moves to the left child; if X_i(depth(Cur_node)) = 0, Cur_node moves to the right child, where depth(Cur_node) denotes the depth of the pointer node Cur_node and X_i(d) denotes the d-th bit of X_i. Repeat this step until Cur_node is a leaf node;

step iii: if the scheme X_i to be inserted is identical to Cur_node, set flag = 1, flip the bit X_i(depth(Cur_node)), and jump to step iv. If X_i(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), jump to step iv. If X_i(depth(Cur_node)) = Cur_node(depth(Cur_node)), compute the Hamming distance between Cur_node and the current optimal scheme X_best, d_1 = hamming(Cur_node, X_best), and the Hamming distance between X_i and X_best, d_2 = hamming(X_i, X_best). If d_1 < d_2, flip the bit Cur_node(depth(Cur_node)); since Cur_node has changed, re-evaluate it, and if f(Cur_node) < f(X_best), set X_best = Cur_node, otherwise update the function evaluation counter t = t + 1; then jump to step iv. If d_1 > d_2, flip the bit X_i(depth(Cur_node)) and jump to step iv;

step iv: if X_i(depth(Cur_node)) = 1, insert X_i into the feature tree as the left child of Cur_node and insert Cur_node as its own right child; if X_i(depth(Cur_node)) = 0, insert X_i as the right child of Cur_node and insert Cur_node as its own left child;

step v: store X_i into the optimized population P'; if f(X_i) < f(X_best), set X_best = X_i, otherwise leave X_best unchanged and update the evaluation counter by t = t + 1 after the operation. If t ≥ MaxEval or mod(i/N) = 0, feed the optimized population P' and the optimal feature selection scheme X_best together with its depth osr in the feature tree back to the genetic operator and the guided-search operator, respectively, where mod(i/N) denotes the remainder of i/N; otherwise, jump to step i.
Preferably, the method in step E is: take the optimized population P' as the parent population of the genetic operator and perform uniform crossover and standard mutation to obtain the offspring population P''; select part of the feature selection schemes from P'' with the guided-search probability to undergo guided search, replacing the first osr bits of each selected scheme with the first osr bits of the optimal feature selection scheme X_best to obtain the population P'''; merge the initial population P with P''' and use an elite environmental selection strategy to choose the population that enters the next generation. The crossover probability is p_c ∈ [0.5, 0.8], the mutation probability is p_r = 1/n, and the guided-search probability is 0.05 to 0.1.
Preferably, if t < MaxEval, set P to the next-generation population, empty the optimized population P', and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme X_best and verify its classification error rate on the test set.
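A sketch of step E under these parameter ranges follows; the fitness-based elite environmental selection at the end (sort the union by fitness and keep the best N) is an assumed concrete form of the strategy the text names but does not spell out.

```python
import numpy as np

def evolve(P, P_opt, best, osr, rng, evaluate, pc=0.6, pr=None, pg=0.08):
    """Step E: uniform crossover and standard bit-flip mutation on the
    tree-optimized parents P_opt, guided search on part of the offspring,
    then elite environmental selection over the union with P."""
    n = P.shape[1]
    pr = 1.0 / n if pr is None else pr
    children = np.array(P_opt, dtype=np.int8)
    for i in range(0, len(children) - 1, 2):     # uniform crossover, prob. pc
        if rng.random() < pc:
            swap = rng.random(n) < 0.5
            a, b = children[i].copy(), children[i + 1].copy()
            children[i][swap], children[i + 1][swap] = b[swap], a[swap]
    flips = (rng.random(children.shape) < pr).astype(np.int8)
    children ^= flips                            # bit-flip mutation, prob. pr per gene
    guided = rng.random(len(children)) < pg      # guided search, prob. pg
    children[guided, :osr] = best[:osr]          # copy X_best's first osr bits
    union = np.vstack([P, children])             # elite environmental selection
    f = np.array([evaluate(x) for x in union])
    return union[np.argsort(f)[:len(P)]]
```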
The advantages of the historical-data-based cancer gene feature selection method are that it effectively reduces data dimensionality and improves prediction accuracy, and that, by combining a feature tree with a genetic algorithm, it screens genes related to diseases such as cancer, thereby assisting diagnosis and treatment.
Drawings
FIG. 1 is a flow chart of the historical-data-based cancer disease gene feature selection method provided by an embodiment of the invention;
FIG. 2 is a flow chart of the algorithm for optimizing the population through the feature tree according to an embodiment of the invention;
FIG. 3 is a first schematic diagram of the method for optimizing the population through the feature tree according to an embodiment of the invention;
FIG. 4 is a second schematic diagram of the method for optimizing the population through the feature tree according to an embodiment of the invention;
FIG. 5 is a third schematic diagram of the method for optimizing the population through the feature tree according to an embodiment of the invention;
FIG. 6 is a fourth schematic diagram of the method for optimizing the population through the feature tree according to an embodiment of the invention.
Detailed Description
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the present embodiment provides a method for selecting cancer disease gene features based on historical data, comprising the following steps.
Step A: divide the cancer disease gene data into a training data set and a test data set; here the cancer disease gene data are divided into ten parts, seven of which serve as the training data set and three as the test data set.
Step B: compute, by five-fold cross-validation, the overall average error rate on the training data set when all features are selected. Specifically: divide the training data set into five equal parts, and in turn take one part as the test fold and the remaining four as the training folds, obtaining the error rate

erro_k = CE_k / N,

where CE_k is the number of misclassified samples under the feature selection scheme when the k-th part is used as the test fold and N is the total number of samples in that fold. Five error rates are obtained, denoted Erro = {erro_1, erro_2, erro_3, erro_4, erro_5}, and the average error rate is

erro_avg = (1/5) · (erro_1 + erro_2 + erro_3 + erro_4 + erro_5).
The above steps are prior art in the field of processing sample data with genetic algorithms and are not described in detail here.
Step C: randomly generate an initial feature population. The initial feature population is a two-dimensional matrix, denoted P = {X_1, X_2, …, X_N}, where N is the population size; X_i = {x_i1, x_i2, …, x_in} is a binary string representing one feature selection scheme: x_ij = 1 means that the j-th feature gene of the i-th scheme is selected, and x_ij = 0 means that the corresponding feature is not selected; n is the total number of feature genes.
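In code, this initialization is one line; the sketch below assumes NumPy's int8 representation for the binary strings.

```python
import numpy as np

def init_population(N, n, rng):
    """Step C: N random binary strings of length n; P[i, j] == 1 means
    the j-th feature gene is selected in scheme X_i."""
    return (rng.random((N, n)) < 0.5).astype(np.int8)
```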
Construct the fitness function f for evaluating feature selection schemes:

f(X_i) = α · erro(X_i) + (1 − α) · N_p / n,

where N_p is the number of features selected in the current feature selection scheme X_i, erro(X_i) is the average error rate of X_i, and α is a variable coefficient that adjusts the weights of the error rate and of the number of selected features in the function; α is updated as a function of the evaluation progress t/MaxEval, where α_t ∈ [0.15, 0.4] is a preset parameter, t is the current number of function evaluations, and MaxEval is the maximum number of function evaluations.
Step D: record the feature selection schemes in the feature population into the feature tree in turn, adjust the distribution of the schemes to obtain the adjusted feature population, and take the scheme with the smallest fitness value as the optimal feature selection scheme. The feature tree is a binary tree, and the procedure specifically comprises the following steps:

step i: check the data in the feature tree. If the tree is empty, initialize the function evaluation counter t = 0 and i = 1, input the setting of MaxEval, empty the optimized population P' used to store the optimized feature selection schemes, initialize the pointer node Cur_node (which denotes the feature selection scheme stored at the tree node it points to) to empty, insert the feature selection scheme X_i at the root node, set the optimal feature selection scheme X_best = X_i, store X_i into the optimized population P', and jump to step ii. If the tree is not empty, jump directly to step ii;

step ii: let i = i + 1, point Cur_node at the root node, and initialize the repeat-access flag flag = 0. If Cur_node is a leaf node, jump to step iii. If Cur_node has two child nodes and the current scheme satisfies X_i(depth(Cur_node)) = 1, Cur_node moves to the left child; if X_i(depth(Cur_node)) = 0, Cur_node moves to the right child, where depth(Cur_node) denotes the depth of the pointer node Cur_node and X_i(d) denotes the d-th bit of X_i. Repeat this step until Cur_node is a leaf node;

step iii: if the scheme X_i to be inserted is identical to Cur_node, set flag = 1, flip the bit X_i(depth(Cur_node)), and jump to step iv. If X_i(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), jump to step iv. If X_i(depth(Cur_node)) = Cur_node(depth(Cur_node)), compute the Hamming distance between Cur_node and the current optimal scheme X_best, d_1 = hamming(Cur_node, X_best), and the Hamming distance between X_i and X_best, d_2 = hamming(X_i, X_best). If d_1 < d_2, flip the bit Cur_node(depth(Cur_node)); since Cur_node has changed, re-evaluate it, and if f(Cur_node) < f(X_best), set X_best = Cur_node, otherwise update the function evaluation counter t = t + 1; then jump to step iv. If d_1 > d_2, flip the bit X_i(depth(Cur_node)) and jump to step iv;

step iv: if X_i(depth(Cur_node)) = 1, insert X_i into the feature tree as the left child of Cur_node and insert Cur_node as its own right child; if X_i(depth(Cur_node)) = 0, insert X_i as the right child of Cur_node and insert Cur_node as its own left child;

step v: store X_i into the optimized population P'; if f(X_i) < f(X_best), set X_best = X_i, otherwise leave X_best unchanged and update the evaluation counter by t = t + 1 after the operation. If t ≥ MaxEval or mod(i/N) = 0, feed the optimized population P' and the optimal feature selection scheme X_best together with its depth osr in the feature tree back to the genetic operator and the guided-search operator, respectively, where mod(i/N) denotes the remainder of i/N; if t < MaxEval, jump to step i.
Step E: guide the evolution direction of the feature population with the genetic operator and the guided-search operator: take the optimized population P' as the parent population of the genetic operator and perform uniform crossover and standard mutation to obtain the offspring population P''; select part of the feature selection schemes from P'' with the guided-search probability to undergo guided search, replacing the first osr bits of each selected scheme with the first osr bits of the optimal feature selection scheme X_best to obtain the population P'''; merge the initial population P with P''' and use an elite environmental selection strategy to choose the population that enters the next generation. The crossover probability is p_c ∈ [0.5, 0.8], the mutation probability is p_r = 1/n, and the guided-search probability is 0.05 to 0.1.
Step F: test the termination condition. If t < MaxEval, set P to the next-generation population, empty the optimized population P', and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme X_best and verify its classification error rate on the test set.
The data in Table 1 are used as the initial population P to illustrate concretely how the feature tree adjusts the arrangement of the feature selection schemes:

[Table 1 (image not reproduced): the initial population, five six-bit feature selection schemes X_1 to X_5 and their fitness values]

Table 1: initial population
First the parameters are set: from Table 1, N = 5, so the mutation probability is p_r = 1/6; the guided-search probability is 0.05, and we let p_c = 0.5 and α_t = 0.2. An empty feature tree is selected, the function evaluation counter is initialized to t = 0 with i = 1, and MaxEval = 10. Referring to FIG. 3, X_1 is inserted at the root node of the empty feature tree and the optimal feature selection scheme is set to X_best = X_1, with f(X_best) = 5 at this point; X_1 is stored into the empty optimized population P'.
Inserting feature selection scheme X_2:

In step ii, i = 2 and the pointer node Cur_node points to the root node, i.e. Cur_node = X_1, with the repeat-access flag flag = 0. Cur_node is a leaf node, so the procedure jumps to step iii. The scheme X_2 to be inserted differs from Cur_node, so flag = 0; here depth(Cur_node) = 1 and X_2(depth(Cur_node)) = 1, which equals Cur_node(depth(Cur_node)) = 1, so the Hamming distances of Cur_node and X_2 to X_best are computed; as shown in FIG. 4, d_1 = 0 and d_2 = 4. Since d_1 < d_2, the bit Cur_node(depth(Cur_node)) is flipped, i.e. x_11 = 0. Because the pointer node Cur_node has changed, its fitness must be re-evaluated: now f(Cur_node) = f(X_best) = 2 (Cur_node coincides with X_best), X_best is unchanged, and the evaluation counter is updated to t = 1; the procedure jumps to step iv. Since X_2(depth(Cur_node)) = 1, X_2 is inserted into the feature tree as the left child of Cur_node and the modified Cur_node as its right child, and X_2 is stored into the optimized population P'. Now f(X_best) = 2 < f(X_2) = 4, so X_best is unchanged and the counter is updated to t = 2; since t < MaxEval and mod(i/N) ≠ 0, the procedure jumps to step i and continues the insertion.
Inserting feature selection scheme X_3:

Referring to FIG. 5, the feature tree is not empty, so step ii is executed with i = 3. Cur_node points to the root node, which has two children; since X_3(depth(Cur_node)) = x_31 = 1, the pointer moves to the left child, i.e. Cur_node = X_2, which is a leaf node, so step iii is executed. X_3 differs from Cur_node, so flag = 0; here X_3(depth(Cur_node)) = x_32 = 0 equals Cur_node(depth(Cur_node)) = x_22 = 0, so the Hamming distances of Cur_node and X_3 to X_best are computed, giving d_1 = 4 > d_2 = 2. The bit X_3(depth(Cur_node)) is therefore flipped, i.e. x_32 = 1, and step iv is executed: X_3 is inserted into the feature tree as the left child of Cur_node and Cur_node as its own right child. X_3 is stored into the population P'. X_best has not changed and its fitness need not be re-evaluated, so f(X_best) = 2; for the inserted X_3, f(X_best) = 2 < f(X_3) = 5, so X_best is unchanged and the counter is updated to t = 3. Since t < MaxEval and mod(i/N) ≠ 0, the procedure jumps to step i and continues the insertion.
Inserting feature selection scheme X_4:

Referring to FIG. 6, the feature tree is not empty, so step ii is executed with i = 4. Cur_node points to the root node, which has two children; since X_4(depth(Cur_node)) = x_41 = 1, the pointer moves to the left child, i.e. Cur_node = X_2. This node still has two children, so step ii continues: now X_4(depth(Cur_node)) = x_42 = 0, so the pointer moves to the right child, i.e. the copy of Cur_node = X_2, which is a leaf node, and step iii is executed. From Table 1, X_4 = Cur_node at this point, i.e. flag = 1, so the bit X_4(depth(Cur_node)) is flipped, i.e. x_43 = 1, and the procedure jumps to step iv: the updated X_4 is inserted into the feature tree as the left child of Cur_node and Cur_node as its own right child. X_4 is stored into the population P'; now f(X_best) = 2 > f(X_4) = 1, so X_best = X_4 and the evaluation counter remains t = 3 (it is incremented only when X_best is not improved). Since t < MaxEval and mod(i/N) ≠ 0, the procedure jumps to step i and continues the insertion.
Inserting feature selection scheme X_5:

Referring to FIG. 7, the feature tree is not empty, so step ii is executed with i = 5. Cur_node points to the root node, which has two children; since X_5(depth(Cur_node)) = x_51 = 0, the pointer moves to the right child, i.e. Cur_node = X_1, which is a leaf node, so step iii is executed. Because Cur_node and X_5 differ, flag = 0; here Cur_node(depth(Cur_node)) = x_12 = 0 while X_5(depth(Cur_node)) = x_52 = 1, so step iv is executed directly: X_5 is inserted into the feature tree as the left child of Cur_node and Cur_node as its own right child. X_5 is stored into the population P'; now f(X_best) = 1 < f(X_5) = 6, so X_best is unchanged and the counter is updated to t = 4. At this point mod(i/N) = 0, so the optimized population P' and the optimal feature selection scheme X_best together with its depth osr in the feature tree are fed back to the genetic operator and the guided-search operator, respectively; the population P' is evolved with the uniform crossover and standard mutation operations of the existing genetic algorithm, and the guided search steers the population evolution to obtain the next-generation population. Since t < MaxEval, P is set to the next-generation population, the optimized population P' is emptied, and the procedure returns to step D, continuing the insertion operations until the termination condition is met.
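Replaying this walk-through with the sketches above: the population bits of Table 1 are not reproduced in the text, so the strings below are illustrative only, and a toy fitness stand-in replaces the real f.

```python
import numpy as np

rng = np.random.default_rng(1)
evaluate = lambda x: int(x.sum())        # toy stand-in for the fitness f
P = init_population(N=5, n=6, rng=rng)   # N = 5 six-bit schemes, as in Table 1
tree = FeatureTree()
state = {'t': 0, 'P_opt': []}
best = P[0].copy()
for x in P:                              # steps i-v for each scheme
    best = tree.insert(x.copy(), best, evaluate, state)
print(state['t'], best)                  # evaluation counter and X_best so far
```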
This embodiment verifies the effectiveness of the proposed algorithm using data sets from the scikit-feature open-source feature selection repository of Arizona State University. As shown in Table 2, the colon cancer data set Colon, the lung cancer data set Lung, and the glioma data set Glioma are selected for verification, with the classification error rate and the number of selected features as evaluation indices; the lower the error rate and the smaller the number of features, the better the algorithm's performance.
Data set | Number of features | Number of samples | Number of classes
Colon | 2000 | 62 | 2
Lung | 3312 | 203 | 5
Glioma | 4434 | 50 | 4

Table 2: Cancer disease gene data
The classical genetic algorithm (GA) and particle swarm optimization (PSO) are selected for a simulation comparison with the algorithm provided by this embodiment, here named BSPGA; the results are shown in Table 3 and indicate that the performance of the proposed algorithm is clearly superior to that of the GA and PSO algorithms.
[Table 3 (image not reproduced): simulation comparison of BSPGA with the GA and PSO algorithms on the three data sets]

Table 3: Comparison of simulation experiments
The effectiveness and usage scenarios of the algorithm are described in this application starting from cancer data, but the algorithm is equally effective for other diseases related to genetic variation.
The embodiments above further describe the objects, technical solutions, and advantages of the invention in detail. It should be understood that they are only examples of the invention and are not intended to limit it; any modifications, equivalent substitutions, improvements, and the like made by those skilled in the art without departing from the spirit and principles of the invention fall within the protection scope defined by the claims.

Claims (6)

1. A method for selecting cancer disease gene features based on historical data, comprising the following steps:
step A: dividing the cancer disease gene data into a training data set and a test data set;
step B: computing, by five-fold cross-validation, the overall average error rate on the training data set when all cancer disease gene features are selected;
step C: randomly generating an initial cancer disease gene feature population and constructing a fitness function for evaluating cancer disease gene feature selection schemes;
step D: recording the feature selection schemes in the cancer disease gene feature population into a cancer disease gene feature tree in turn, adjusting the distribution of the schemes to obtain the adjusted feature population, and taking the scheme with the smallest fitness value as the optimal feature selection scheme;
step E: guiding the evolution direction of the cancer disease gene feature population with the genetic operator and the guided-search operator: taking the feature population optimized by the feature tree as the parent population and generating the offspring population with the genetic operator; taking the position of the optimal feature selection scheme in the feature tree as the search direction and strengthening the local search with the guided-search operator;
step F: testing the termination condition; if it is not met, repeating steps D-F; if it is met, outputting the optimal feature selection scheme and verifying its classification error rate on the test set;
in step D, the cancer disease gene feature tree is a binary tree, and the method of recording all feature selection schemes of the initial cancer disease gene feature population into the feature tree comprises the following steps:
step i: checking the data in the feature tree; if the tree is empty, initializing the function evaluation counter t = 0 and i = 1, inputting the setting of MaxEval, emptying the optimized population P' used to store the optimized feature selection schemes, initializing the pointer node Cur_node, which denotes the feature selection scheme stored at the tree node it points to, to empty, inserting the feature selection scheme Xi at the root node, setting the optimal feature selection scheme Xbest = Xi, storing Xi into the optimized population P', and jumping to step ii; if the tree is not empty, jumping directly to step ii;
step ii: letting i = i + 1, pointing Cur_node at the root node, and initializing the repeat-access flag flag = 0; if Cur_node is a leaf node, jumping to step iii; if Cur_node has two child nodes and the current scheme satisfies Xi(depth(Cur_node)) = 1, Cur_node moves to the left child, and if Xi(depth(Cur_node)) = 0, Cur_node moves to the right child, where depth(Cur_node) denotes the depth of the pointer node Cur_node; repeating this step until Cur_node is a leaf node;
step iii: if the scheme Xi to be inserted is identical to Cur_node, setting flag = 1, flipping the bit Xi(depth(Cur_node)), and jumping to step iv; if Xi(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), jumping to step iv; if Xi(depth(Cur_node)) = Cur_node(depth(Cur_node)), computing the Hamming distance d1 = hamming(Cur_node, Xbest) between Cur_node and the current optimal scheme Xbest, and the Hamming distance d2 = hamming(Xi, Xbest) between Xi and Xbest; if d1 < d2, flipping the bit Cur_node(depth(Cur_node)), and, since Cur_node has changed, re-evaluating it: if f(Cur_node) < f(Xbest), setting Xbest = Cur_node, otherwise updating the function evaluation counter t = t + 1; then jumping to step iv; if d1 > d2, flipping the bit Xi(depth(Cur_node)) and jumping to step iv;
step iv: if Xi(depth(Cur_node)) = 1, inserting Xi into the feature tree as the left child of Cur_node and inserting Cur_node as its own right child; if Xi(depth(Cur_node)) = 0, inserting Xi as the right child of Cur_node and inserting Cur_node as its own left child;
step v: storing Xi into the optimized population P'; if f(Xi) < f(Xbest), setting Xbest = Xi, otherwise leaving Xbest unchanged and updating the evaluation counter by t = t + 1 after the operation; if t ≥ MaxEval or mod(i/N) = 0, feeding the optimized population P' and the optimal feature selection scheme Xbest together with its depth osr in the feature tree back to the genetic operator and the guided-search operator, respectively, where mod(i/N) denotes the remainder of i/N; otherwise, jumping to step i.
2. The method for selecting cancer disease gene features based on historical data according to claim 1, wherein the average classification error rate on the training data set when all cancer disease gene features are selected is computed in step B by five-fold cross-validation as follows:
dividing the training data set into five equal parts, and in turn taking one part as the test fold and the remaining four as the training folds, obtaining the error rate

errok = CEk / N,

where CEk is the number of misclassified samples under the feature selection scheme when the k-th part is used as the test fold and N is the total number of samples in that fold; five error rates are thus obtained, denoted Erro = {erro1, erro2, erro3, erro4, erro5}, and the average error rate is

erroavg = (1/5) · (erro1 + erro2 + erro3 + erro4 + erro5).
3. The method for selecting cancer disease gene features based on historical data according to claim 2, wherein: the initial cancer disease gene feature population is a two-dimensional matrix, denoted P = {X1, X2, …, XN}, where N is the population size; Xi = {xi1, xi2, …, xin} is a binary string representing one feature selection scheme; xij = 1 means that the j-th feature gene of the i-th scheme is selected, and xij = 0 means that the corresponding feature is not selected; n is the total number of feature genes.
4. The method for selecting cancer disease gene features based on historical data according to claim 3, wherein the fitness function f constructed in step C is

f(Xi) = α · erro(Xi) + (1 − α) · Np / n,

where Np is the number of features selected in the current feature selection scheme Xi, erro(Xi) is the average error rate of Xi, and α is a variable coefficient that adjusts the weights of the error rate and of the number of selected features in the function; α is updated as a function of the evaluation progress t/MaxEval, where αt ∈ [0.15, 0.4] is a preset parameter, t is the current number of function evaluations, and MaxEval is the maximum number of function evaluations.
5. The method for selecting cancer disease gene features based on historical data according to claim 1, wherein the method in step E is: taking the optimized population P' as the parent population of the genetic operator and performing uniform crossover and standard mutation to obtain the offspring population P''; selecting part of the feature selection schemes from P'' with the guided-search probability to undergo guided search, replacing the first osr bits of each selected scheme with the first osr bits of the optimal feature selection scheme Xbest to obtain the population P'''; merging the initial population P with P''' and using an elite environmental selection strategy to choose the population that enters the next generation; wherein the crossover probability is pc ∈ [0.5, 0.8], the mutation probability is pr = 1/n, and the guided-search probability is 0.05 to 0.1.
6. The method for selecting cancer disease gene features based on historical data according to claim 5, wherein: if t < MaxEval, P is set to the next-generation population, the optimized population P' is emptied, and the procedure returns to step D; if t ≥ MaxEval, the optimal feature selection scheme Xbest is output and its classification error rate is verified on the test set.
CN201910355711.3A 2019-04-29 2019-04-29 Historical data-based cancer disease gene characteristic selection method Active CN110070916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910355711.3A CN110070916B (en) 2019-04-29 2019-04-29 Historical data-based cancer disease gene characteristic selection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910355711.3A CN110070916B (en) 2019-04-29 2019-04-29 Historical data-based cancer disease gene characteristic selection method

Publications (2)

Publication Number Publication Date
CN110070916A CN110070916A (en) 2019-07-30
CN110070916B true CN110070916B (en) 2023-04-18

Family

ID=67369518

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910355711.3A Active CN110070916B (en) 2019-04-29 2019-04-29 Historical data-based cancer disease gene characteristic selection method

Country Status (1)

Country Link
CN (1) CN110070916B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111243662B (en) * 2020-01-15 2023-04-21 云南大学 Method, system and storage medium for predicting genetic pathway of pan-cancer based on improved XGBoost
CN112580606B (en) * 2020-12-31 2022-11-08 安徽大学 Large-scale human body behavior identification method based on clustering grouping

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009067655A2 (en) * 2007-11-21 2009-05-28 University Of Florida Research Foundation, Inc. Methods of feature selection through local learning; breast and prostate cancer prognostic markers
CN109242100A (en) * 2018-09-07 2019-01-18 浙江财经大学 A kind of Niche Genetic method on multiple populations for feature selecting

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
范方云; 孙俊. Cancer characteristic gene selection and classification based on the BQPSO algorithm [基于BQPSO算法的癌症特征基因选择与分类]. Journal of Jiangnan University (Natural Science Edition), 2015, (01), full text. *

Also Published As

Publication number Publication date
CN110070916A (en) 2019-07-30

Similar Documents

Publication Publication Date Title
Ching et al. Cox-nnet: an artificial neural network method for prognosis prediction of high-throughput omics data
Tong et al. Deep learning based feature-level integration of multi-omics data for breast cancer patients survival analysis
Boulesteix et al. IPF-LASSO: integrative L1-penalized regression with penalty factors for prediction based on multi-omics data
AU2016209478B2 (en) Systems and methods for response prediction to chemotherapy in high grade bladder cancer
KR20210018333A (en) Method and apparatus for multimodal prediction using a trained statistical model
Zhang et al. DeepHE: Accurately predicting human essential genes based on deep learning
JP2020501240A (en) Methods and systems for predicting DNA accessibility in pan-cancer genomes
Lu et al. Predicting human lncRNA-disease associations based on geometric matrix completion
CN110070916B (en) Historical data-based cancer disease gene characteristic selection method
Liang et al. A deep learning framework to predict tumor tissue-of-origin based on copy number alteration
Wang et al. Multiple surrogates and offspring-assisted differential evolution for high-dimensional expensive problems
Fu et al. An improved multi-objective marine predator algorithm for gene selection in classification of cancer microarray data
Zhang et al. MaLAdapt reveals novel targets of adaptive introgression from Neanderthals and Denisovans in worldwide human populations
Han et al. Inverse-weighted survival games
CN110009128A (en) Industry public opinion index prediction technique, device, computer equipment and storage medium
Li et al. Assisted gene expression‐based clustering with AWNCut
Hem et al. Robust modeling of additive and nonadditive variation with intuitive inclusion of expert knowledge
Vidyasagar Probabilistic methods in cancer biology
JP7490576B2 (en) Method and apparatus for multimodal prediction using trained statistical models - Patents.com
Kasianov et al. Interspecific comparison of gene expression profiles using machine learning
Wu et al. A practical algorithm based on particle swarm optimization for haplotype reconstruction
Mapelli Multi-outcome feature selection via anomaly detection autoencoders: an application to radiogenomics in breast cancer patients
Ng High-dimensional varying-coefficient models for genomic studies
Qiu Imputation and Predictive Modeling with Biomedical Multi-Scale Data
Strauch Improving diagnosis of genetic disease through computational investigation of splicing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant