CN110070916A - A kind of Cancerous disease gene expression characteristics selection method based on historical data - Google Patents
- Publication number
- CN110070916A (application CN201910355711.3A)
- Authority
- CN
- China
- Prior art keywords
- node
- cur
- feature
- population
- scheme
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2111—Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a history-data-based feature selection method for cancer gene expression data, comprising the following steps. A: divide the cancer gene data into a training set and a test set. B: compute the overall average error rate on the training set with all features selected. C: generate an initial population and construct a fitness function. D: record all feature selection schemes in a feature tree, adjust the distribution of the schemes, take the scheme with the smallest fitness value as the optimal feature selection scheme, and return the result to the genetic operator and the guided search operator. E: guide the evolutionary direction of the feature population. F: check the termination condition; if it is not met, repeat steps D~F, and if it is met, output the optimal solution. The invention has the advantages of effectively reducing data dimensionality and improving prediction accuracy; by combining the feature tree with a genetic algorithm, genes associated with cancer and other diseases are screened, providing support for diagnosis and treatment.
Description
Technical field
The present invention relates to the field of disease-causing gene screening, and in particular to a history-data-based feature selection method for cancer gene expression data.
Background art
Cancer, the most common malignant tumor in humans, seriously affects people's physical and mental health and has drawn great attention from experts and scholars in many fields. Cancer patients generate large amounts of clinical data during examination, treatment, and medication, and these data are of great significance for predicting the occurrence and development of malignant tumors. However, they are often high-dimensional, heterogeneous, corrupted, and small-sample in character.
With the rapid development of computer technology, it is natural to process such complex data by computer. By building suitable machine learning prediction models, the occurrence and development of cancer can be predicted, providing technical guidance and advisory opinions for patients' rehabilitation and postoperative care. Because cancer gene data are high-dimensional and redundant and prone to abnormal or missing values, the data usually need to be cleaned before processing to remove irrelevant and redundant entries.
Simple data cleaning, however, cannot effectively solve the high dimensionality of cancer gene data. For high-dimensional clinical data, specific engineering methods and targeted algorithms are needed to reduce dimensionality and improve prediction accuracy. Such methods should be designed with two aspects in mind. First, the model should be practical: the prediction model is embedded in an assessment system, and the doctor enters the patient's condition into the prediction platform; if prediction requires a large amount of sample data, it increases the doctor's burden and loses practical value. Second, prediction accuracy must be considered: an accurate prediction model can assist the doctor's diagnosis and provide suggestions for patient rehabilitation and for reducing the risk of relapse.
Accordingly, a large number of methods have been proposed for cancer gene data that select informative feature data, reduce the dimensionality of the data, and improve the accuracy of the prediction model, mainly including single-factor analysis, recursive feature elimination, and feature importance analysis. However, existing methods often suffer from weak dimensionality reduction, too many redundant features, excessive time complexity, and limited practicality. For the feature selection problem on high-dimensional data, an engineering method must reduce the time complexity of the method while guaranteeing the accuracy of the prediction model.
Summary of the invention
The technical problem to be solved by the present invention is to provide a cancer gene feature selection method that effectively reduces data dimensionality and improves the accuracy of the prediction model.
The present invention solves the above technical problem through the following technical scheme:
A history-data-based feature selection method for cancer gene expression data, comprising the following steps:
Step A: divide the cancer gene data into a training data set and a test data set;
Step B: compute, by five-fold cross-validation on the training data set, the overall average error rate with all features selected;
Step C: randomly generate an initial feature population and construct a fitness function for evaluating feature selection schemes;
Step D: record the feature selection schemes of the feature population in a feature tree one by one, adjust the distribution of the schemes to obtain the adjusted feature population, and take the scheme with the smallest fitness value as the optimal feature selection scheme;
Step E: guide the evolutionary direction of the feature population with the genetic operator and the guided search operator: use the feature population optimized by the feature tree as the parent population and apply the genetic operator to generate the offspring population; use the position of the optimal feature selection scheme in the feature tree as the search direction and apply the guided search operator to strengthen the local search;
Step F: check the termination condition; if it is not met, repeat steps D~F; if it is met, output the optimal feature selection scheme and verify its classification error rate on the test set.
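As a rough illustration of how steps C~F fit together, the following sketch runs a minimal genetic loop over binary selection schemes. Everything here is a simplified stand-in: `mock_error` replaces the cross-validated error rate of step B, the feature tree of step D is omitted, the fitness weighting is an assumed form, and the elitist selection is reduced to a sort; none of the names come from the patent.

```python
import random

random.seed(0)
N, n, MAX_EVAL = 10, 8, 200    # population size, gene count, evaluation budget

def mock_error(x):
    # Stand-in for the cross-validated error rate of a scheme: pretend
    # genes 0-3 are informative and genes 4-7 are noise.
    return (4 - sum(x[:4]) + sum(x[4:])) / 8

def fitness(x, alpha=0.2):
    # Smaller is better: weighted mix of error rate and selected-feature ratio.
    return (1 - alpha) * mock_error(x) + alpha * sum(x) / n

pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(N)]   # step C
best = min(pop, key=fitness)
evals = N
while evals < MAX_EVAL:                                             # steps D-F
    children = []
    for _ in range(N):
        a, b = random.sample(pop, 2)
        c = [a[j] if random.random() < 0.5 else b[j] for j in range(n)]  # uniform crossover
        c = [g ^ (random.random() < 1 / n) for g in c]                   # bitwise mutation
        children.append(c)
        evals += 1
    pop = sorted(pop + children, key=fitness)[:N]    # elitist-style selection
    best = min([best, pop[0]], key=fitness)
```

With the mock error above, the loop tends to keep the four informative genes and drop the noisy ones, which is the qualitative behaviour the method aims for.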
Preferably, the method of computing, in step B, the average classification error rate on the training data set with all features selected by five-fold cross-validation is as follows:
divide the training data set into five parts; take each part in turn as the test fold and the other four parts as the training fold, and obtain the error rate erro_k = CE_k / N, where CE_k is the number of misclassified samples when the k-th part serves as the test fold and N is the total number of samples in that fold; the five error rates so obtained are denoted Erro = {erro_1, erro_2, erro_3, erro_4, erro_5}, and the overall average error rate is computed as their mean, err = (1/5) * (erro_1 + erro_2 + erro_3 + erro_4 + erro_5).
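The five-fold procedure above can be sketched as follows. The majority-class `predict` is a placeholder for whichever classifier is trained with all features selected (the patent does not fix one here), and the labels are synthetic.

```python
import random

random.seed(1)
labels = [random.randint(0, 1) for _ in range(50)]   # synthetic sample labels

def predict(train):
    # Placeholder classifier: predict the majority class of the training fold.
    return int(sum(train) * 2 >= len(train))

folds = [labels[k::5] for k in range(5)]             # five roughly equal parts
erro = []
for k in range(5):
    test = folds[k]                                   # k-th part as test fold
    train = [y for j in range(5) if j != k for y in folds[j]]
    p = predict(train)
    ce_k = sum(1 for y in test if y != p)             # CE_k: misclassified samples
    erro.append(ce_k / len(test))                     # erro_k = CE_k / N
avg_err = sum(erro) / 5                               # overall average error rate
```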
Preferably, the initial feature population is a two-dimensional matrix, expressed as P = {X_1, X_2, ..., X_N}, where N is the population size; X_i = {x_i1, x_i2, ..., x_in} is a binary string representing one feature selection scheme; x_ij = 1 indicates that the j-th gene feature of the i-th scheme is selected, and x_ij = 0 indicates that the corresponding feature is not selected; n is the total number of gene features.
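A minimal sketch of this encoding, with sizes chosen arbitrarily:

```python
import random

random.seed(2)
N, n = 5, 6          # population size and total number of gene features
# P is an N x n binary matrix; row i is scheme X_i, and P[i][j] == 1
# means gene feature j is selected in scheme i.
P = [[random.randint(0, 1) for _ in range(n)] for _ in range(N)]
selected = [j for j, bit in enumerate(P[0]) if bit == 1]   # genes chosen by X_1
```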
Preferably, the fitness function f constructed in step C is a weighted combination of the average error rate of the scheme and the proportion of features it selects: for the current scheme X_i, f depends on N_p, the number of features selected in X_i, on err_{X_i}, the average error rate of X_i, and on α, a variable coefficient that adjusts the weight between the error rate and the number of selected features; α is computed from the preset parameter α_t ∈ [0.15, 0.4], the current evaluation count t, and the maximum function evaluation count MaxEval.
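The exact expression of f is not legible in the source, so the weighted-sum combination below is an interpretation consistent with the description (error rate traded against the fraction of selected features, smaller being better):

```python
def fitness(x, err_x, alpha):
    """Assumed weighted-sum fitness; smaller is better.

    x      -- binary feature selection scheme
    err_x  -- average error rate of the scheme
    alpha  -- weight trading the error rate against the feature count
    """
    n_p = sum(x)                    # N_p: number of selected features
    n = len(x)                      # total number of gene features
    return (1 - alpha) * err_x + alpha * n_p / n

value = fitness([1, 0, 1, 1], err_x=0.25, alpha=0.2)   # 0.8*0.25 + 0.2*0.75
```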
Preferably, the feature tree in step D is a binary tree, and the method of recording all feature selection schemes of the initial population in the feature tree comprises the following steps:
Step i: inspect the data in the feature tree. If the tree is empty, initialize the function evaluation count t = 0 and i = 1, input the setting of MaxEval, empty the optimized population P' used to store the optimized feature selection schemes, initialize the pointer node Cur_node (which denotes the feature selection scheme stored at the tree node currently pointed to) to empty, insert scheme X_i at the root node, set the optimal feature selection scheme X_best = X_i, store X_i into the optimized population P', and go to step ii. If the tree is not empty, go directly to step ii.
Step ii: let i = i + 1, point Cur_node at the root node, and initialize the repeated-access flag flag = 0. If Cur_node is a leaf node, go to step iii. If Cur_node has two child nodes, then if X_i(depth(Cur_node)) = 1 point Cur_node at the left child, and if X_i(depth(Cur_node)) = 0 point it at the right child, where depth(Cur_node) denotes the depth of Cur_node; repeat this step until Cur_node is a leaf node.
Step iii: if the scheme X_i being inserted is identical to Cur_node, set flag = 1, flip the bit of X_i at position depth(Cur_node), and go to step iv. If X_i(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), go to step iv. If X_i(depth(Cur_node)) = Cur_node(depth(Cur_node)), compute the Hamming distance d1 = hamming(Cur_node, X_best) between Cur_node and the current optimal scheme X_best, and the Hamming distance d2 = hamming(X_i, X_best) between the inserted scheme X_i and X_best.
If d1 < d2, flip the bit of Cur_node at position depth(Cur_node) and re-evaluate its fitness; if f(Cur_node) < f(X_best), set X_best = Cur_node, otherwise leave X_best unchanged; update the evaluation count t = t + 1 and go to step iv. If d1 > d2, flip the bit of X_i at position depth(Cur_node) and go to step iv.
Step iv: if X_i(depth(Cur_node)) = 1, insert X_i into the feature tree as the left child of Cur_node and insert Cur_node as its own right child; if X_i(depth(Cur_node)) = 0, insert X_i as the right child of Cur_node and insert Cur_node as its own left child.
Step v: store X_i into the optimized population P'; if f(X_i) < f(X_best), set X_best = X_i, otherwise leave it unchanged, and update t = t + 1 after the evaluation. If t ≥ MaxEval or mod(i/N) = 0, feed the optimized population P', the optimal scheme X_best, and its depth osr in the feature tree back to the genetic operator and the guided search operator respectively, where mod(i/N) denotes the remainder of i divided by N; otherwise, go to step i.
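The routing-and-split behaviour of steps i~iv can be condensed into the sketch below. It keeps only the structural part: bit 1 routes left and bit 0 routes right, and when a leaf is reached the old and new schemes become its two children. The Hamming-distance arbitration against X_best and the fitness bookkeeping of steps iii and v are reduced here to always flipping the incoming scheme's bit, so this illustrates the tree shape, not the full operator.

```python
class Node:
    def __init__(self, scheme):
        self.scheme = scheme        # selection scheme stored at this node
        self.left = None            # child taken when the routing bit is 1
        self.right = None           # child taken when the routing bit is 0

def insert(root, x):
    if root is None:
        return Node(list(x))
    cur, depth = root, 0
    while cur.left is not None:     # walk down to a leaf (bit 1 -> left)
        cur = cur.left if x[depth] == 1 else cur.right
        depth += 1
    if x[depth] == cur.scheme[depth]:
        # duplicate or clashing routing bit: flip the newcomer's bit at
        # the leaf depth (the patent arbitrates via Hamming distances)
        x = list(x)
        x[depth] ^= 1
    new, old = Node(list(x)), Node(cur.scheme)
    if x[depth] == 1:               # new scheme left, old leaf right
        cur.left, cur.right = new, old
    else:
        cur.left, cur.right = old, new
    return root

def leaves(node):
    if node.left is None:
        return [node.scheme]
    return leaves(node.left) + leaves(node.right)

population = [[1, 1, 0, 0, 1, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 0],   # identical to the previous input scheme
              [0, 1, 0, 1, 1, 1]]
root = None
for scheme in population:
    root = insert(root, scheme)     # each insertion adds one new leaf
```

Each insertion turns one leaf into an internal node with two leaves, so after inserting four schemes the tree holds four leaves.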
Preferably, the method of step E is as follows: use the optimized population P' as the parent population of the genetic operator and perform uniform crossover and bitwise mutation to obtain the offspring population P''; select part of the feature selection schemes from P'' with the guided search probability and perform the guided search, replacing the first osr values of each selected scheme with the first osr values of the optimal scheme X_best, to obtain the optimal population P'''; merge the initial population P with the optimal population P''' and apply the elitist environmental selection strategy to select the next-generation population `P. The crossover probability is p_c ∈ [0.5, 0.8], the mutation probability is p_r = 1/n, and the guided search probability is 0.05~0.1.
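The three operators of step E can be sketched independently. The parameter values follow the ranges given above, and `osr` here is just an example depth:

```python
import random

random.seed(4)
n = 8
pc, pr, pg = 0.6, 1 / n, 0.05   # crossover, mutation, guided-search probabilities

def uniform_crossover(a, b):
    # with probability pc, take each bit from either parent with equal chance
    if random.random() > pc:
        return list(a)
    return [a[j] if random.random() < 0.5 else b[j] for j in range(n)]

def bit_mutation(x):
    # flip each bit independently with probability p_r = 1/n
    return [g ^ (random.random() < pr) for g in x]

def guided_search(x, x_best, osr):
    # replace the first osr positions with those of the best scheme
    return list(x_best[:osr]) + list(x[osr:])

x_best = [1, 1, 1, 0, 0, 0, 0, 0]
child = bit_mutation(uniform_crossover([1] * n, [0] * n))
if random.random() < pg:
    child = guided_search(child, x_best, osr=3)
```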
Preferably, if t < MaxEval, set P = `P, empty the optimized population P', and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme X_best and verify its classification error rate on the test set.
The history-data-based cancer gene feature selection method provided by the invention has the following advantages: it effectively reduces data dimensionality and improves prediction accuracy, and by combining the feature tree with a genetic algorithm it screens genes associated with cancer and other diseases, providing support for diagnosis and treatment.
Detailed description of the invention
Fig. 1 is the flow chart of the history-data-based cancer gene feature selection method provided by the embodiment of the present invention;
Fig. 2 is the flow chart of the algorithm for optimizing the population through the feature tree provided by the embodiment of the present invention;
Fig. 3 is the first schematic diagram of the method of optimizing the population through the feature tree provided by the embodiment of the present invention;
Fig. 4 is the second schematic diagram of that method;
Fig. 5 is the third schematic diagram of that method;
Fig. 6 is the fourth schematic diagram of that method;
Fig. 7 is the fifth schematic diagram of that method.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
As shown in Fig. 1, the present embodiment provides a history-data-based feature selection method for cancer gene expression data, comprising the following steps:
Step A: divide the cancer gene data into a training data set and a test data set; specifically, divide the cancer gene data into ten parts, of which seven serve as the training data set and three as the test data set.
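A minimal sketch of this 7:3 split, on synthetic sample indices:

```python
import random

random.seed(5)
samples = list(range(100))          # indices of hypothetical patient samples
random.shuffle(samples)
parts = [samples[k::10] for k in range(10)]       # ten roughly equal parts
train = [s for part in parts[:7] for s in part]   # seven parts for training
test = [s for part in parts[7:] for s in part]    # three parts for testing
```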
Step B: compute, by five-fold cross-validation on the training data set, the overall average error rate with all features selected. Specifically, divide the training data set into five parts; take each part in turn as the test fold and the other four parts as the training fold, and obtain the error rate erro_k = CE_k / N, where CE_k is the number of misclassified samples when the k-th part serves as the test fold and N is the total number of samples in that fold; the five error rates so obtained are expressed as Erro = {erro_1, erro_2, erro_3, erro_4, erro_5}, and the overall average error rate is computed as their mean. These steps follow the prior art used in this field for processing sample data with genetic algorithms and are not discussed in detail here.
Step C: randomly generate the initial feature population, a two-dimensional matrix expressed as P = {X_1, X_2, ..., X_N}, where N is the population size; X_i = {x_i1, x_i2, ..., x_in} is a binary string representing one feature selection scheme; x_ij = 1 indicates that the j-th gene feature of the i-th scheme is selected, and x_ij = 0 indicates that the corresponding feature is not selected; n is the total number of gene features.
Construct the fitness function f for evaluating feature selection schemes as a weighted combination of the average error rate of the scheme and the proportion of features it selects: for the current scheme X_i, f depends on N_p, the number of features selected in X_i, on err_{X_i}, the average error rate of X_i, and on α, a variable coefficient adjusting the weight between the error rate and the number of selected features; α is computed from the preset parameter α_t ∈ [0.15, 0.4], the current evaluation count t, and the maximum function evaluation count MaxEval.
Step D: record the feature selection schemes of the feature population in a feature tree one by one, adjust the distribution of the schemes to obtain the adjusted feature population, and take the scheme with the smallest fitness value as the optimal feature selection scheme. The feature tree is a binary tree, and the method is specifically:
Step i: inspect the data in the feature tree. If the tree is empty, initialize the function evaluation count t = 0 and i = 1, input the setting of MaxEval, empty the optimized population P' used to store the optimized feature selection schemes, initialize the pointer node Cur_node (which denotes the feature selection scheme stored at the tree node currently pointed to) to empty, insert scheme X_i at the root node, set the optimal feature selection scheme X_best = X_i, store X_i into the optimized population P', and go to step ii. If the tree is not empty, go directly to step ii.
Step ii: let i = i + 1, point Cur_node at the root node, and initialize the repeated-access flag flag = 0. If Cur_node is a leaf node, go to step iii. If Cur_node has two child nodes, then if X_i(depth(Cur_node)) = 1 point Cur_node at the left child, and if X_i(depth(Cur_node)) = 0 point it at the right child, where depth(Cur_node) denotes the depth of Cur_node; repeat this step until Cur_node is a leaf node.
Step iii: if the scheme X_i being inserted is identical to Cur_node, set flag = 1, flip the bit of X_i at position depth(Cur_node), and go to step iv. If X_i(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), go to step iv. If X_i(depth(Cur_node)) = Cur_node(depth(Cur_node)), compute the Hamming distance d1 = hamming(Cur_node, X_best) between Cur_node and the current optimal scheme X_best, and the Hamming distance d2 = hamming(X_i, X_best) between the inserted scheme X_i and X_best.
If d1 < d2, flip the bit of Cur_node at position depth(Cur_node) and re-evaluate its fitness; if f(Cur_node) < f(X_best), set X_best = Cur_node, otherwise leave X_best unchanged; update the evaluation count t = t + 1 and go to step iv. If d1 > d2, flip the bit of X_i at position depth(Cur_node) and go to step iv.
Step iv: if X_i(depth(Cur_node)) = 1, insert X_i into the feature tree as the left child of Cur_node and insert Cur_node as its own right child; if X_i(depth(Cur_node)) = 0, insert X_i as the right child of Cur_node and insert Cur_node as its own left child.
Step v: store X_i into the optimized population P'; if f(X_i) < f(X_best), set X_best = X_i, otherwise leave it unchanged, and update t = t + 1 after the evaluation. If t ≥ MaxEval or mod(i/N) = 0, feed the optimized population P', the optimal scheme X_best, and its depth osr in the feature tree back to the genetic operator and the guided search operator respectively, where mod(i/N) denotes the remainder of i divided by N; if t < MaxEval, go to step i.
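The Hamming distance used in step iii to arbitrate between the pointer node and the incoming scheme simply counts differing positions; the bit strings below are illustrative values, not data from the patent:

```python
def hamming(a, b):
    # number of positions at which two equal-length bit strings differ
    return sum(x != y for x, y in zip(a, b))

x_best = [1, 1, 0, 0, 1, 0]
cur_node = [1, 0, 1, 1, 0, 0]
x_i = [1, 1, 0, 0, 1, 1]
d1 = hamming(cur_node, x_best)   # distance of the pointer node to X_best
d2 = hamming(x_i, x_best)        # distance of the inserted scheme to X_best
```

Here d1 = 4 > d2 = 1, so under step iii the inserted scheme's bit would be flipped.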
Step E: guide the evolutionary direction of the feature population with the genetic operator and the guided search operator: use the optimized population P' as the parent population of the genetic operator and perform uniform crossover and bitwise mutation to obtain the offspring population P''; select part of the feature selection schemes from P'' with the guided search probability and perform the guided search, replacing the first osr values of each selected scheme with the first osr values of the optimal scheme X_best, to obtain the optimal population P'''; merge the initial population P with the optimal population P''' and apply the elitist environmental selection strategy to select the next-generation population `P. The crossover probability is p_c ∈ [0.5, 0.8], the mutation probability is p_r = 1/n, and the guided search probability is 0.05~0.1.
Step F: check the termination condition. If t < MaxEval, set P = `P, empty the optimized population P', and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme X_best and verify its classification error rate on the test set.
The adjustment of the arrangement of feature selection schemes in the feature tree is now explained concretely, using the data of Table 1 as the initial population P:
Table 1: initial population
The parameters are configured first. From Table 1, N = 5 and n = 6, so the mutation probability is p_r = 1/6; the guided search probability is 0.05, and we let p_c = 0.5 and α_t = 0.2. The feature tree is empty at the start, so the function evaluation count is initialized to t = 0, with i = 1 and MaxEval = 10. With reference to Fig. 3, X_1 is inserted at the root node, the optimal scheme is set to X_best = X_1, at which point f(X_best) = 5, and X_1 is stored into the empty optimized population P'.
Inserting feature selection scheme X_2:
In step ii, i = 2 and the pointer node Cur_node points at the root node, i.e. Cur_node = X_1, with the repeated-access flag flag = 0. Cur_node is a leaf node, so we go to step iii. The scheme being inserted is X_2, which differs from Cur_node, so flag = 0; here depth(Cur_node) = 1 and X_2(depth(Cur_node)) = 1, which equals Cur_node(depth(Cur_node)) = 1, so the Hamming distances of Cur_node and X_2 to X_best are compared. As shown in Fig. 4, d1 = 0 and d2 = 4; since d1 < d2, the bit of Cur_node at position depth(Cur_node) is flipped, i.e. x_11 = 0. Because the pointer node Cur_node has changed, its fitness must be re-evaluated: f(Cur_node) = 2, and since Cur_node is X_best itself (the modified X_1), f(X_best) = 2 as well and X_best is unchanged; the evaluation count becomes t = 1 and we go to step iv. Since X_2(depth(Cur_node)) = 1, X_2 is inserted into the feature tree as the left child of Cur_node and the modified Cur_node as its right child. X_2 is then stored into the optimized population P'; since f(X_best) = 2 < f(X_2) = 4, X_best is unchanged and the evaluation count becomes t = 2. At this point t < MaxEval and mod(i/N) ≠ 0, so we go to step i and continue the insertion.
Inserting feature selection scheme X_3:
With reference to Fig. 5, the feature tree is not empty, so step ii is executed: i = 3 and the pointer node Cur_node points at the root node, which now has two children. Since X_3(depth(Cur_node)) = x_31 = 1, the pointer moves to the left child, i.e. Cur_node = X_2, which is a leaf node, so step iii is executed. X_3 differs from Cur_node, i.e. flag = 0; here x_32 = 0 and Cur_node(depth(Cur_node)) = x_22 = 0, so the Hamming distances of Cur_node and X_3 to X_best must be computed, giving d1 = 4 > d2 = 2; the bit of X_3 at position depth(Cur_node) is therefore flipped, i.e. x_32 = 1, and step iv is entered: X_3 is inserted as the left child of Cur_node and Cur_node as its right child. X_3 is stored into the population P'; X_best has not changed, so its fitness need not be re-evaluated and f(X_best) = 2; evaluating the fitness of the inserted scheme gives f(X_best) = 2 < f(X_3) = 5, so X_best is unchanged and the evaluation count becomes t = 3. At this point t < MaxEval and mod(i/N) ≠ 0, so we go to step i and continue the insertion.
Inserting feature selection scheme X_4:
With reference to Fig. 6, the feature tree is not empty, so step ii is executed: i = 4 and the pointer node Cur_node points at the root node, which has two children. Since X_4(depth(Cur_node)) = x_41 = 1, the pointer moves to the left child, i.e. Cur_node = X_2, which still has two children, so step ii continues: now X_4(depth(Cur_node)) = x_42 = 0, so the pointer moves to the right child, i.e. Cur_node = X_2, which is a leaf node, and step iii is executed. From Table 1 we know X_4 = Cur_node, i.e. flag = 1, so the bit of X_4 at position depth(Cur_node) is flipped, i.e. x_43 = 1, and we go to step iv: the updated X_4 is inserted as the left child of Cur_node and Cur_node as its right child. X_4 is stored into the population P'; since f(X_best) = 2 > f(X_4) = 1, we set X_best = X_4 and the evaluation count becomes t = 4. At this point t < MaxEval and mod(i/N) ≠ 0, so we go to step i and continue the insertion.
Inserting feature selection scheme X_5:
With reference to Fig. 7, the feature tree is not empty, so step ii is executed: i = 5 and the pointer node Cur_node points at the root node, which has two children. Since X_5(depth(Cur_node)) = x_51 = 0, the pointer moves to the right child, i.e. Cur_node = X_1, which is a leaf node, so step iii is executed. Cur_node differs from X_5, so flag = 0; here Cur_node(depth(Cur_node)) = x_12 = 0 while the scheme being inserted has X_5(depth(Cur_node)) = x_52 = 1, so step iv is executed: X_5 is inserted as the left child of Cur_node and Cur_node as its right child. X_5 is stored into the population P'; since f(X_best) = 1 < f(X_5) = 6, X_best is unchanged and the evaluation count becomes t = 5. At this point mod(i/N) = 0, so the optimized population P', the optimal scheme X_best, and its depth osr in the feature tree are fed back to the genetic operator and the guided search operator respectively. The optimized population P' is processed with the uniform crossover and bitwise mutation operations of the existing genetic algorithm to guide the population evolution, finally yielding the population `P. Since t < MaxEval, we set P = `P, empty the optimized population P', and return to step D to continue the insertion until the termination condition is met.
The present embodiment verifies the validity of the proposed algorithm on data sets from the scikit-feature open-source feature selection library project of Arizona State University. As shown in Table 2, a colon cancer data set Colon, a lung cancer data set Lung, and a glioma data set Glioma were chosen for verification, with the error rate and the number of selected features as evaluation indices: the lower the error rate and the smaller the number of features, the better the algorithm's performance.
Table 2: cancer gene data sets
The classical genetic algorithm GA and the particle swarm optimization algorithm PSO were chosen for simulation comparison with the algorithm provided by this embodiment, which is denoted BSPGA. The results are shown in Table 3, from which it can be seen that the performance of BSPGA is clearly better than that of the GA and PSO algorithms.
Table 3: simulation comparison
The application uses cancer data to illustrate the validity and usage scenarios of the algorithm, but the algorithm is equally effective for other diseases related to genetic variation.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made by those of ordinary skill in the art without departing from the spirit and principles of the present invention shall fall within the protection scope determined by the claims of the present invention.
Claims (7)
1. A history-data-based feature selection method for cancer gene expression data, characterized by comprising the following steps:
Step A: dividing the cancer gene data into a training data set and a test data set;
Step B: computing, by five-fold cross-validation on the training data set, the overall average error rate with all features selected;
Step C: randomly generating an initial feature population and constructing a fitness function for evaluating feature selection schemes;
Step D: recording the feature selection schemes of the feature population in a feature tree one by one, adjusting the distribution of the schemes to obtain the adjusted feature population, and taking the scheme with the smallest fitness value as the optimal feature selection scheme;
Step E: guiding the evolutionary direction of the feature population with the genetic operator and the guided search operator: using the feature population optimized by the feature tree as the parent population and applying the genetic operator to generate the offspring population; using the position of the optimal feature selection scheme in the feature tree as the search direction and applying the guided search operator to strengthen the local search;
Step F: checking the termination condition; if it is not met, repeating steps D~F; if it is met, outputting the optimal feature selection scheme and verifying its classification error rate on the test set.
2. The cancer disease gene feature selection method based on historical data according to claim 1, characterized in that the method of calculating, in step B, the average classification error rate on the training data set when all features are selected, by five-fold cross-validation, is as follows:
divide the training data set into five parts; take one part in turn as the test set and the other four parts as the training set, and obtain the error rate errok = CEk / N, where CEk denotes the number of misclassified samples of the feature selection scheme when the k-th part serves as the test set and N denotes the total number of samples in the test set; the five resulting error rates are denoted Erro = {erro1, erro2, erro3, erro4, erro5}, and the average error rate is calculated as the mean of the five values, (erro1 + erro2 + erro3 + erro4 + erro5) / 5.
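The five-fold procedure of claim 2 can be sketched as below. This is a minimal illustration only: the `classify` callable and the majority-vote stand-in classifier are assumptions, since the claim does not fix a particular classifier.

```python
import random

def five_fold_error_rates(samples, labels, classify, seed=0):
    """Split the data into five parts; each part in turn serves as the
    test fold while the other four form the training fold. Returns the
    five fold error rates and their mean, as in claim 2:
        erro_k = CE_k / N,  mean = (erro_1 + ... + erro_5) / 5."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::5] for k in range(5)]   # five roughly equal parts
    rates = []
    for k in range(5):
        test = folds[k]
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        predict = classify([samples[i] for i in train],
                           [labels[i] for i in train])
        ce = sum(predict(samples[i]) != labels[i] for i in test)  # CE_k
        rates.append(ce / len(test))                              # erro_k
    return rates, sum(rates) / 5.0

# Trivial stand-in classifier: always predicts the majority training label.
def majority_classifier(train_x, train_y):
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority
```

In practice the classifier would be whatever model evaluates a candidate feature subset; only the fold bookkeeping above follows the claim.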
3. The cancer disease gene feature selection method based on historical data according to claim 2, characterized in that the initial feature population is a two-dimensional matrix expressed as P = {X1, X2, ..., XN}, where N is the population size; Xi = {xi1, xi2, ..., xin} is a binary string representing one feature selection scheme; xij = 1 indicates that the j-th feature gene of the i-th feature selection scheme is selected, and xij = 0 indicates that the corresponding feature is not selected; n denotes the total number of feature genes.
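The population encoding of claim 3 amounts to a random N×n binary matrix; a minimal sketch, with the function name chosen here for illustration:

```python
import random

def init_population(N, n, seed=None):
    """Randomly generate the initial feature population P = {X1..XN}.
    Each Xi is a binary string of length n; xij = 1 means the j-th
    feature gene is selected in the i-th scheme (claim 3)."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(N)]
```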
4. The cancer disease gene feature selection method based on historical data according to claim 3, characterized in that the fitness function f constructed in step C combines the classification error rate and the number of selected features, wherein Np denotes the number of selected features in the current feature selection scheme Xi, and the average error rate of the feature selection scheme Xi is used; α is a variable coefficient used to adjust the weights of the error rate and the selected feature count in the function, wherein αt ∈ [0.15, 0.4] is a preset parameter, t denotes the current evaluation count, and MaxEval is the maximum number of function evaluations.
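The precise expressions for f and for α are rendered as images in the source text and are not recoverable here, so the weighted-sum form and the α schedule below are assumptions made purely for illustration, consistent with the stated ingredients (average error rate, selected-feature count Np, preset αt, and the ratio t/MaxEval):

```python
def alpha(t, max_eval, alpha_t=0.25):
    """Variable coefficient. alpha_t in [0.15, 0.4] is the preset
    parameter, t the current evaluation count, max_eval the budget.
    The linear growth with t/max_eval is an assumed schedule."""
    return alpha_t * (1.0 + t / float(max_eval))

def fitness(scheme, mean_error, t, max_eval, alpha_t=0.25):
    """Assumed weighted sum: lower error and fewer selected features
    both lower f (smaller f is better, as in step D of claim 1)."""
    n_selected = sum(scheme)            # Np: number of selected features
    a = alpha(t, max_eval, alpha_t)
    return (1.0 - a) * mean_error + a * n_selected / len(scheme)
```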
5. The cancer disease gene feature selection method based on historical data according to claim 4, characterized in that the feature tree described in step D is a binary tree, and the method of recording all feature selection schemes of the initial population into the feature tree comprises the following steps:
Step i: check the data in the feature tree; if the feature tree is empty, initialize the function evaluation count t = 0 and i = 1, input the preset value of MaxEval, empty the optimized population P' used to store the optimized feature selection schemes, and initialize the pointer node Cur_node to empty, where the pointer node denotes the feature selection scheme stored in the feature-tree node it points to; insert the feature selection scheme Xi into the root node, set the optimal feature selection scheme Xbest = Xi, store Xi into the optimized population P', and go to step ii; if the feature tree is non-empty, go directly to step ii;
Step ii: set i = i + 1, point the pointer node Cur_node at the root node, and initialize the repeated-access flag flag = 0; if Cur_node is a leaf node, go to step iii; if the pointer node Cur_node has two child nodes and the current feature selection scheme satisfies Xi(depth(Cur_node)) = 1, point Cur_node at its left child node; if Xi(depth(Cur_node)) = 0, point Cur_node at its right child node, where depth(Cur_node) denotes the depth of the pointer node Cur_node; repeat the current step until Cur_node is a leaf node;
Step iii: if the feature selection scheme Xi currently being inserted is identical to the pointer node Cur_node, set flag = 1, perform the corresponding update operation, and go to step iv; if Xi(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), go to step iv; if Xi(depth(Cur_node)) = Cur_node(depth(Cur_node)), calculate respectively the Hamming distance d1 = hamming(Cur_node, Xbest) between the pointer node Cur_node and the current optimal feature selection scheme Xbest, and the Hamming distance d2 = hamming(Xi, Xbest) between the inserted feature selection scheme Xi and Xbest;
if d1 < d2, perform the corresponding update operation and judge whether f(Cur_node) < f(Xbest): if so, set Xbest = Cur_node, otherwise leave it unchanged; update the function evaluation count t = t + 1 and go to step iv; if d1 > d2, perform the corresponding update operation and go to step iv;
Step iv: if Xi(depth(Cur_node)) = 1, insert Xi into the feature tree as the left child node of Cur_node and insert Cur_node into the feature tree as the right child node of Cur_node; if Xi(depth(Cur_node)) = 0, insert Xi into the feature tree as the right child node of Cur_node and insert Cur_node into the feature tree as the left child node of Cur_node;
Step v: store Xi into the optimized population P'; if f(Xi) < f(Xbest), set Xbest = Xi, otherwise leave it unchanged; after this operation, update the function evaluation count by t = t + 1; if t ≥ MaxEval or mod(i, N) = 0, feed the optimized population P', the optimal feature selection scheme Xbest, and its depth osr in the feature tree back to the genetic operator and the guided-search operator respectively, where mod(i, N) denotes the remainder of i divided by N; otherwise, go to step i.
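Steps i–iv describe a binary tree in which the inserted scheme's bit at the current node's depth selects the left branch (bit 1) or the right branch (bit 0). The sketch below is a simplified illustration of that routing and of the step-iv split; the flag/Hamming-distance exchange of step iii is omitted because its update formulas appear only as images in the source.

```python
class FeatureTreeNode:
    """One node of the feature tree: stores one feature selection
    scheme; children are chosen by the bit at this node's depth."""
    def __init__(self, scheme, depth=0):
        self.scheme = scheme
        self.depth = depth
        self.left = None    # followed when the routed bit is 1
        self.right = None   # followed when the routed bit is 0

def insert(root, scheme):
    """Insert a scheme: descend while the current node has children,
    branching on scheme[depth]; at a leaf, push the leaf's old scheme
    down next to the new one (step iv). Duplicates are left in place
    (step iii's exchange operation is omitted in this sketch)."""
    if root is None:
        return FeatureTreeNode(scheme)
    cur = root
    while cur.left is not None and cur.right is not None:
        cur = cur.left if scheme[cur.depth] == 1 else cur.right
    if scheme == cur.scheme:
        return root
    d = cur.depth
    if scheme[d] == 1:
        cur.left = FeatureTreeNode(scheme, d + 1)
        cur.right = FeatureTreeNode(cur.scheme, d + 1)
    else:
        cur.right = FeatureTreeNode(scheme, d + 1)
        cur.left = FeatureTreeNode(cur.scheme, d + 1)
    return root
```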
6. The cancer disease gene feature selection method based on historical data according to claim 5, characterized in that the method described in step E is as follows: take the optimized population P' as the parent population of the genetic operator, and perform the uniform crossover operation and the standard mutation operation to obtain the offspring population P''; select part of the feature selection schemes from the offspring population P'' according to the guided-search probability and perform guided search on them: replace the first osr values of each selected feature selection scheme with the first osr values of the optimal feature selection scheme Xbest, obtaining the optimal population P'''; merge the initial population P with the optimal population P''' and, using the elite environmental selection strategy, select the next-generation population `P; wherein the crossover probability pc ∈ [0.5, 0.8], the mutation probability pr = 1/n, and the guided-search probability is 0.05 to 0.1.
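The operators of claim 6 (uniform crossover, standard bit-flip mutation with pr = 1/n, and the guided search that copies the first osr bits of Xbest) can be sketched as follows. Names and defaults are illustrative, osr is assumed given, and the elite environmental selection step is omitted:

```python
import random

def uniform_crossover(a, b, pc=0.6, rng=random):
    """Uniform crossover: with probability pc (pc in [0.5, 0.8]),
    each gene position is swapped between the parents with equal
    chance; otherwise the parents are copied unchanged."""
    if rng.random() >= pc:
        return a[:], b[:]
    c1, c2 = a[:], b[:]
    for j in range(len(a)):
        if rng.random() < 0.5:
            c1[j], c2[j] = c2[j], c1[j]
    return c1, c2

def mutate(x, rng=random):
    """Standard bit-flip mutation with probability pr = 1/n per gene."""
    pr = 1.0 / len(x)
    return [bit ^ 1 if rng.random() < pr else bit for bit in x]

def guided_search(x, x_best, osr):
    """Guided search of step E: replace the first osr values of a
    selected scheme with the first osr values of Xbest."""
    return x_best[:osr] + x[osr:]
```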
7. The cancer disease gene feature selection method based on historical data according to claim 6, characterized in that: if t < MaxEval, set P = `P, empty the optimized population P', and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme Xbest and verify the classification error rate of the optimal feature selection scheme Xbest on the test set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910355711.3A CN110070916B (en) | 2019-04-29 | 2019-04-29 | Historical data-based cancer disease gene characteristic selection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110070916A true CN110070916A (en) | 2019-07-30 |
CN110070916B CN110070916B (en) | 2023-04-18 |
Family
ID=67369518
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910355711.3A Active CN110070916B (en) | 2019-04-29 | 2019-04-29 | Historical data-based cancer disease gene characteristic selection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110070916B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009067655A2 (en) * | 2007-11-21 | 2009-05-28 | University Of Florida Research Foundation, Inc. | Methods of feature selection through local learning; breast and prostate cancer prognostic markers |
CN109242100A (en) * | 2018-09-07 | 2019-01-18 | 浙江财经大学 | A kind of Niche Genetic method on multiple populations for feature selecting |
Non-Patent Citations (1)
Title |
---|
范方云; 孙俊: "Cancer feature gene selection and classification based on the BQPSO algorithm" (基于BQPSO算法的癌症特征基因选择与分类) * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111243662A (en) * | 2020-01-15 | 2020-06-05 | 云南大学 | Pan-cancer gene pathway prediction method, system and storage medium based on improved XGboost |
CN111243662B (en) * | 2020-01-15 | 2023-04-21 | 云南大学 | Method, system and storage medium for predicting genetic pathway of pan-cancer based on improved XGBoost |
CN112580606A (en) * | 2020-12-31 | 2021-03-30 | 安徽大学 | Large-scale human body behavior identification method based on clustering grouping |
CN112580606B (en) * | 2020-12-31 | 2022-11-08 | 安徽大学 | Large-scale human body behavior identification method based on clustering grouping |
Also Published As
Publication number | Publication date |
---|---|
CN110070916B (en) | 2023-04-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||