CN110070916A - Historical-data-based method for selecting cancer disease gene expression features - Google Patents

Historical-data-based method for selecting cancer disease gene expression features Download PDF

Info

Publication number
CN110070916A
CN110070916A CN201910355711.3A CN201910355711A
Authority
CN
China
Prior art keywords
node
cur
feature
population
scheme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910355711.3A
Other languages
Chinese (zh)
Other versions
CN110070916B (en)
Inventor
邱剑锋
郭能
张兴义
苏延森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN201910355711.3A priority Critical patent/CN110070916B/en
Publication of CN110070916A publication Critical patent/CN110070916A/en
Application granted granted Critical
Publication of CN110070916B publication Critical patent/CN110070916B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 - Selection of the most significant subset of features
    • G06F 18/2111 - Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B 40/20 - Supervised data analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Physiology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a historical-data-based method for selecting cancer disease gene expression features, comprising the following steps: A: dividing the cancer disease gene data into a training set and a test set; B: computing the overall average error rate on the training set when all features are selected; C: generating an initial population and constructing a fitness function; D: recording all feature selection schemes in a feature tree, adjusting the distribution of the feature selection schemes, taking the scheme with the smallest fitness value as the optimal feature selection scheme, and returning the result to the genetic operators and the guided search operator; E: guiding the evolutionary direction of the feature population; F: checking the termination condition; if it is not met, repeating steps D-F, and if it is met, outputting the optimal solution. The invention has the advantages that it can effectively reduce the data dimensionality and improve prediction accuracy, and that, by combining the feature tree with a genetic algorithm, it screens genes related to cancer and similar diseases, providing assistance for diagnosis and treatment.

Description

Historical-data-based method for selecting cancer disease gene expression features
Technical field
The present invention relates to the technical field of disease-causing gene screening, and more particularly to a historical-data-based method for selecting cancer disease gene expression features.
Background technique
Cancer, the malignant tumour most commonly seen in humans, seriously affects people's physical and mental health and has attracted great attention from experts and scholars in different fields. Cancer patients generate a large amount of clinical data during examination, treatment and medication, and these data are of great significance for predicting the occurrence and development of malignant tumours. However, they are often high-dimensional, heterogeneous, partially corrupted, and available only in small samples.
With the rapid development of computer technology, it is natural to process such complex data by computer. By building suitable machine-learning prediction models, the occurrence and development of cancer can be predicted, providing technical guidance and advice for patients' rehabilitation and postoperative care. Cancer disease gene data are high-dimensional and redundant and frequently contain anomalies and missing values, so before these data can be processed they usually need to be cleaned to remove irrelevant and redundant entries.
Simple data cleaning, however, cannot effectively solve the high dimensionality of cancer disease gene data. For high-dimensional clinical data, specific engineering methods and targeted algorithms are therefore needed to reduce the dimensionality and improve the accuracy of the prediction model. Such methods should be designed with two considerations. First, the model must be practical: the prediction model is embedded in an assessment system into which the doctor inputs the patient's condition, and if a prediction requires a large number of sample data it will increase the doctor's burden and have no practical value. Second, the accuracy of the model's predictions must be considered: an accurate prediction model can assist the doctor's diagnosis, aid the patient's rehabilitation, and provide advice for reducing the risk of relapse.
Accordingly, a large number of methods have been proposed for cancer disease gene data that can select the important features, reduce the data dimensionality, and improve the accuracy of the prediction model. These methods mainly include single-factor analysis, recursive feature elimination and feature-importance analysis. However, existing methods often suffer from weak dimensionality reduction, many redundant features, excessive time complexity and poor practicality. For the feature selection problem on high-dimensional data, the engineering methods to be developed need to reduce the time complexity of the method while guaranteeing the accuracy of the prediction model.
Summary of the invention
The technical problem to be solved by the present invention is to provide a cancer gene feature selection method that can effectively reduce the data dimensionality and improve the accuracy of the prediction model.
The present invention solves the above technical problem through the following technical solution:
A historical-data-based method for selecting cancer disease gene expression features, comprising the following steps:
Step A: dividing the cancer disease gene data into a training data set and a test data set;
Step B: computing, by five-fold cross-validation on the training data set, the overall average error rate when all features are selected;
Step C: randomly generating an initial feature population and constructing the fitness function used to evaluate feature selection schemes;
Step D: recording the feature selection schemes of the feature population in a feature tree one by one, adjusting the distribution of the feature selection schemes to obtain the adjusted feature population, and taking the scheme with the smallest fitness value as the optimal feature selection scheme;
Step E: guiding the evolutionary direction of the feature population with the genetic operators and the guided search operator: taking the feature population optimized by the feature tree as the parent population and generating the offspring population with the genetic operators; taking the position of the optimal feature selection scheme in the feature tree as the search direction and reinforcing the local search with the guided search operator;
Step F: checking the termination condition; if it is not met, repeating steps D-F; if it is met, outputting the optimal feature selection scheme and verifying its classification error rate on the test set.
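The overall loop of steps C-F can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented algorithm itself: the feature tree, the guided search operator and the concrete classifier are replaced by a plain elitist loop with single-bit mutation, and all names (`run_feature_selection`, `evaluate`, `mutate_one`) are hypothetical.

```python
import random

def mutate_one(x, rng):
    """Flip one randomly chosen gene of a binary scheme (copying, not in place)."""
    j = rng.randrange(len(x))
    y = list(x)
    y[j] = 1 - y[j]
    return y

def run_feature_selection(evaluate, n, N=20, max_eval=200, seed=0):
    """Minimal stand-in for steps C-F: random binary init, then repeated
    evaluate / elitist-select / vary until the evaluation budget is spent.
    `evaluate` maps a binary scheme to a fitness value; lower is better."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(N)]
    best, t = None, 0
    while t < max_eval:
        scored = []
        for x in pop:
            f = evaluate(x)
            t += 1                      # function-evaluation counter t
            scored.append((f, x))
            if best is None or f < best[0]:
                best = (f, x)           # step D analogue: track the best scheme
        scored.sort(key=lambda s: s[0])  # elitist environmental selection
        elites = [x for _, x in scored[:N // 2]]
        pop = elites + [mutate_one(x, rng) for x in elites]  # variation stand-in
    return best
```

With `evaluate = sum` (i.e. minimizing the number of selected features), the loop quickly drives the best scheme toward the empty selection, illustrating the budget-bounded evaluate/select cycle.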
Preferably, the method of computing, in step B, the average classification error rate on the training data set when all features are selected, using five-fold cross-validation, is as follows:
The training data set is divided into five parts; each part in turn is taken as the test fold and the other four parts as the training folds, giving an error rate erro_k = CE_k / N, where CE_k denotes the number of misclassified samples of the feature selection scheme when the k-th part serves as the test fold, and N denotes the total number of samples in the test fold. Five error rates are thus obtained, denoted Erro = {erro_1, erro_2, erro_3, erro_4, erro_5}, and the average error rate erro_avg = (1/5) * (erro_1 + erro_2 + erro_3 + erro_4 + erro_5) is computed.
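The five-fold computation of step B can be sketched as below; the patent does not fix a particular classifier, so `classify_fn` is a hypothetical stand-in supplied by the caller, and the function name is an assumption.

```python
import numpy as np

def five_fold_error(X, y, classify_fn, seed=0):
    """Average five-fold classification error: erro_k = CE_k / N per fold,
    averaged over the five folds as in step B."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, 5)        # split the training set into five parts
    errors = []
    for k in range(5):
        test = folds[k]                   # the k-th part is the test fold
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        pred = classify_fn(X[train], y[train], X[test])
        errors.append(np.mean(pred != y[test]))  # erro_k = CE_k / N
    return float(np.mean(errors))
```

A classifier that predicts perfectly on every fold yields an average error of exactly 0.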
Preferably, the initial feature population is a two-dimensional matrix, expressed as P = {X_1, X_2, ..., X_N}, where N is the population size; X_i = {x_i1, x_i2, ..., x_in} is a binary string representing one feature selection scheme; x_ij = 1 indicates that the j-th characteristic gene of the i-th feature selection scheme is selected, and x_ij = 0 indicates that the corresponding feature is not selected; n denotes the total number of characteristic genes.
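Generating such a random binary population is straightforward; the following sketch uses a hypothetical name (`init_population`) and represents the matrix as a list of binary lists.

```python
import random

def init_population(N, n, seed=1):
    """Random binary population P = {X_1, ..., X_N}; X_i[j] = 1 means the
    j-th characteristic gene is selected by scheme i."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(N)]
```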
Preferably, the fitness function f constructed in step C is a weighted combination of the average classification error rate of a scheme and the number of features it selects,
where N_p denotes the number of selected features in the current feature selection scheme X_i, erro_avg(X_i) is the average error rate of feature selection scheme X_i, and α is a variable coefficient used to adjust, within the function, the weights of the error rate and of the number of chosen features; α is derived from a preset parameter α_t ∈ [0.15, 0.4], t denotes the current number of function evaluations, and MaxEval is the maximum number of function evaluations.
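Since the explicit formula for f is not reproduced in the text, the sketch below assumes one plausible weighting consistent with the description (a weighted sum of error rate and selected-feature fraction, lower being better); the exact form used in the patent may differ.

```python
def fitness(x, erro_avg, alpha):
    """Assumed form of the fitness f: a weighted sum of the average error
    rate and the fraction of selected features N_p / n (lower is better).
    f = alpha * erro_avg + (1 - alpha) * N_p / n is an assumption, since
    the patent's explicit formula is not reproduced in the text."""
    n = len(x)                   # total number of characteristic genes
    n_p = sum(x)                 # N_p: number of selected features
    return alpha * erro_avg + (1 - alpha) * n_p / n
```

Under this form, at equal error rate a scheme selecting fewer features receives a smaller (better) fitness value, matching the stated goal of dimensionality reduction.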
Preferably, the feature tree of step D is a binary tree, and the method of recording all feature selection schemes of the initial population in the feature tree comprises the following steps:
Step i: inspect the data in the feature tree. If the feature tree is empty, initialize the function-evaluation counter t = 0 and i = 1, input the set value of MaxEval, empty the optimization population P' used to store the optimized feature selection schemes, and initialize the pointer node Cur_node to empty, Cur_node denoting the feature selection scheme stored at the feature-tree node the pointer currently points to; insert feature selection scheme X_i at the root node, set the optimal feature selection scheme X_best = X_i, store X_i into the optimization population P', and go to step ii. If the feature tree is not empty, jump directly to step ii.
Step ii: set i = i + 1, point the pointer node Cur_node at the root node, and initialize the repeated-access flag flag = 0. If Cur_node is a leaf node, go to step iii. If the pointer node Cur_node has two child nodes, then if the current feature selection scheme satisfies X_i(depth(Cur_node)) = 1, point Cur_node at the left child node, and if X_i(depth(Cur_node)) = 0, point Cur_node at the right child node, where depth(Cur_node) denotes the depth of the pointer node Cur_node. Repeat this step until the pointer node Cur_node is a leaf node.
Step iii: if the feature selection scheme X_i being inserted is identical to the pointer node Cur_node, set flag = 1, flip the bit of X_i at position depth(Cur_node), and go to step iv. If X_i(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), go to step iv. If X_i(depth(Cur_node)) = Cur_node(depth(Cur_node)), compute the Hamming distance d1 = hamming(Cur_node, X_best) between the pointer node Cur_node and the current optimal feature selection scheme X_best, and the Hamming distance d2 = hamming(X_i, X_best) between the scheme X_i being inserted and X_best.
If d1 < d2, flip the bit of Cur_node at position depth(Cur_node); if f(Cur_node) < f(X_best), set X_best = Cur_node, otherwise leave X_best unchanged; update the evaluation counter t = t + 1 and go to step iv. If d1 > d2, flip the bit of X_i at position depth(Cur_node) and go to step iv.
Step iv: if X_i(depth(Cur_node)) = 1, insert X_i into the feature tree as the left child node of Cur_node and insert Cur_node as its right child node; if X_i(depth(Cur_node)) = 0, insert X_i into the feature tree as the right child node of Cur_node and insert Cur_node as its left child node.
Step v: store X_i into the optimization population P'; if f(X_i) < f(X_best), set X_best = X_i, otherwise leave it unchanged, and after the computation update the evaluation counter t = t + 1. If t ≥ MaxEval or mod(i/N) = 0, feed the optimization population P', the optimal feature selection scheme X_best and its depth osr in the feature tree back to the genetic operators and the guided search operator respectively, where mod(i/N) denotes the remainder of i/N; otherwise, go to step i.
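Steps i-v above can be illustrated with a simplified sketch of the binary feature tree. This is not the full procedure and all names are hypothetical: the flag and counter bookkeeping of steps i and v is omitted and fitness re-evaluation on flips is skipped; only the routing by bit-at-depth, the Hamming-distance-guided bit flips, and the leaf split are shown (left branch for bit 1, right for bit 0, depths counted from 0 rather than from 1 as in the text).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    scheme: Optional[list]              # stored scheme; None for internal nodes
    left: "Optional[Node]" = None       # branch taken when the bit is 1
    right: "Optional[Node]" = None      # branch taken when the bit is 0

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def insert(root, x, best):
    """Route x down the tree by its bit at the current depth; on reaching a
    leaf, diversify collisions by a single bit flip guided by Hamming
    distance to `best`, then split the leaf into two children."""
    if root is None:
        return Node(list(x))
    cur, depth = root, 0
    # step ii analogue: descend while the pointer node has two children
    while cur.left is not None and cur.right is not None:
        cur = cur.left if x[depth] == 1 else cur.right
        depth += 1
    # step iii analogue: resolve collisions at the leaf
    if x == cur.scheme:                                  # exact duplicate
        x = list(x); x[depth] = 1 - x[depth]
    elif x[depth] == cur.scheme[depth]:                  # same bit at this depth
        if hamming(cur.scheme, best) < hamming(x, best):
            cur.scheme = list(cur.scheme)
            cur.scheme[depth] = 1 - cur.scheme[depth]    # d1 < d2: flip stored node
        else:
            x = list(x); x[depth] = 1 - x[depth]         # d1 > d2: flip incoming
    # step iv analogue: split the leaf; bit 1 goes left, bit 0 goes right
    old, new = Node(cur.scheme), Node(list(x))
    cur.scheme = None
    if x[depth] == 1:
        cur.left, cur.right = new, old
    else:
        cur.left, cur.right = old, new
    return root
```

The split guarantees that after any flips the two children differ at the splitting depth, so every root-to-leaf path remains consistent with the stored schemes' bits.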
Preferably, the method of step E is as follows: take the optimization population P' as the parent population of the genetic operators and apply uniform crossover and ordinary bit mutation to obtain the offspring population P''; select, with the guided-search probability, part of the feature selection schemes in P'' and apply the guided search to them: replace the first osr values of each selected scheme with the first osr values of the optimal feature selection scheme X_best, obtaining the optimized population P'''; merge the initial population P with the optimized population P''' and, using the elitist environmental-selection strategy, select the next-generation population `P. The crossover probability is p_c ∈ [0.5, 0.8], the mutation probability is p_r = 1/n, and the guided-search probability is 0.05~0.1.
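The three operators of step E can be sketched as follows; the function names are hypothetical, and `osr` is the depth of the optimal scheme in the feature tree fed back by step D.

```python
import random

def uniform_crossover(a, b, pc, rng):
    """Uniform crossover: with probability pc, swap genes position-wise
    (each position independently with probability 0.5)."""
    c1, c2 = list(a), list(b)
    if rng.random() < pc:
        for j in range(len(a)):
            if rng.random() < 0.5:
                c1[j], c2[j] = c2[j], c1[j]
    return c1, c2

def mutate(x, pr, rng):
    """Ordinary bit mutation with per-gene probability pr = 1/n."""
    return [1 - b if rng.random() < pr else b for b in x]

def guided_search(x, best, osr):
    """Guided operator: copy the first osr genes of the best scheme into x."""
    return list(best[:osr]) + list(x[osr:])
```

Crossing all-ones with all-zeros leaves exactly one 1 per position across the two children, and `guided_search` overwrites only the leading `osr` genes, which is the local-search reinforcement described above.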
Preferably, if t < MaxEval, set P = `P, empty the optimization population P', and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme X_best and verify the classification error rate of X_best on the test set.
The advantages of the historical-data-based cancer gene feature selection method provided by the invention are that it can effectively reduce the data dimensionality and improve prediction accuracy, and that, by combining the feature tree with a genetic algorithm, it screens the genes related to cancer and similar diseases, providing assistance for diagnosis and treatment.
Detailed description of the invention
Fig. 1 is the flowchart of the historical-data-based method for selecting cancer disease gene expression features provided by the embodiment of the present invention;
Fig. 2 is the flowchart of the algorithm for optimizing the population through the feature tree provided by the embodiment of the present invention;
Fig. 3 is the first schematic diagram of the method of optimizing the population through the feature tree provided by the embodiment of the present invention;
Fig. 4 is the second schematic diagram of the method of optimizing the population through the feature tree provided by the embodiment of the present invention;
Fig. 5 is the third schematic diagram of the method of optimizing the population through the feature tree provided by the embodiment of the present invention;
Fig. 6 is the fourth schematic diagram of the method of optimizing the population through the feature tree provided by the embodiment of the present invention;
Fig. 7 is the fifth schematic diagram of the method of optimizing the population through the feature tree provided by the embodiment of the present invention.
Specific embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
As shown in Fig. 1, the present embodiment provides a historical-data-based method for selecting cancer disease gene expression features, comprising the following steps:
Step A: dividing the cancer disease gene data into a training data set and a test data set; specifically, the cancer disease gene data are divided into ten parts, of which seven parts serve as the training data set and three parts as the test data set.
Step B: computing, by five-fold cross-validation on the training data set, the overall average error rate when all features are selected; specifically: the training data set is divided into five parts, each part in turn is taken as the test fold and the other four parts as the training folds, giving an error rate erro_k = CE_k / N, where CE_k denotes the number of misclassified samples of the feature selection scheme when the k-th part serves as the test fold and N denotes the total number of samples in the test fold; the five error rates so obtained are expressed as Erro = {erro_1, erro_2, erro_3, erro_4, erro_5}, and the average error rate erro_avg = (1/5) * (erro_1 + erro_2 + erro_3 + erro_4 + erro_5) is computed. These steps follow the prior art used in this field for processing sample data in genetic algorithms and are not discussed in detail here.
Step C: randomly generating the initial feature population. The initial feature population is a two-dimensional matrix expressed as P = {X_1, X_2, ..., X_N}, where N is the population size; X_i = {x_i1, x_i2, ..., x_in} is a binary string representing one feature selection scheme; x_ij = 1 indicates that the j-th characteristic gene of the i-th feature selection scheme is selected, and x_ij = 0 indicates that the corresponding feature is not selected; n denotes the total number of characteristic genes.
The fitness function f used to evaluate feature selection schemes is constructed as a weighted combination of the average classification error rate of a scheme and the number of features it selects, where N_p denotes the number of selected features in the current feature selection scheme X_i, erro_avg(X_i) is the average error rate of feature selection scheme X_i, and α is a variable coefficient used to adjust the weights of the error rate and of the number of chosen features within the function; α is derived from a preset parameter α_t ∈ [0.15, 0.4], t denotes the current number of function evaluations, and MaxEval is the maximum number of function evaluations.
Step D: recording the feature selection schemes of the feature population in the feature tree one by one, adjusting the distribution of the feature selection schemes to obtain the adjusted feature population, and taking the scheme with the smallest fitness value as the optimal feature selection scheme. The feature tree is a binary tree, and the method is specifically:
Step i: inspect the data in the feature tree. If the feature tree is empty, initialize the function-evaluation counter t = 0 and i = 1, input the set value of MaxEval, empty the optimization population P' used to store the optimized feature selection schemes, and initialize the pointer node Cur_node to empty, Cur_node denoting the feature selection scheme stored at the feature-tree node the pointer currently points to; insert feature selection scheme X_i at the root node, set the optimal feature selection scheme X_best = X_i, store X_i into the optimization population P', and go to step ii. If the feature tree is not empty, jump directly to step ii.
Step ii: set i = i + 1, point the pointer node Cur_node at the root node, and initialize the repeated-access flag flag = 0. If Cur_node is a leaf node, go to step iii. If the pointer node Cur_node has two child nodes, then if the current feature selection scheme satisfies X_i(depth(Cur_node)) = 1, point Cur_node at the left child node, and if X_i(depth(Cur_node)) = 0, point Cur_node at the right child node, where depth(Cur_node) denotes the depth of the pointer node Cur_node. Repeat this step until the pointer node Cur_node is a leaf node.
Step iii: if the feature selection scheme X_i being inserted is identical to the pointer node Cur_node, set flag = 1, flip the bit of X_i at position depth(Cur_node), and go to step iv. If X_i(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), go to step iv. If X_i(depth(Cur_node)) = Cur_node(depth(Cur_node)), compute the Hamming distance d1 = hamming(Cur_node, X_best) between the pointer node Cur_node and the current optimal feature selection scheme X_best, and the Hamming distance d2 = hamming(X_i, X_best) between the scheme X_i being inserted and X_best.
If d1 < d2, flip the bit of Cur_node at position depth(Cur_node); if f(Cur_node) < f(X_best), set X_best = Cur_node, otherwise leave X_best unchanged; update the evaluation counter t = t + 1 and go to step iv. If d1 > d2, flip the bit of X_i at position depth(Cur_node) and go to step iv.
Step iv: if X_i(depth(Cur_node)) = 1, insert X_i into the feature tree as the left child node of Cur_node and insert Cur_node as its right child node; if X_i(depth(Cur_node)) = 0, insert X_i into the feature tree as the right child node of Cur_node and insert Cur_node as its left child node.
Step v: store X_i into the optimization population P'; if f(X_i) < f(X_best), set X_best = X_i, otherwise leave it unchanged, and after the computation update the evaluation counter t = t + 1. If t ≥ MaxEval or mod(i/N) = 0, feed the optimization population P', the optimal feature selection scheme X_best and its depth osr in the feature tree back to the genetic operators and the guided search operator respectively, where mod(i/N) denotes the remainder of i/N; if t < MaxEval, go to step i.
Step E: guiding the evolutionary direction of the feature population with the genetic operators and the guided search operator: take the optimization population P' as the parent population of the genetic operators and apply uniform crossover and ordinary bit mutation to obtain the offspring population P''; select, with the guided-search probability, part of the feature selection schemes in P'' for the guided search and replace the first osr values of each selected scheme with the first osr values of the optimal feature selection scheme X_best, obtaining the optimized population P'''; merge the initial population P with the optimized population P''' and, using the elitist environmental-selection strategy, select the next-generation population `P. The crossover probability is p_c ∈ [0.5, 0.8], the mutation probability is p_r = 1/n, and the guided-search probability is 0.05~0.1.
Step F: checking the termination condition: if t < MaxEval, set P = `P, empty the optimization population P' and return to step D; if t ≥ MaxEval, output the optimal feature selection scheme X_best and verify the classification error rate of X_best on the test set.
The adjustment of the arrangement of feature selection schemes in the feature tree is explained in detail below, taking the data of Table 1 as the initial population P:
Table 1: initial population
First the parameters are configured: as shown in Table 1, N = 5, so the mutation probability is p_r = 1/6 and the guided-search probability is 0.05; let p_c = 0.5 and α_t = 0.2. The feature tree is initially empty, so the function-evaluation counter is initialized with t = 0, i = 1 and MaxEval = 10; with reference to Fig. 3, X_i is inserted at the root node of the feature tree and the optimal feature selection scheme is set to X_best = X_i, at which point f(X_best) = 5, and X_i is stored into the empty optimization population P'.
Insertion of feature selection scheme X_2:
In step ii, i = 2 and the pointer node Cur_node points at the root node, i.e. Cur_node = X_1, with the repeated-access flag flag = 0. Cur_node is a leaf node at this point, so step iii follows; the scheme being inserted is X_2, which differs from Cur_node, so flag = 0. Now depth(Cur_node) = 1 and X_2(depth(Cur_node)) = 1, the same as the pointer node's Cur_node(depth(Cur_node)) = 1, so the Hamming distances of Cur_node and of X_2 to X_best are compared; as shown in Fig. 4, d1 = 0 and d2 = 4. Since d1 < d2, the bit of the pointer node Cur_node at the current depth is flipped, i.e. x11 = 0; because the pointer node Cur_node has mutated, its fitness must be re-evaluated, giving f(Cur_node) = 2 < f(X_best), so X_best is updated to Cur_node; the evaluation counter is updated to t = 1, and step iv follows. Now X_2(depth(Cur_node)) = 1, so X_2 is inserted into the feature tree as the left child node of the pointer node Cur_node, with the modified Cur_node as its right child node. X_2 is stored into the optimization population P'; since f(X_best) = 2 < f(X_2) = 4, X_best is unchanged and the evaluation counter is updated to t = 2. Since t < MaxEval and mod(i/N) ≠ 0, the procedure returns to step i and continues with the next insertion.
It is inserted into feature selecting scheme X3:
With reference to Fig. 5, characteristics tree non-empty, executes step ii at this time, and i=3, pointer node Cur_node are directed toward root node, this When there are two child nodes, calculate the X of current signature selection scheme3(depth (Cur_node))=x31=1, pointer node refers to To left child nodes, i.e. Cur_node=X2, pointer node Cur_node is leaf node at this time, executes step iii;Because of X3 It is different from Cur_node, i.e. flag=0, at this time x32=0, and depth (Cur_node) (depth (Cur_ of pointer node Node))=x22=0, it is therefore desirable to calculate pointer node Cur_node and X3Respectively with XbestHamming distances, obtain d1=4 > d2=2, it enablesThat is x32=1, iv is entered step, By X3As the left child nodes of Cur_node, using Cur_node as in the right child nodes of Cur_node insertion characteristics tree; By X3It stores into population P ', at this time XbestDo not change, does not need to reappraise fitness, therefore f (Xbest)=2, to insertion XiEvaluation fitness, obtain f (Xbest< f (the X of)=23)=5, therefore XbestIt is constant, renewal function evaluation number to t=3, this When t < MaxEval and mod (i/N) ≠ 0, the i that gos to step continue to execute insertion operation.
It is inserted into feature selecting scheme X4:
With reference to Fig. 6, characteristics tree non-empty, executes step ii at this time, and i=4, pointer node Cur_node are directed toward root node, this When there are two child nodes, calculate the X of current signature selection scheme4(depth (Cur_node))=x41=1, pointer node refers to To left child nodes, i.e. Cur_node=X2, still there are two child nodes by pointer node Cur_node at this time, continue to execute Step ii, at this time X4(depth (Cur_node))=x42=0, pointer node is directed toward right child nodes, i.e. Cur_node=X2, Pointer node Cur_node is leaf node, executes step iii;X at this time is known that according to table 14=Cur_node, i.e. flag =1, it enablesThat is x43=1, go to step iv, after update X4As the left child nodes of Cur_node, using Cur_node as in the right child nodes of Cur_node insertion characteristics tree; By X4It stores into population P ', at this time f (Xbest> f (the X of)=24)=1, enables Xbest=X4, renewal function evaluation number to t=3, T < MaxEval and mod (i/N) ≠ 0, the i that gos to step continue to execute insertion operation at this time.
It is inserted into feature selecting scheme X5:
With reference to Fig. 7, characteristics tree non-empty, executes step ii, i=5 at this time, and pointer node Cur_node is directed toward root node, There are two child nodes, calculate the X of current signature selection scheme5(depth (Cur_node))=x51=0, pointer node is directed toward Right child nodes, i.e. Cur_node=X1, pointer node Cur_node is leaf node at this time, step iii is executed, because Cur_node and X5Difference, flag=0, at this time Cur_node (depth (Cur_node))=x of pointer node12=0, and to It is inserted into the X of featured aspects5(depth (Cur_node))=x52=1, step iv is executed, feature selecting scheme X will be inserted into5Make For the left child nodes of Cur_node, it is inserted into Cur_node as the right child nodes of Cur_node in characteristics tree;By X4It deposits It stores up into population P ', at this time f (Xbest)=1 < f (X5)=6, XbestConstant, renewal function evaluates number to t=4, at this time mod (i/N)=0 population P ', optimal feature selection scheme X, will be optimizedbestAnd its depth osr in characteristics tree is fed back to respectively Genetic operator and guiding searching operators, by using in existing genetic algorithm uniform crossover operator and normal bit mutation operation at Reason optimization population P ', guidance Evolution of Population finally obtain population `P, and t < MaxEval, enables P=`P at this time, and optimization population P ' is set Sky, return step D continues to execute insertion operation, until meeting termination condition.
The present embodiment founds university scikit-feature open source feature selecting library item mesh by using State of Arizona, US In data set verify the validity of algorithm provided by the present embodiment.As shown in table 2, a colon cancer data set is had chosen Colon, a lung cancer data set Lung and a nervousness tumor data set Glioma are verified, and select error rate and feature Quantity is lower as evaluation index, specially error rate, and feature quantity is smaller, and algorithm performance is better.
Table 2: Cancerous disease gene data
It chooses classical Genetic Algorithms and particle swarm optimization algorithm PSO and algorithm provided by the embodiment carries out emulation in fact Comparison is tested, the results are shown in Table 3, wherein algorithm provided in this embodiment is defined as BSPGA, can be seen that from experimental result The performance of algorithm provided in this embodiment is substantially better than GA algorithm and PSO algorithm.
Table 3: Simulation experiment comparison
This application uses cancer data to illustrate the validity and usage scenarios of the algorithm, but the algorithm is equally effective for other diseases related to genetic mutation.
The specific embodiments described above further explain in detail the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made by those of ordinary skill in the art without departing from the spirit and principles of the present invention shall fall within the scope of protection defined by the claims of the present invention.

Claims (7)

1. A cancer disease gene feature selection method based on historical data, characterized by comprising the following steps:
Step A: dividing the cancer disease gene data into a training data set and a test data set;
Step B: calculating, by five-fold cross-validation on the training data set, the overall average error rate when all features are selected;
Step C: randomly generating an initial feature population, and constructing a fitness function for evaluating feature selection schemes;
Step D: recording the feature selection schemes of the feature population into the feature tree in turn, adjusting the distribution of the feature selection schemes to obtain an adjusted feature population, and taking the feature selection scheme with the smallest fitness value as the optimal feature selection scheme;
Step E: guiding the evolutionary direction of the feature population by a genetic operator and a guided search operator: taking the feature population optimized by the feature tree as the parent population and generating the offspring population with the genetic operator; taking the location information of the optimal feature selection scheme within the feature tree as the search direction and reinforcing local search with the guided search operator;
Step F: judging the termination condition: if the termination condition is not reached, repeating steps D to F; if the termination condition is reached, outputting the optimal feature selection scheme and verifying the classification error rate of the optimal feature selection scheme on the test set.
2. The cancer disease gene feature selection method based on historical data according to claim 1, characterized in that: the method of calculating, described in step B, the average classification error rate on the training data set with all features selected by five-fold cross-validation is:
dividing the training data set into five parts, taking one part in turn as the test set and the other four parts as the training set, and obtaining the error rate erro_k = CE_k / N, wherein CE_k denotes the number of misclassified samples of the feature selection scheme when the k-th part serves as the test set, and N denotes the total number of samples in the test set; five error rates are thereby obtained, denoted Erro = {erro1, erro2, erro3, erro4, erro5}, and the average error rate mErro = (erro1 + erro2 + erro3 + erro4 + erro5)/5 is calculated.
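As an illustration of the computation in this claim, the per-fold error rates and their mean can be sketched as below. This is a minimal sketch: `five_fold_mean_error` is an illustrative helper name, and the per-fold misclassification counts CE_k are assumed to come from an external classifier.

```python
def five_fold_mean_error(fold_error_counts, fold_sizes):
    """Compute erro_k = CE_k / N for each of the five folds and their mean.

    fold_error_counts[k] is CE_k, the number of misclassified samples when
    fold k serves as the test set; fold_sizes[k] is that fold's sample count N.
    """
    assert len(fold_error_counts) == len(fold_sizes) == 5
    erro = [ce / n for ce, n in zip(fold_error_counts, fold_sizes)]
    m_erro = sum(erro) / 5          # average error rate over the five folds
    return erro, m_erro

# Example: five folds of 12 samples each with 3, 2, 4, 1, 2 errors respectively
erro, m_erro = five_fold_mean_error([3, 2, 4, 1, 2], [12] * 5)
```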
3. The cancer disease gene feature selection method based on historical data according to claim 2, characterized in that: the initial feature population is a two-dimensional matrix expressed as P = {X1, X2, …, XN}, where N is the population size; Xi = {xi1, xi2, …, xin} is a binary character string representing one feature selection scheme; when xij = 1, the j-th characteristic gene of the i-th feature selection scheme is selected, and when xij = 0, the corresponding feature is not selected; n denotes the total number of characteristic genes.
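The binary encoding described in this claim can be initialized as sketched below; a minimal illustration in which the sizes N and n are free parameters chosen only for the example.

```python
import random

def init_population(N, n, seed=None):
    """Randomly generate the initial feature population P = {X1, ..., XN}:
    each Xi is a binary string of length n, where bit j = 1 means the j-th
    characteristic gene is selected by the i-th feature selection scheme."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(N)]

# Illustrative sizes only: a population of N = 6 schemes over n = 10 genes
P = init_population(N=6, n=10, seed=42)
```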
4. The cancer disease gene feature selection method based on historical data according to claim 3, characterized in that: the fitness function f constructed in step C is:
wherein Np denotes the number of selected features in the current feature selection scheme Xi, the average error rate of the feature selection scheme Xi also enters f, and α is a variable coefficient used to adjust, within the function, the weights of the error rate and of the number of chosen features; αt ∈ [0.15, 0.4] is a preset parameter, t denotes the current number of evaluations, and MaxEval is the maximum number of function evaluations.
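The concrete formula of f is given in the patent as an image and is not reproduced in this text. Purely as an assumed placeholder consistent with the quantities named above (average error rate, selected-feature count Np, gene total n, weight α), a generic weighted combination might look like the sketch below; this is explicitly not the patented formula.

```python
def fitness(m_erro, n_selected, n_total, alpha):
    """Assumed placeholder (NOT the patent's exact formula): a weighted sum of
    the average error rate and the fraction of selected features, where alpha
    shifts weight between classification error and feature count."""
    return alpha * m_erro + (1.0 - alpha) * (n_selected / n_total)

# With alpha = 0.5, average error rate 0.2, and 5 of 10 features selected:
f_val = fitness(0.2, 5, 10, 0.5)
```

Under this placeholder, a smaller fitness value corresponds to both a lower error rate and fewer selected features, matching the minimization described in step D.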
5. The cancer disease gene feature selection method based on historical data according to claim 4, characterized in that: the feature tree described in step D is a binary tree, and the method of recording all feature selection schemes of the initial population into the feature tree comprises the following steps:
Step i: detecting the data in the feature tree; if the feature tree is empty, initializing the function evaluation count t = 0 and i = 1, inputting the set value of MaxEval, emptying the optimized population P' used for storing the optimized feature selection schemes, initializing the pointer node Cur_node to empty (the pointer node denotes the feature selection scheme stored at the feature tree node currently pointed to), inserting the feature selection scheme Xi as the root node, setting the optimal feature selection scheme Xbest = Xi, storing Xi into the optimized population P', and going to step ii; if the feature tree is non-empty, jumping directly to step ii;
Step ii: setting i = i + 1, pointing the pointer node Cur_node at the root node, and initializing the repeated-access flag flag = 0; if Cur_node is a leaf node, going to step iii; if the pointer node Cur_node has two child nodes and the current feature selection scheme satisfies Xi(depth(Cur_node)) = 1, pointing Cur_node at its left child node; if Xi(depth(Cur_node)) = 0, pointing Cur_node at its right child node, wherein depth(Cur_node) denotes the depth of the pointer node Cur_node; repeating the current step until Cur_node is a leaf node;
Step iii: if the feature selection scheme Xi currently being inserted is identical to the pointer node Cur_node, setting flag = 1 and going to step iv; if Xi(depth(Cur_node)) ≠ Cur_node(depth(Cur_node)), going to step iv; when Xi(depth(Cur_node)) = Cur_node(depth(Cur_node)), calculating respectively the Hamming distance d1 = hamming(Cur_node, Xbest) between the pointer node Cur_node and the current optimal feature selection scheme Xbest, and the Hamming distance d2 = hamming(Xi, Xbest) between the scheme Xi to be inserted and Xbest;
if d1 < d2, judging whether f(Cur_node) < f(Xbest): if so, setting Xbest = Cur_node, otherwise leaving Xbest unchanged; updating the function evaluation count t = t + 1 and going to step iv; if d1 > d2, going to step iv;
Step iv: if Xi(depth(Cur_node)) = 1, inserting Xi into the feature tree as the left child node of Cur_node, and inserting Cur_node itself into the feature tree as its own right child node; if Xi(depth(Cur_node)) = 0, inserting Xi into the feature tree as the right child node of Cur_node, and inserting Cur_node itself into the feature tree as its own left child node;
Step v: storing Xi into the optimized population P'; if f(Xi) < f(Xbest), setting Xbest = Xi, otherwise leaving Xbest unchanged; after this operation, updating the function evaluation count by t = t + 1; if t ≥ MaxEval or mod(i/N) = 0, feeding back the optimized population P', the optimal feature selection scheme Xbest, and its depth osr in the feature tree to the genetic operator and the guided search operator respectively, wherein mod(i/N) denotes the remainder of i divided by N; otherwise, going to step i.
6. The cancer disease gene feature selection method based on historical data according to claim 5, characterized in that: the method described in step E is: taking the optimized population P' as the parent population of the genetic operator, and executing the uniform crossover operator and standard bit mutation operation to obtain the offspring population P''; selecting part of the feature selection schemes from the offspring population P'' according to the guided search probability and executing guided search on them; replacing the first osr values of each selected feature selection scheme with the first osr values of the optimal feature selection scheme Xbest to obtain the optimal population P'''; merging the initial population P with the optimal population P''' and, using an elite environmental selection strategy, selecting the population `P that enters the next generation; wherein the crossover probability pc ∈ [0.5, 0.8], the mutation probability pr = 1/n, and the guided search probability is 0.05 to 0.1.
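The operators named in this claim can be sketched as follows. `uniform_crossover`, `bit_mutation`, and `guided_replace` are illustrative helper names, and the elite environmental selection step is not shown.

```python
import random

def uniform_crossover(p1, p2, pc, rng):
    """Uniform crossover: with probability pc, each bit position of the two
    children is taken from either parent with equal chance; otherwise the
    parents are copied through unchanged."""
    if rng.random() >= pc:
        return p1[:], p2[:]
    c1, c2 = [], []
    for a, b in zip(p1, p2):
        if rng.random() < 0.5:
            c1.append(a); c2.append(b)
        else:
            c1.append(b); c2.append(a)
    return c1, c2

def bit_mutation(x, pr, rng):
    """Standard bit mutation: flip each bit independently with probability pr."""
    return [(1 - b) if rng.random() < pr else b for b in x]

def guided_replace(x, x_best, osr):
    """Guided search step: copy the first osr values of the optimal scheme
    x_best into the selected scheme x, reinforcing search near x_best."""
    return x_best[:osr] + x[osr:]

rng = random.Random(0)
c1, c2 = uniform_crossover([0, 0, 0, 0], [1, 1, 1, 1], 1.0, rng)
g = guided_replace([0, 0, 0, 0, 0], [1, 1, 1, 1, 1], 3)
```

With complementary parents, every bit position of the two children still holds one 0 and one 1 between them; the guided replacement above yields [1, 1, 1, 0, 0].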
7. The cancer disease gene feature selection method based on historical data according to claim 6, characterized in that: if t < MaxEval, setting P = `P, emptying the optimized population P', and returning to step D; if t ≥ MaxEval, outputting the optimal feature selection scheme Xbest and verifying the classification error rate of Xbest on the test set.
CN201910355711.3A 2019-04-29 2019-04-29 Historical data-based cancer disease gene characteristic selection method Active CN110070916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910355711.3A CN110070916B (en) 2019-04-29 2019-04-29 Historical data-based cancer disease gene characteristic selection method


Publications (2)

Publication Number Publication Date
CN110070916A true CN110070916A (en) 2019-07-30
CN110070916B CN110070916B (en) 2023-04-18

Family

ID=67369518



Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009067655A2 (en) * 2007-11-21 2009-05-28 University Of Florida Research Foundation, Inc. Methods of feature selection through local learning; breast and prostate cancer prognostic markers
CN109242100A (en) * 2018-09-07 2019-01-18 浙江财经大学 A kind of Niche Genetic method on multiple populations for feature selecting


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fan Fangyun; Sun Jun: "Cancer feature gene selection and classification based on the BQPSO algorithm" *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111243662A (en) * 2020-01-15 2020-06-05 云南大学 Pan-cancer gene pathway prediction method, system and storage medium based on improved XGboost
CN111243662B (en) * 2020-01-15 2023-04-21 云南大学 Method, system and storage medium for predicting genetic pathway of pan-cancer based on improved XGBoost
CN112580606A (en) * 2020-12-31 2021-03-30 安徽大学 Large-scale human body behavior identification method based on clustering grouping
CN112580606B (en) * 2020-12-31 2022-11-08 安徽大学 Large-scale human body behavior identification method based on clustering grouping


Similar Documents

Publication Publication Date Title
Zhang et al. Binary differential evolution with self-learning for multi-objective feature selection
Khushi et al. A comparative performance analysis of data resampling methods on imbalance medical data
Reddy et al. An efficient system for heart disease prediction using hybrid OFBAT with rule-based fuzzy logic model
Molinaro et al. Tree-based multivariate regression and density estimation with right-censored data
Su et al. Facilitating score and causal inference trees for large observational studies
CN105868775A (en) Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm
CN108304887A (en) Naive Bayesian data processing system and method based on the synthesis of minority class sample
Zhong et al. A trivariate continual reassessment method for phase I/II trials of toxicity, efficacy, and surrogate efficacy
Dubey et al. A cluster-level semi-supervision model for interactive clustering
CN110211684A (en) The electrocardiogram classification method of BP neural network based on genetic algorithm optimization
CN110070916A (en) A kind of Cancerous disease gene expression characteristics selection method based on historical data
CN110110753B (en) Effective mixed characteristic selection method based on elite flower pollination algorithm and ReliefF
Yip et al. A survey of classification techniques for microarray data analysis
Mathew et al. A multimodal adaptive approach on soft set based diagnostic risk prediction system
CN112801140A (en) XGboost breast cancer rapid diagnosis method based on moth fire suppression optimization algorithm
Yang et al. Feature selection using memetic algorithms
Cheng et al. Maximizing receiver operating characteristics convex hull via dynamic reference point-based multi-objective evolutionary algorithm
Huang et al. Regularized continuous-time Markov model via elastic net
Xu et al. Gene mutation classification using CNN and BiGRU network
Wang et al. Development of a normal tissue complication probability (NTCP) model using an artificial neural network for radiation-induced necrosis after carbon ion re-irradiation in locally recurrent nasopharyngeal carcinoma
Ren et al. Multiple sparse detection-based evolutionary algorithm for large-scale sparse multiobjective optimization problems
Banerjee et al. Tree-based methods for survival data
Vivanco et al. Identifying effective software metrics using genetic algorithms
CN109360656A (en) A kind of method for detecting cancer based on multi-objective evolutionary algorithm
Chen et al. Artificial neural network prediction for cancer survival time by gene expression data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant