CN106228034A - A hybrid optimization method for tumor-related gene search

Publication number
CN106228034A
Authority
CN
China
Prior art keywords
gene
collection
svm
rfe
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610555700.6A
Other languages
Chinese (zh)
Inventor
李小波
田中娟
叶晓平
叶振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lishui University
Original Assignee
Lishui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lishui University
Priority to CN201610555700.6A
Publication of CN106228034A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computing Systems (AREA)
  • Analytical Chemistry (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Physiology (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a hybrid optimization method for tumor-related gene search, comprising: Step 1, using the support vector machine recursive feature elimination algorithm (SVM-RFE) to obtain a "ranked gene set"; Step 2, establishing a candidate gene set $\Omega_k$; Step 3, searching the solution space of the candidate gene set $\Omega_k$ with a genetic algorithm; Step 4, determining the optimal gene set, whose genes are regarded as tumor-related genes. The method requires little computation, its feasibility and effectiveness have been confirmed, and both working efficiency and prediction accuracy are markedly improved.

Description

A hybrid optimization method for tumor-related gene search
Technical field
The invention belongs to the technical field of gene search and relates to a hybrid optimization method for tumor-related gene search.
Background
Recent advances in cancer genomics research offer opportunities for personalized cancer medicine [1]. Cancer is a highly heterogeneous, systemic, and complex disease, and this heterogeneity remains a major obstacle to accurate diagnosis and treatment. Patients with the same tumor can have different pathogenic pathways, so treating a whole class of tumors with one type of therapy easily leads to overtreatment or ineffective treatment. A typical example is the anticancer drug Herceptin, an antibody that interferes with the human epidermal growth factor receptor 2 (HER2) and is effective only in patients who overexpress HER2 [2]. Personalized cancer medicine therefore highlights the need for molecular classification of tumors, which requires reliable tumor biomarkers that can predict tumor subtypes.
Many high-throughput techniques, including microarray technology, can monitor the expression values of thousands of genes simultaneously and have therefore been applied successfully to the molecular classification of tumors and to the identification of tumor biomarkers [3]. However, microarray data sets usually contain few samples (fewer than 100) and a very large number of genes (typically more than 10,000). The key problem is how to select, from these thousands of genes, a small group that can then be used to classify tumor samples accurately [4,5].
Summary of the invention
The object of the invention is to provide a hybrid optimization method for tumor-related gene search that addresses the heavy computation, low accuracy, and time-consuming, laborious processing of the prior art.
The technical solution adopted by the invention is a hybrid optimization method for tumor-related gene search, implemented according to the following steps:
Step 1: use the support vector machine recursive feature elimination algorithm to obtain the "ranked gene set"
For a linear SVM classifier there exists an optimal hyperplane, whose margin is defined by:
$$w = \sum_{i=1}^{n} \alpha_i c_i x_i, \tag{1}$$
$$\text{Margin width} = 2/\|w\|, \tag{2}$$
where $w$ is the normal vector of the optimal hyperplane;
$x_i$ is the gene expression vector of sample $i$ in the training set, $i = 1, 2, \ldots, k$, and $k$ is the number of support vectors;
$c_i \in \{-1, +1\}$ is the class label of sample $i$;
the weights $\alpha_i$ are computed from the training set. For most training vectors $\alpha_i$ is zero; a training vector whose $\alpha_i$ is nonzero is a support vector. Margin width refers to the class margin;
SVM-RFE uses backward elimination, repeatedly deleting the gene that contributes least to the SVM classifier. The objective function $J$ of SVM-RFE is defined as:
$$J = (1/2)\|w\|^2, \tag{3}$$
Expanding $J$ in a Taylor series to second order, as in the Optimal Brain Damage algorithm, approximates the change in $J$ caused by removing each gene:
$$\Delta J(i) = \frac{\partial J}{\partial w_i}\Delta w_i + \frac{\partial^2 J}{\partial w_i^2}(\Delta w_i)^2, \tag{4}$$
At the optimum of $J$ the first-order term can be neglected, and the second-order term becomes:
$$\Delta J(i) = (\Delta w_i)^2, \tag{5}$$
Since removing the $i$-th feature from the classifier corresponds to the weight change $\Delta w_i = w_i$, $(w_i)^2$ is used as the ranking criterion of SVM-RFE, and at each iteration the feature with the smallest $(w_i)^2$ is eliminated;
Step 2: establish the candidate gene set $\Omega_k$
Select the top n ranked genes as the candidate gene set; the value of parameter n depends on the microarray gene expression data set;
Step 3: search the solution space of the candidate gene set $\Omega_k$ with a genetic algorithm
Following the principle of survival of the fittest, each successive generation produces better approximate solutions. In each generation every individual is evaluated by the fitness function of the problem domain, and fitter individuals are retained; the genetic operations of crossover and mutation then create a new set of solutions. This process is repeated until a predetermined termination condition is met;
Step 4: determine the optimal gene set
Compare the gene sets obtained in step 3 with respect to the prediction accuracy of each model and the average gene subset size. When prediction accuracy is equal, select as the optimal parameter n the value that yields the smallest average gene subset; running with this optimal n gives the gene subset with the fewest genes and the highest prediction accuracy, the "optimal gene set". The genes in the "optimal gene set" are regarded as tumor-related genes.
The beneficial effect of the invention is that the method combines the respective advantages of the genetic algorithm (GA) and the support vector machine recursive feature elimination algorithm (SVM-RFE) [5-11]; its feasibility and effectiveness have been confirmed, and working efficiency and prediction accuracy are markedly improved.
Brief description of the drawings
Fig. 1 shows the 10-fold cross-validation prediction accuracy of the classifier on the prostate and NCI60 data sets as the number of genes is reduced from 100 to 1.
Detailed description of the embodiments
The method of the invention (hereinafter SVM-RFE/GA) is implemented according to the following steps:
Step 1: use the support vector machine recursive feature elimination algorithm (SVM-RFE) to obtain the "ranked gene set"
The support vector machine (SVM) is an effective method for sparse classification problems such as the classification of microarray gene expression data. For a linear SVM classifier there exists an optimal hyperplane, whose margin is defined by:
$$w = \sum_{i=1}^{n} \alpha_i c_i x_i, \tag{1}$$
$$\text{Margin width} = 2/\|w\|, \tag{2}$$
where $w$ is the normal vector of the optimal hyperplane;
$x_i$ is the gene expression vector of sample $i$ in the training set, $i = 1, 2, \ldots, k$, and $k$ is the number of support vectors;
$c_i \in \{-1, +1\}$ is the class label of sample $i$;
the weights $\alpha_i$ are computed from the training set. For most training vectors $\alpha_i$ is zero; a training vector whose $\alpha_i$ is nonzero is a support vector. Margin width refers to the class margin;
SVM-RFE is an embedded feature gene selection method [4]. It uses backward elimination, repeatedly deleting the gene that contributes least to the SVM classifier. The objective function $J$ of SVM-RFE is defined as:
$$J = (1/2)\|w\|^2, \tag{3}$$
Expanding $J$ in a Taylor series to second order, as in the Optimal Brain Damage (OBD) algorithm [12], approximates the change in $J$ caused by removing each gene:
$$\Delta J(i) = \frac{\partial J}{\partial w_i}\Delta w_i + \frac{\partial^2 J}{\partial w_i^2}(\Delta w_i)^2, \tag{4}$$
At the optimum of $J$ the first-order term can be neglected, and the second-order term becomes:
$$\Delta J(i) = (\Delta w_i)^2, \tag{5}$$
Since removing the $i$-th feature from the classifier corresponds to the weight change $\Delta w_i = w_i$, $(w_i)^2$ is used as the ranking criterion of SVM-RFE. At each iteration the feature with the smallest $(w_i)^2$ is eliminated, because it has the least influence on the classifier.
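As a small numerical illustration of equations (1) and (2) and of the ranking criterion $(w_i)^2$, the following Python sketch computes the weight vector, the margin width, and the per-gene scores. The three support vectors, their weights, and their labels are hypothetical toy values chosen for the example; they do not come from the patent's data.

```python
import math

# Hypothetical toy data: three support vectors with two genes each.
alphas = [0.5, 0.25, 0.25]                        # alpha_i (nonzero: support vectors)
labels = [+1, -1, -1]                             # c_i in {-1, +1}
samples = [[2.0, 1.0], [0.0, 1.0], [0.0, 3.0]]    # x_i, gene expression vectors


def weight_vector(alphas, labels, samples):
    """Equation (1): w = sum_i alpha_i * c_i * x_i."""
    n_genes = len(samples[0])
    w = [0.0] * n_genes
    for a, c, x in zip(alphas, labels, samples):
        for j in range(n_genes):
            w[j] += a * c * x[j]
    return w


w = weight_vector(alphas, labels, samples)
# Equation (2): margin width = 2 / ||w||.
margin_width = 2.0 / math.sqrt(sum(v * v for v in w))
# Ranking criterion: the score of gene i is (w_i)^2.
scores = [v * v for v in w]
# The gene with the smallest score is the next one to be eliminated.
worst_gene = min(range(len(scores)), key=scores.__getitem__)
```

In this toy case the second gene has the smaller squared weight, so it would be the one removed in the next elimination step.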
The concrete steps of the SVM-RFE algorithm are:
Input: initial gene set I = {1, 2, ..., n}; ranked gene set O = {};
Output: ranked gene set O;
Repeat steps 1.1-1.4 below until the initial gene set I is empty:
1.1) train a linear SVM on the training data set, using the genes in I as input variables;
1.2) for every gene in I, compute the ranking score $r_i = (w_i)^2$;
1.3) select the gene with the smallest ranking score: g = argmin{r_i};
1.4) update the ranked gene set O and the initial gene set I: O = O ∪ {g}, I = I - {g}, i.e. remove gene g from I and add it to O. The final output is the ranked gene set O;
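Steps 1.1-1.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the experiments use the SMO classifier of WEKA, whereas here a simple subgradient-descent solver for the hinge loss (the hypothetical helper `train_linear_svm`, with assumed values for `epochs`, `lr`, and `lam`) stands in for the linear SVM, and the data are a made-up toy set.

```python
def train_linear_svm(X, y, genes, epochs=200, lr=0.01, lam=0.01):
    """Stand-in linear SVM: subgradient descent on the L2-regularized hinge loss.

    Only the columns listed in `genes` are used. Returns {gene index: weight}.
    """
    w = {g: 0.0 for g in genes}
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(w[g] * xi[g] for g in genes) + b
            violated = yi * score < 1          # sample lies inside the margin
            for g in genes:
                grad = lam * w[g] - (yi * xi[g] if violated else 0.0)
                w[g] -= lr * grad
            if violated:
                b += lr * yi
    return w


def svm_rfe(X, y):
    """Return the ranked gene set O in order of elimination (least important first)."""
    remaining = list(range(len(X[0])))                 # initial gene set I
    ranked = []                                        # ranked gene set O
    while remaining:
        w = train_linear_svm(X, y, remaining)          # step 1.1
        scores = {g: w[g] ** 2 for g in remaining}     # step 1.2: r_i = (w_i)^2
        g = min(remaining, key=lambda k: scores[k])    # step 1.3: argmin
        remaining.remove(g)                            # step 1.4: I = I - {g}
        ranked.append(g)                               #           O = O ∪ {g}
    return ranked


# Toy data (hypothetical): gene 0 separates the classes, gene 1 is constant,
# gene 2 is only weakly related to the label.
X = [[2.0, 0.0, 0.1], [2.2, 0.0, -0.1], [1.8, 0.0, 0.2],
     [-2.0, 0.0, -0.1], [-2.1, 0.0, 0.1], [-1.9, 0.0, -0.2]]
y = [1, 1, 1, -1, -1, -1]
ranked = svm_rfe(X, y)
```

Because genes are appended to O as they are eliminated, the most informative gene ends up last in the returned list.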
Step 2: establish the candidate gene set $\Omega_k$
The SVM-RFE algorithm eliminates the "worst" gene at each step and thereby produces a ranking of genes by their "importance" for classification. In step 1 the SVM-RFE algorithm is applied to the initial gene set I to produce the ranked gene set O. This amounts to a pre-filtering step whose purpose is to remove irrelevant and noisy genes while keeping informative genes.
However, the SVM-RFE algorithm ignores interactions between genes, which is one of its shortcomings. The invention therefore selects the top n ranked genes to build candidate gene sets $\Omega_k$ with different numbers of genes, and uses the subsequent genetic algorithm (GA) to run an optimizing search over $\Omega_k$ that removes redundant genes, so that fewer tumor-related genes are returned.
When the top n ranked genes are selected as the candidate gene set, the choice of n is key to the subsequent GA optimization. If n is too small, the classifier cannot reach the highest prediction accuracy; if n is too large, the GA may become trapped in a local optimum and select more genes than necessary. Parameter n is preferably limited to the range 5 to 100, depending on the microarray gene expression data set; for example, n may be set to 10, 20, 30, and 50.
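Building the candidate gene sets for several values of n can be sketched as below. The sketch assumes that the ranked gene set is stored in elimination order (least important gene first), as steps 1.1-1.4 would produce it; the function name `candidate_gene_sets` and the placeholder ranking are illustrative only.

```python
def candidate_gene_sets(ranked, sizes=(10, 20, 30, 50)):
    """Map each n to the top-n genes, most important first.

    `ranked` is assumed to be in elimination order, so the best genes
    are at the end of the list.
    """
    best_first = list(reversed(ranked))
    return {n: best_first[:n] for n in sizes if n <= len(best_first)}


# Placeholder ranking of 100 genes: gene 0 eliminated first, gene 99 last.
ranked = list(range(100))
omega = candidate_gene_sets(ranked)
```

Each entry of `omega` is one candidate gene set that the genetic algorithm of step 3 would then search.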
Step 3: search the solution space of the candidate gene set $\Omega_k$ with a genetic algorithm
The genetic algorithm (GA) [13-15] is a global adaptive probabilistic search algorithm based on the principles of natural selection and genetics. It simulates the evolutionary mechanism of natural selection, in which the fittest survive, together with the genetic mechanisms of recombination and mutation. A GA starts from a randomly generated initial population, each population containing a number of encoded individuals.
Following the principle of survival of the fittest, each successive generation produces better approximate solutions. In each generation every individual is evaluated by the fitness function of the problem domain, and fitter individuals are retained; the genetic operations of crossover and mutation then create a new set of solutions. This process is repeated until a predetermined termination condition is met. The concrete steps of the GA in this method are:
3.1) Individual representation: each individual is encoded as an N-bit binary vector, where N is the size of the gene space; a bit value of "1" means the corresponding gene is selected and "0" means it is not;
3.2) Fitness function: each individual is evaluated by a support vector machine (SVM) classifier, for example the SMO classifier of the WEKA platform [16]; the objective is to minimize the classification error rate of the classifier;
3.3) Genetic operators: selection uses roulette wheel selection, and variation uses single-point crossover and bit flip mutation. The preferred GA parameters are: crossover probability = 1, mutation probability = 0.02, number of generations = 50, and population size = 30.
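Steps 3.1-3.3 can be sketched as a small Python genetic algorithm with roulette wheel selection, single-point crossover, and bit flip mutation, using the preferred parameters stated above. Several parts are assumptions made for illustration: `toy_fitness` is a hypothetical stand-in for the SVM classification accuracy (it simply rewards two "informative" genes and penalizes subset size), the fixed random seed is only for reproducibility, and keeping track of the best individual seen so far is an addition that the source does not describe.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0.0, sum(fitnesses))
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return individual
    return population[-1]


def single_point_crossover(a, b, rng):
    """Swap the tails of two parents at one random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def bit_flip(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def run_ga(n_genes, fitness, pop_size=30, generations=50,
           p_cross=1.0, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    # Step 3.1: each individual is an N-bit vector; 1 means the gene is selected.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        fitnesses = [fitness(ind) for ind in population]   # step 3.2
        children = []
        while len(children) < pop_size:                    # step 3.3
            a = roulette_select(population, fitnesses, rng)
            b = roulette_select(population, fitnesses, rng)
            if rng.random() < p_cross:
                a, b = single_point_crossover(a, b, rng)
            children += [bit_flip(a, p_mut, rng), bit_flip(b, p_mut, rng)]
        population = children[:pop_size]
        best = max(population + [best], key=fitness)       # remember best so far
    return best


def toy_fitness(individual):
    """Hypothetical fitness: genes 0 and 3 are 'informative'; size is penalized."""
    informative = individual[0] + individual[3]
    return 1.0 + informative - 0.05 * sum(individual)


best = run_ga(10, toy_fitness)
```

In the method itself the fitness of an individual would instead be derived from the cross-validated classification error of an SVM trained on the selected genes; roulette wheel selection requires that fitness to be non-negative.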
The classification model is validated by 10-fold cross-validation accuracy. Because the genetic algorithm is a stochastic search, 5 runs are performed on each candidate gene set $\Omega_k$, and the search is optimized to find the group of genes ("gene set") with the highest classification accuracy.
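The evaluation protocol, 10-fold cross-validation repeated over 5 runs, can be sketched in plain Python as follows. The fold assignment used here (every k-th sample, without stratification) is an assumption, since the source does not say how WEKA's folds were formed, and `fit_predict` and `run_once` are hypothetical callbacks that a real experiment would implement with a trained classifier and a GA run.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    for fold in range(k):
        test_idx = list(range(fold, n_samples, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx


def cross_val_accuracy(labels, fit_predict, k=10):
    """Fraction of samples predicted correctly across the k held-out folds.

    `fit_predict(train_idx, test_idx)` must train on the training indices
    and return one predicted label per test index.
    """
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        predicted = fit_predict(train_idx, test_idx)
        correct += sum(p == labels[i] for p, i in zip(predicted, test_idx))
    return correct / len(labels)


def mean_over_runs(run_once, n_runs=5):
    """Average a stochastic result over several runs, one seed per run."""
    return sum(run_once(seed) for seed in range(n_runs)) / n_runs
```

For a GA run, `run_once(seed)` would return either the cross-validated accuracy or the size of the selected gene subset, which gives the averaged quantities that step 4 compares.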
Step 4: determine the optimal gene set
Compare the gene sets obtained in step 3 with respect to the prediction accuracy of each model and the average gene subset size. When prediction accuracy is equal, select as the optimal parameter n (i.e. the number of top-ranked genes in step 2) the value that yields the smallest average gene subset, and run with this optimal n. The resulting gene subset, which has the fewest genes and the highest prediction accuracy, is defined as the "minimum gene set", i.e. the "optimal gene set", and the genes in this "optimal gene set" are regarded as tumor-related genes.
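The selection rule of step 4 (highest prediction accuracy first, smallest average subset size as the tie-breaker) can be written as a one-line comparison. The sketch below applies it to the averaged values that the experimental section reports in Tables 2 and 3; the function name `choose_best_n` is illustrative.

```python
def choose_best_n(results):
    """results maps n -> (mean accuracy in %, mean gene subset size).

    Prefer the highest accuracy; among equal accuracies prefer the
    smallest mean subset size.
    """
    return min(results, key=lambda n: (-results[n][0], results[n][1]))


# Averaged results as reported in Table 2 (prostate) and Table 3 (NCI60).
prostate = {10: (100, 5.4), 20: (100, 7.0), 30: (100, 8.0), 50: (100, 13.2)}
nci60 = {10: (65.8, 6), 20: (84.6, 13.8), 30: (94.1, 20), 50: (100, 28)}

best_n_prostate = choose_best_n(prostate)
best_n_nci60 = choose_best_n(nci60)
```

With these values the rule picks n = 10 for the prostate data set and n = 50 for NCI60, which matches the choices described in the experimental results.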
Among the tens of thousands of genes measured by microarray technology, there are four types of genes [10]: 1) informative genes, which are important for the molecular classification of cancer and play a significant role in tumor development; 2) redundant genes, which resemble informative genes and may be related to cancer but contribute little to molecular classification; 3) irrelevant genes, which are unrelated to cancer and do not affect cancer classification; 4) noisy genes, which have a negative effect and whose presence may reduce classification performance. The purpose of gene selection is therefore to obtain the first type, the informative genes, while removing the other three types.
The advantage of the invention is that step 1 effectively removes irrelevant and noisy genes, so that step 2 can select a small number of genes for the subsequent optimizing search. The optimizing search of step 3 then effectively removes the redundant genes that step 1 cannot remove. Step 4 finally yields the "optimal gene set" with the highest classification accuracy and the fewest genes, i.e. the tumor-related genes. The steps of the invention are simple and easy to carry out and require little computation. The method overcomes two problems: the gene set obtained by the existing SVM-RFE algorithm contains redundancy, and the existing genetic algorithm is computationally expensive, easily trapped in local optima, and cannot be applied on its own. By combining the respective advantages of SVM-RFE and the genetic algorithm, the method obtains an optimal gene set that is small, classifies with high accuracy, and is closely related to tumors, which facilitates later experimental verification.
Experimental verification
1) Data sets
The performance of the SVM-RFE/GA model was verified on one two-class and one multi-class microarray gene expression data set. Table 1 gives the basic information on these data sets.
Table 1. One two-class and one multi-class microarray gene expression data set
The prostate data set is a two-class gene expression data set containing 52 tumor samples and 50 normal prostate tissue samples; it was downloaded from http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi.
The NCI60 data set is a multi-class gene expression data set containing 9 tumor types and 60 samples; it was downloaded from http://www.broadinstitute.org/mpr/NCI60/.
2) Experimental platform
Experiments were carried out on the WEKA platform [16] (http://www.cs.waikato.ac.nz/ml/weka/). The SMO classifier performed the classification task with the polynomial kernel (PolyKernel). The penalty parameter C of the classifier was set to 100, and 10-fold cross-validation was used to evaluate the performance of the SMO classifier. The GA parameters were: crossover probability = 1, mutation probability = 0.02, number of generations = 50, and population size = 30.
Preprocessing of the experimental data comprised: removing housekeeping genes, which left 12,533 gene expression values in the prostate data set and 7,071 in the NCI60 data set; and standardizing the gene expression values to mean 0 and standard deviation 1.
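The standardization step can be sketched as below. The sketch assumes that each gene (column) is scaled independently and that the population standard deviation is used; the source does not state either detail, and the small matrix is a made-up example.

```python
import statistics


def standardize_genes(matrix):
    """Scale each gene (column) to mean 0 and standard deviation 1.

    `matrix` is a list of samples, each a list of expression values.
    Constant genes are mapped to 0.0 to avoid dividing by zero.
    """
    scaled_columns = []
    for column in zip(*matrix):
        mean = statistics.mean(column)
        std = statistics.pstdev(column)      # population standard deviation
        scaled_columns.append(
            [(v - mean) / std if std > 0 else 0.0 for v in column]
        )
    return [list(row) for row in zip(*scaled_columns)]


# Three samples, two genes (hypothetical values).
scaled = standardize_genes([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

After scaling, genes measured on very different expression scales contribute comparably to the SVM weight vector, which matters because the ranking criterion is based on squared weights.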
3) Experimental results
The ranked gene set was first generated with the SVM-RFE algorithm, with genes arranged in descending order of rank. Previous studies typically retain a subset of 50-100 genes; the present method retained the top 100 ranked genes. To test the performance of the SVM-RFE algorithm, the number of genes was reduced from 100 to 1, eliminating the gene with the lowest score at each step, and the performance of the classifier was evaluated by 10-fold cross-validation. With the initial 100 genes, the classifier achieved 100% prediction accuracy on both data sets. As shown in Fig. 1, the minimum numbers of genes with which the classifier still reached 100% accuracy were 9 for the prostate data set and 80 for the NCI60 data set. The results show that a two-class data set can reach satisfactory classification with fewer genes than a multi-class data set. On the NCI60 data set, classification accuracy fell below 90% when fewer than 36 genes were used, whereas on the prostate data set only 9 genes were needed for 100% 10-fold cross-validation accuracy.
With the SVM-RFE algorithm, the top n ranked genes were taken as the candidate gene set, with n set to 10, 20, 30, and 50 in turn. Because the genetic algorithm is a stochastic search, 5 runs were performed on each candidate gene set and the results were averaged.
On the prostate cancer data set, when the top 10 genes were retained, the genetic algorithm found the smallest gene subset while still reaching 100% classification accuracy (see Table 2). The average subset size was 5.4 genes, well below the 9 genes that the SVM-RFE method needed for the same accuracy.
Table 2. 10-fold cross-validation accuracy of the SVM-RFE/GA model on the prostate cancer data set
Top n genes    Mean prediction accuracy (%)    Mean gene subset size
10             100                             5.4
20             100                             7.0
30             100                             8.0
50             100                             13.2
On the NCI60 data set, when the top 50 genes were retained, the genetic algorithm found a minimal gene subset that achieved 100% classification accuracy (see Table 3). The average subset size of 28 genes is far smaller than the 80 genes that the SVM-RFE method needed for the same accuracy.
Table 3. 10-fold cross-validation accuracy of the SVM-RFE/GA model on the NCI60 data set
Top n genes    Mean prediction accuracy (%)    Mean gene subset size
10             65.8                            6
20             84.6                            13.8
30             94.1                            20
50             100                             28
As noted above, the choice of n is a key issue for the GA. If n is too small, the classifier cannot reach the highest prediction accuracy; if n is too large, the GA may become trapped in a local optimum and select more genes than necessary.
The gene subset that has the fewest genes and the highest prediction accuracy is defined as the "optimal gene set". On the prostate cancer data set, searching the top 10 ranked genes gave a gene subset with the fewest genes (n = 5) and 100% prediction accuracy (see Table 4). On the NCI60 cancer data set, searching the top 50 ranked genes gave a gene subset with the fewest genes (n = 26) and 100% prediction accuracy (see Table 5).
Table 4. Optimal gene set obtained on the prostate cancer data set
Table 5. Optimal gene set obtained on the NCI60 data set
The results of the SVM-RFE/GA model were compared with other algorithms in terms of prediction accuracy and number of selected genes. On the prostate cancer data set (see Table 6), only the SVM-RFE/GA model and the SVM-RFE algorithm reached 100% prediction accuracy, and SVM-RFE/GA selected fewer genes. On the NCI60 data set (see Table 7), the advantage of SVM-RFE/GA was more pronounced: at the same 100% prediction accuracy, it used far fewer genes (n = 26) than the SVM-RFE algorithm (n = 80).
Table 6. Comparison of the SVM-RFE/GA algorithm with other algorithms on the prostate cancer data set
Table 7. Comparison of the SVM-RFE/GA algorithm with other algorithms on the NCI60 data set
Gene selection has long been an important research topic in microarray data analysis. Gene selection methods aim to eliminate noisy, irrelevant, and redundant genes, which both reduces the computational burden of the classifier and improves its classification accuracy. In addition, an informative gene subset with fewer genes is easier to verify in subsequent molecular biology experiments.
In summary, the invention proposes a model that combines the GA with the SVM-RFE algorithm and thereby the respective advantages of embedded and wrapper methods. The method was verified on one two-class and one multi-class microarray gene expression data set. The results show that, compared with other algorithms, the proposed feature gene selection method reaches the highest classification accuracy with fewer informative genes. Of the genes in the optimal gene sets obtained in this experiment (Tables 4 and 5), some have been reported in the literature to be closely related to tumor occurrence and development; the remaining genes can be verified further by later molecular biology experiments, with a view to finding new tumor gene markers.
References:
[1]Chin L,Andersen JN,Futreal PA(2011).Cancer genomics:from discovery science to personalized medicine.Nat Med 17(3):297-303.
[2]Ong FS,Das K,Wang J,Vakil H,Kuo JZ,Blackwell WL,Lim SW,Goodarzi MO, Bernstein KE,Rotter JI,Grody WW(2012).Personalized medicine and pharmacogenetic biomarkers:progress in molecular oncology testing.Expert Rev Mol Diagn 12(6):593-602.
[3]Golub TR,Slonim DK,Tamayo P,Huard C,Gaasenbeek M,Mesirov JP,Coller H, Loh ML,Downing JR,Caligiuri MA,Bloomfield CD,Lander ES (1999).Molecular classification of cancer:class discovery and class prediction by gene expression monitoring.Science 286(5439):531-7.
[4]Saeys Y,Inza I,Larranaga P(2007).A review of feature selection techniques in bioinformatics.Bioinformatics 23(19):2507-17.
[5]Li X,Peng S,Chen J,Lu B,Zhang H,Lai M(2012).SVM-T-RFE:A novel gene selection algorithm for identifying metastasis-related genes in colorectal cancer using gene expression profiles.Biochemical and Biophysical Research Communications 419(2):148-53.
[6]Guyon I,Weston J,Barnhill S,Vapnik V(2002).Gene selection for cancer classification using support vector machines.Machine Learning 46(1-3):389-422.
[7]Duan KB,Rajapakse JC,Wang HY,Azuaje F(2005).Multiple SVM-RFE for gene selection in cancer classification with expression data.IEEE Transactions on Nanobioscience 4(3):228-34.
[8]Zhang XG,Lu X,Shi Q,Xu XQ,Leung HCE,Harris LN,D Iglehart J,Miron A,Liu JS,Wong WH(2006).Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data.BMC Bioinformatics 7:-.
[9]Zhou X,Tuck DP(2007).MSVM-RFE:extensions of SVM-RFE for multiclass gene selection on DNA microarray data(vol 23,pg 1106,2007).Bioinformatics 23 (15):2029-.
[10]Tang YC,Zhang YQ,Huang Z(2007).Development of two-stage SVM-RFE gene selection strategy for microarray expression data analysis.IEEE/ACM Transactions on Computational Biology and Bioinformatics 4(3):365-81.
[11]Mundra PA,Rajapakse JC(2010).SVM-RFE With MRMR Filter for Gene Selection.IEEE Transactions on Nanobioscience 9(1):31-7.
[12]Le Cun Y,Denker J,Solla S,Touretzky DS.Optimal brain damage.Advances in Neural Information Processing Systems:Morgan Kaufmann;1990.p.598-605.
[13]Tan F,Fu X,Zhang Y,Bourgeois A(2008).A genetic algorithm-based method for feature subset selection.Soft Computing 12(2):111-20.
[14]Nicoletta D,Barbara P(2009).An evolutionary method for combining different feature selection criteria in microarray data classification.2009: 1-10.
[15]Cannas L,Dessi N,Pes B.A Hybrid Model to Favor the Selection of High Quality Features in High Dimensional Domains.Intelligent Data Engineering and Automated Learning-IDEAL 2011:Springer Berlin Heidelberg;2011.p.228-35.
[16]Mark H,Eibe F,Geoffrey H,Bernhard P,Peter R,Ian HW(2009).The WEKA data mining software:an update.SIGKDD Explor Newsl 11(1):10-8.
[17]Singh D,Febbo PG,Ross K,Jackson DG,Manola J,Ladd C,Tamayo P,Renshaw AA,D'Amico AV,Richie JP,Lander ES,Loda M,Kantoff PW,Golub TR,Sellers WR (2002).Gene expression correlates of clinical prostate cancer behavior.Cancer Cell 1(2):203-9.
[18]Staunton JE,Slonim DK,Coller HA,Tamayo P,Angelo MJ,Park J,Scherf U, Lee JK,Reinhold WO,Weinstein JN,Mesirov JP,Lander ES,Golub TR(2001) .Chemosensitivity prediction by transcriptional profiling.Proc Natl Acad Sci U S A 98(19):10787-92.
[19]Tan AC,Naiman DQ,Xu L,Winslow RL,Geman D(2005).Simple decision rules for classifying human cancers from gene expression profiles.Bioinformatics 21 (20):3896-904.
[20]Peng SH,Xu QH,Ling XB,Peng XN,Du W,Chen LB(2003).Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines.FEBS Letters 555(2):358-62.
[21]Ooi CH,Tan P(2003).Genetic algorithms applied to multi-class prediction for the analysis of gene expression data.Bioinformatics 19(1):37-44.

Claims (3)

1. a method for mixing and optimizing for tumor-related gene search, is characterized in that, specifically implements according to following steps:
Step 1, support vector machine recursive feature elimination algorithm is utilized to obtain " ranked genes collection "
For a linear SVM classifier, there is an optimum hyperplane, its class interval is defined as:
w = Σ i = 1 n α i c i x i , - - - ( 1 )
Margin width=2/ | | w | |, (2)
Wherein, w is the vertical vector of optimal hyperlane;
xiIt is the sample i gene expression vector in training set, i=1,2 ..., k, k are the numbers supporting vector;
ci∈ [-1 ,+1] is the class label of sample i;
Weight αiThen it is calculated from training set, the weight α of most training vectorsiIt is zero, if the weight of this sample training vector αiFor nonzero value, then for supporting vector, margin width refers to class interval;
SVM-RFE uses backward elimination, repeatedly deleting the gene that contributes least to the SVM classifier. The objective function J of SVM-RFE is defined as:
$$ J = \tfrac{1}{2} \lVert w \rVert^2, \qquad (3) $$
J is expanded as a Taylor series to second order, and the Optimal Brain Damage algorithm is used to approximate the change in J caused by removing each gene:
$$ \Delta J(i) = \frac{\partial J}{\partial w_i} \Delta w_i + \frac{\partial^2 J}{\partial w_i^2} (\Delta w_i)^2, \qquad (4) $$
At the optimum of J, the first-order term can be neglected, so the expansion reduces to:
$$ \Delta J(i) = (\Delta w_i)^2, \qquad (5) $$
Since removing the i-th feature from the classifier corresponds to the weight change $\Delta w_i = w_i$, $(w_i)^2$ is used as the ranking criterion of SVM-RFE; in each iteration the feature with the smallest value of $(w_i)^2$ is eliminated;
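The step from equation (4) to equation (5) can be written out under the claim's own definitions. This is only a worked sketch: it substitutes J from equation (3) into equation (4) as stated in the claim, and any constant factor is immaterial because it does not change which gene has the smallest score.

```latex
\begin{aligned}
J &= \tfrac{1}{2}\lVert w \rVert^2 = \tfrac{1}{2}\sum_j w_j^2
  &&\Rightarrow\quad
  \frac{\partial J}{\partial w_i} = w_i,\qquad
  \frac{\partial^2 J}{\partial w_i^2} = 1,\\
\Delta J(i) &= \frac{\partial J}{\partial w_i}\,\Delta w_i
  + \frac{\partial^2 J}{\partial w_i^2}\,(\Delta w_i)^2
  &&\text{(equation 4)}\\
 &\approx (\Delta w_i)^2 = (w_i)^2
  &&\text{(first-order term neglected, } \Delta w_i = w_i\text{)}.
\end{aligned}
```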
Step 2: establish the candidate gene set Ω_k
The top n ranked genes are selected as the candidate gene set; the value of the parameter n depends on the particular microarray gene expression dataset;
Step 3: search the solution space of the candidate gene set Ω_k with a genetic algorithm
Based on the principle of survival of the fittest, each successive generation produces better approximate solutions. In each generation, every individual is evaluated by the fitness function of the problem domain, and the fitter individuals are retained; the genetic operations of crossover and mutation then create a new set of solutions. This process is repeated until a predetermined termination condition is met;
Step 4: determine the optimal gene set
The gene sets obtained in step 3 are compared with respect to the prediction accuracy of each model and the average gene subset size. Where the prediction accuracy is the same, the value of n that gives the smallest average gene subset size is chosen as the optimal parameter n. The method is then run with this optimal n to obtain the gene subset with the fewest genes and the highest prediction accuracy, i.e. the "optimal gene set"; the genes in this "optimal gene set" are regarded as tumor-related genes.
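The selection rule of step 4 (highest prediction accuracy first, smallest average gene subset size as the tie-breaker) can be sketched as follows. The function name `select_best_n` and the numbers in `results` are made up for illustration and are not results from the patent.

```python
def select_best_n(results):
    """Pick the best run from (n, accuracy, avg_subset_size) tuples.

    Highest accuracy wins; among equal accuracies the smallest
    average gene subset size wins.
    """
    return min(results, key=lambda r: (-r[1], r[2]))


# Hypothetical outcomes of running the GA for several candidate-set sizes n.
results = [
    (50, 0.95, 12.4),
    (100, 0.97, 9.8),
    (150, 0.97, 7.2),
    (200, 0.96, 6.0),
]

best_n, best_accuracy, best_size = select_best_n(results)
```

With these invented numbers, n = 100 and n = 150 tie on accuracy, so the rule picks n = 150 because its average subset is smaller.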
2. The hybrid optimization method for tumor-related gene search according to claim 1, characterized in that, in said step 1, the specific steps of the SVM-RFE algorithm are:
Input the initial gene set I = {1, 2, ..., n} and the ranked gene set O = {};
Repeat the following steps 1.1-1.4 until the initial gene set I is empty:
1.1) train a linear SVM on the training dataset, using the initial gene set I as the input variables;
1.2) for every gene in the initial gene set I, compute its score using the ranking criterion r_i = (w_i)^2;
1.3) select the gene with the smallest ranking score: g = arg min{r_i};
1.4) update the ranked gene set O and the initial gene set I: O = O ∪ {g}, I = I − {g}, i.e. gene g is removed from the initial gene set I and added to the ranked gene set O; the final output is the ranked gene set O.
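Steps 1.1-1.4 can be sketched in plain Python as below. This is an illustration under stated assumptions: `train_linear_svm` is a stand-in trainer (batch subgradient descent on the regularised hinge loss, without a bias term), not the SVM solver used in the patent, and the toy data and all function names are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=200):
    """Stand-in linear SVM: subgradient descent on the L2-regularised hinge loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi))
            if yi * score < 1.0:  # this sample violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w


def svm_rfe(X, y):
    """Return gene indices ordered from most to least important."""
    remaining = list(range(len(X[0])))  # initial gene set I
    eliminated = []                     # ranked gene set O, in elimination order
    while remaining:
        X_sub = [[row[g] for g in remaining] for row in X]
        w = train_linear_svm(X_sub, y)                               # step 1.1
        scores = [wj * wj for wj in w]                               # step 1.2
        worst = min(range(len(remaining)), key=scores.__getitem__)  # step 1.3
        eliminated.append(remaining.pop(worst))                      # step 1.4
    return eliminated[::-1]  # last gene eliminated is ranked first


# Toy expression matrix (hypothetical): gene 0 separates the classes strongly,
# gene 1 weakly, gene 2 hardly at all.
X = [[2.0, 0.5, 0.1], [1.8, 0.4, -0.1], [-2.0, -0.5, 0.1], [-1.9, -0.3, -0.1]]
y = [1, 1, -1, -1]
ranking = svm_rfe(X, y)
```

The claim appends genes to O in elimination order, so the most important genes are added last; the sketch reverses that list so the top-ranked gene comes first, which makes taking the top n genes in step 2 a simple slice.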
3. The hybrid optimization method for tumor-related gene search according to claim 1, characterized in that, in said step 3, the specific steps of the GA algorithm are:
3.1) individual representation: each individual is encoded as an N-bit binary vector, where N is the size of the genetic space; a bit value of "1" means the corresponding gene is selected, and a bit value of "0" means it is not selected;
3.2) setting the fitness function: each individual is evaluated by a support vector machine classifier, for example the SMO classifier of the WEKA platform; the objective is to minimize the classification error rate of the classifier;
3.3) setting the genetic operators: the genetic operations are carried out by roulette-wheel selection, single-point crossover and bit-flip mutation.
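The encoding and operators of steps 3.1-3.3 can be sketched with the standard library alone. Several things here are assumptions for illustration: `toy_fitness` is a hypothetical stand-in for the SVM-based fitness (the claim evaluates individuals with a classifier such as WEKA's SMO), and the population size, rates, seed and all function names are invented defaults.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Roulette-wheel selection: pick with probability proportional to fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    pick = rng.random() * total
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running > pick:
            return individual
    return population[-1]


def single_point_crossover(a, b, rng):
    """Swap the tails of two parents at one random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def bit_flip_mutation(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def run_ga(n_genes, fitness, pop_size=20, generations=30,
           cx_rate=0.8, mut_rate=0.05, seed=0):
    """Return the best individual seen among evaluated generations and its fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for _ in range(generations):
        fits = [fitness(ind) for ind in pop]  # fitness must be non-negative
        for ind, fit in zip(pop, fits):
            if fit > best_fit:
                best, best_fit = ind[:], fit
        nxt = []
        while len(nxt) < pop_size:
            a = roulette_select(pop, fits, rng)
            b = roulette_select(pop, fits, rng)
            if rng.random() < cx_rate:
                a, b = single_point_crossover(a, b, rng)
            nxt.append(bit_flip_mutation(a, mut_rate, rng))
            nxt.append(bit_flip_mutation(b, mut_rate, rng))
        pop = nxt[:pop_size]
    return best, best_fit


def toy_fitness(individual):
    """Hypothetical fitness: reward selecting gene 0, lightly penalise subset size."""
    return (1.0 if individual[0] else 0.2) - 0.01 * sum(individual)


best, best_fit = run_ga(6, toy_fitness)
```

In a real run the fitness would be one minus the cross-validated classification error of the classifier trained on the genes whose bits are 1. The sketch records the best individual seen rather than using elitism, since the claim does not mention elitism.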
CN201610555700.6A 2016-07-12 2016-07-12 A kind of method for mixing and optimizing of tumor-related gene search Pending CN106228034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610555700.6A CN106228034A (en) 2016-07-12 2016-07-12 A kind of method for mixing and optimizing of tumor-related gene search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610555700.6A CN106228034A (en) 2016-07-12 2016-07-12 A kind of method for mixing and optimizing of tumor-related gene search

Publications (1)

Publication Number Publication Date
CN106228034A true CN106228034A (en) 2016-12-14

Family

ID=57520292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610555700.6A Pending CN106228034A (en) 2016-07-12 2016-07-12 A kind of method for mixing and optimizing of tumor-related gene search

Country Status (1)

Country Link
CN (1) CN106228034A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709267A (en) * 2017-01-25 2017-05-24 武汉贝纳科技服务有限公司 Data acquisition method and device
CN108615555A (en) * 2018-04-26 2018-10-02 山东师范大学 Colorectal cancer prediction technique and device based on marker gene and mixed kernel function SVM
CN112729411A (en) * 2021-01-14 2021-04-30 金陵科技学院 Distributed drug warehouse environment monitoring method based on GA-RNN
CN113901999A (en) * 2021-09-29 2022-01-07 国网四川省电力公司电力科学研究院 Fault diagnosis method and system for high-voltage shunt reactor

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090175531A1 (en) * 2004-11-19 2009-07-09 Koninklijke Philips Electronics, N.V. System and method for false positive reduction in computer-aided detection (cad) using a support vector macnine (svm)
CN102170130A (en) * 2011-04-26 2011-08-31 华北电力大学 Short-term wind power prediction method
CN102272764A (en) * 2009-01-06 2011-12-07 皇家飞利浦电子股份有限公司 Evolutionary clustering algorithm
CN103186717A (en) * 2013-01-18 2013-07-03 中国科学院合肥物质科学研究院 Heuristic breadth-first searching method for cancer-related genes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOBO LI: "Gene selection for cancer classification using the combination of SVM-RFE and GA", 《COMPUTER MODELLING & NEW TECHNOLOGIES》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709267A (en) * 2017-01-25 2017-05-24 武汉贝纳科技服务有限公司 Data acquisition method and device
CN108615555A (en) * 2018-04-26 2018-10-02 山东师范大学 Colorectal cancer prediction technique and device based on marker gene and mixed kernel function SVM
CN112729411A (en) * 2021-01-14 2021-04-30 金陵科技学院 Distributed drug warehouse environment monitoring method based on GA-RNN
CN112729411B (en) * 2021-01-14 2022-09-13 金陵科技学院 Distributed drug warehouse environment monitoring method based on GA-RNN
CN113901999A (en) * 2021-09-29 2022-01-07 国网四川省电力公司电力科学研究院 Fault diagnosis method and system for high-voltage shunt reactor
CN113901999B (en) * 2021-09-29 2023-09-29 国网四川省电力公司电力科学研究院 Fault diagnosis method and system for high-voltage shunt reactor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20161214)