CN106228034A - A hybrid optimization method for tumor-related gene search - Google Patents
- Publication number
- CN106228034A (application number CN201610555700.6A)
- Authority
- CN
- China
- Prior art keywords
- gene
- collection
- svm
- rfe
- algorithm
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Abstract
The invention discloses a hybrid optimization method for tumor-related gene search, comprising the following steps: Step 1, use the support vector machine recursive feature elimination algorithm to obtain a "ranked gene set"; Step 2, build the candidate gene sets Ω_k; Step 3, search the solution space of each candidate gene set Ω_k with a genetic algorithm; Step 4, determine the optimal gene set, whose genes are regarded as tumor-related genes. The method requires little computation, its feasibility and effectiveness have been verified, and it markedly improves working efficiency and prediction accuracy.
Description
Technical field
The invention belongs to the technical field of gene search and relates to a hybrid optimization method for tumor-related gene search.
Background
Recent advances in cancer genomics research offer opportunities for personalized cancer medicine [1]. A tumor is a highly heterogeneous, systemic and complex disease, and this remains a significant obstacle to accurate cancer diagnosis and treatment. Tumor patients differ in their pathogenic pathways, so treating a whole class of tumors with the same type of therapy easily leads to overtreatment or ineffective treatment. A typical example is the anticancer drug Herceptin, an antibody that interferes with the human epidermal growth factor receptor (HER2) and is effective only in patients who overexpress HER2 [2]. Personalized tumor therapy therefore highlights the need for molecular classification of tumors, which requires reliable tumor biomarkers for predicting tumor subtypes.
Many high-throughput techniques, including microarrays, can monitor the expression values of thousands of genes simultaneously and have therefore been applied successfully to molecular tumor classification and tumor biomarker identification [3]. However, microarray data usually have a small sample size (fewer than 100) and a very large number of genes (usually more than 10000). The key problem to be solved is how to select a small group of genes from these thousands of genes and then use them to classify tumor samples accurately [4,5].
Summary of the invention
The object of the invention is to provide a hybrid optimization method for tumor-related gene search that solves the problems of the prior art: heavy computation, low accuracy, and high cost in time and effort.
The technical solution adopted by the invention is a hybrid optimization method for tumor-related gene search, implemented according to the following steps:
Step 1: use the support vector machine recursive feature elimination algorithm to obtain a "ranked gene set"
For a linear SVM classifier there exists an optimal hyperplane whose margin is defined as:
$$\text{margin width} = \frac{2}{\|w\|}, \quad (2)$$
where w is the normal vector of the optimal hyperplane:
$$w = \sum_{i=1}^{k} \alpha_i c_i x_i, \quad (1)$$
x_i is the gene expression vector of sample i in the training set, i = 1, 2, ..., k, and k is the number of support vectors;
c_i ∈ {-1, +1} is the class label of sample i;
the weights α_i are computed from the training set; α_i is zero for most training vectors, and a training vector whose weight α_i is non-zero is a support vector;
SVM-RFE uses backward elimination, repeatedly deleting the gene that contributes least to the SVM classifier. The objective function J of SVM-RFE is defined as:
$$J = \frac{1}{2}\|w\|^2, \quad (3)$$
Following the Optimal Brain Damage algorithm, the change in J caused by removing each gene is approximated by expanding J in a Taylor series to second order:
$$\Delta J(i) = \frac{\partial J}{\partial w_i}\,\Delta w_i + \frac{\partial^2 J}{\partial w_i^2}\,(\Delta w_i)^2, \quad (4)$$
At the optimum of J the first-order term is neglected, and the second-order term becomes:
$$\Delta J(i) = (\Delta w_i)^2, \quad (5)$$
Since Δw_i = w_i is the weight change associated with removing the i-th feature from the classifier, (w_i)^2 is used as the ranking criterion of SVM-RFE; at each iteration the feature with the smallest value of (w_i)^2 is eliminated;
Step 2: build the candidate gene sets Ω_k
The top n ranked genes are selected as the candidate gene set; the value of the parameter n depends on the microarray gene expression data set;
Step 3: search the solution space of each candidate gene set Ω_k with a genetic algorithm
Following the principle of survival of the fittest, each generation produces better approximate solutions. In each generation, every individual is evaluated by the fitness function of the problem domain and the fitter individuals are retained; the genetic operations of crossover and mutation are then applied to create a new set of solutions. This process is repeated until a predetermined termination condition is met;
Step 4: determine the optimal gene set
The gene sets obtained in step 3 are compared in terms of the prediction accuracy of each model and the average gene subset size. Where the prediction accuracy is equal, the value of n that gives the smallest average gene subset is chosen as the optimal parameter n. Running the method with this optimal n yields the gene subset with the fewest genes and the highest prediction accuracy, that is, the "optimal gene set"; the genes in the "optimal gene set" are regarded as tumor-related genes.
The beneficial effect of the invention is that the method combines the respective advantages of the genetic algorithm (GA) and the support vector machine recursive feature elimination algorithm (SVM-RFE) [5-11]; its feasibility and effectiveness have been verified, and working efficiency and prediction accuracy are markedly improved.
Brief description of the drawings
Fig. 1 shows the 10-fold cross-validation prediction accuracy of the classifier for the prostate and NCI60 data sets as the number of genes is reduced from 100 to 1 with the method of the invention.
Detailed description
The method of the invention (hereinafter SVM-RFE/GA) is implemented according to the following steps:
Step 1: use the support vector machine recursive feature elimination algorithm (SVM-RFE) to obtain a "ranked gene set"
The support vector machine (SVM) is an effective method for sparse classification problems such as the classification of microarray gene expression data. For a linear SVM classifier there exists an optimal hyperplane whose margin is defined as:
$$\text{margin width} = \frac{2}{\|w\|}, \quad (2)$$
where w is the normal vector of the optimal hyperplane:
$$w = \sum_{i=1}^{k} \alpha_i c_i x_i, \quad (1)$$
x_i is the gene expression vector of sample i in the training set, i = 1, 2, ..., k, and k is the number of support vectors;
c_i ∈ {-1, +1} is the class label of sample i;
the weights α_i are computed from the training set; α_i is zero for most training vectors, and a training vector whose weight α_i is non-zero is a support vector.
SVM-RFE is an embedded feature gene selection method [4]. It uses backward elimination, repeatedly deleting the gene that contributes least to the SVM classifier. The objective function J of SVM-RFE is defined as:
$$J = \frac{1}{2}\|w\|^2, \quad (3)$$
Following the Optimal Brain Damage (OBD) algorithm [12], the change in J caused by removing each gene is approximated by expanding J in a Taylor series to second order:
$$\Delta J(i) = \frac{\partial J}{\partial w_i}\,\Delta w_i + \frac{\partial^2 J}{\partial w_i^2}\,(\Delta w_i)^2, \quad (4)$$
At the optimum of J the first-order term is neglected, and the second-order term becomes:
$$\Delta J(i) = (\Delta w_i)^2, \quad (5)$$
Since Δw_i = w_i is the weight change associated with removing the i-th feature from the classifier, (w_i)^2 is used as the ranking criterion of SVM-RFE. At each iteration the feature with the smallest value of (w_i)^2 is eliminated, because it has the least effect on the classifier.
The specific steps of the SVM-RFE algorithm are:
Input: the initial gene set I = {1, 2, ..., n} and the ranked gene set O = {};
Output: the ranked gene set O;
Repeat steps 1.1-1.4 below until the initial gene set I is empty:
1.1) take the initial gene set I as the input variables and train a linear SVM on the training data set;
1.2) for every gene in the initial gene set I, compute its score using the ranking criterion r_i = (w_i)^2;
1.3) select the gene with the smallest ranking score: g = argmin{r_i};
1.4) update the ranked gene set O and the initial gene set I: O = O ∪ g, I = I - g, that is, remove gene g from the initial gene set I and add it to the ranked gene set O. The final output is the ranked gene set O.
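Steps 1.1-1.4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: `train_linear_svm` is a hypothetical stand-in that fits a linear SVM by full-batch subgradient descent on the hinge loss (the experiments used WEKA's SMO classifier), and the names `svm_rfe`, `lam` and `epochs` as well as the toy data are introduced here for illustration only.

```python
def train_linear_svm(X, y, features, lam=0.01, epochs=200):
    """Hypothetical stand-in for a real SVM solver (e.g. WEKA's SMO).

    Full-batch subgradient descent on the L2-regularised hinge loss,
    restricted to the gene indices in `features`. Returns {gene: weight}.
    """
    w = {j: 0.0 for j in features}
    b = 0.0
    n = len(X)
    for t in range(1, epochs + 1):
        eta = 1.0 / (lam * t)                        # decaying step size
        grad_w = {j: lam * w[j] for j in features}   # gradient of the L2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(w[j] * xi[j] for j in features) + b)
            if margin < 1:                           # sample violates the margin
                for j in features:
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        for j in features:
            w[j] -= eta * grad_w[j]
        b -= eta * grad_b
    return w


def svm_rfe(X, y):
    """Return gene indices in elimination order (least important first)."""
    remaining = list(range(len(X[0])))    # initial gene set I
    ranked = []                           # ranked gene set O
    while remaining:
        w = train_linear_svm(X, y, remaining)        # step 1.1
        scores = {j: w[j] ** 2 for j in remaining}   # step 1.2: r_i = (w_i)^2
        g = min(remaining, key=scores.get)           # step 1.3: g = argmin r_i
        remaining.remove(g)                          # step 1.4: I = I - g
        ranked.append(g)                             #           O = O ∪ g
    return ranked


# Toy data: gene 0 separates the classes strongly, gene 1 weakly, gene 2 not at all.
X = [[2.0, 0.5, 0.0], [2.0, 0.5, 0.0], [-2.0, -0.5, 0.0], [-2.0, -0.5, 0.0]]
y = [1, 1, -1, -1]
print(svm_rfe(X, y))  # prints [2, 1, 0]
```

Retraining after every single elimination is the costly part of the loop; the sketch keeps it because that is what steps 1.1-1.4 describe.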
Step 2: build the candidate gene sets Ω_k
At each step the SVM-RFE algorithm eliminates the "worst" gene, producing a gene ranking according to each gene's "importance" for classification. The SVM-RFE algorithm of step 1 is applied to the initial gene set I to produce the ranked gene set O. This amounts to a pre-filtering process whose purpose is to remove irrelevant and noisy genes while retaining informative genes.
However, the SVM-RFE algorithm ignores the interactions between genes, which is one of its shortcomings. The invention therefore selects the top n ranked genes to build candidate gene sets Ω_k of different sizes, and uses the subsequent genetic algorithm (GA) to search each Ω_k, removing some redundant genes so that fewer tumor-related genes are returned.
When the top n ranked genes are taken as the candidate gene set, the choice of n is the key issue for the subsequent GA optimization. When n is too small, the classifier cannot reach the highest prediction accuracy; conversely, when n is too large, the GA may become trapped in a local optimum and select more genes than necessary. The parameter n is preferably between 5 and 100, depending on the microarray gene expression data set; for example, n may be set to 10, 20, 30 and 50.
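Building the candidate gene sets is a simple slicing operation once the ranking exists. The sketch below assumes the ranked gene set O lists genes in elimination order (least important first), so the top n genes are the last n eliminated; the function name `candidate_sets` and the 60-gene ranking are hypothetical.

```python
def candidate_sets(elimination_order, sizes=(10, 20, 30, 50)):
    """Map each n to the candidate gene set made of the top-n ranked genes.

    `elimination_order` is assumed to list gene indices least important
    first, so it is reversed to obtain a best-first ranking.
    """
    best_first = list(reversed(elimination_order))
    # Skip any n larger than the number of ranked genes.
    return {n: best_first[:n] for n in sizes if n <= len(best_first)}


# Hypothetical ranking of 60 genes: gene 0 was eliminated first, gene 59 last.
ranking = list(range(60))
omega = candidate_sets(ranking)
print(sorted(omega))     # prints [10, 20, 30, 50]
print(omega[10])         # the ten genes eliminated last, best first
```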
Step 3: search the solution space of each candidate gene set Ω_k with a genetic algorithm
The genetic algorithm (GA) [13-15] is a global, adaptive, probabilistic search algorithm based on the principles of natural selection and genetics. It simulates the biological mechanisms of survival of the fittest and natural selection, together with the genetic mechanisms of recombination and mutation. A GA starts from a randomly generated initial population, and each population contains a number of encoded individuals.
Following the principle of survival of the fittest, each generation produces better approximate solutions. In each generation, every individual is evaluated by the fitness function of the problem domain and the fitter individuals are retained; the genetic operations of crossover and mutation are then applied to create a new set of solutions. This process is repeated until a predetermined termination condition is met. The specific steps of the GA in this method are:
3.1) Individual representation: each individual is encoded as an N-bit binary vector, where N is the size of the gene space; a bit value of "1" means the corresponding gene is selected, and a bit value of "0" means it is not selected;
3.2) Fitness function: each individual is evaluated by a support vector machine (SVM) classifier, for example the SMO classifier of the WEKA platform [16]; the objective is to minimize the classification error rate of the classifier;
3.3) Genetic operators: selection uses roulette wheel selection, and new individuals are produced by single-point crossover and bit-flip mutation. The preferred GA parameters are: crossover probability = 1, mutation probability = 0.02, number of generations = 50, and population size = 30.
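Steps 3.1-3.3 can be sketched with the Python standard library as below. The operators and default parameters follow the text (roulette wheel selection, single-point crossover, bit-flip mutation, crossover probability 1, mutation probability 0.02, 50 generations, population 30). Everything else is an assumption made for illustration: the function names, the fixed random seed, keeping track of the best individual seen, and above all `toy_fitness`, a made-up score that stands in for the SVM classification accuracy used in the method.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    weights = [f + 1e-9 for f in fitnesses]   # avoid an all-zero wheel
    return rng.choices(population, weights=weights, k=1)[0]


def single_point_crossover(a, b, rng):
    """Swap the tails of two bit vectors at a random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def bit_flip_mutation(bits, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return [1 - bit if rng.random() < p_mut else bit for bit in bits]


def run_ga(fitness, n_bits, pop_size=30, generations=50,
           p_cross=1.0, p_mut=0.02, seed=0):
    """Return the fittest bit vector evaluated during the run and its fitness.

    `fitness` must return a non-negative number (higher is better).
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for _ in range(generations):
        fits = [fitness(ind) for ind in population]
        for ind, fit in zip(population, fits):
            if fit > best_fit:                 # remember the best individual
                best, best_fit = ind[:], fit
        children = []
        while len(children) < pop_size:
            a = roulette_select(population, fits, rng)
            b = roulette_select(population, fits, rng)
            if rng.random() < p_cross:
                a, b = single_point_crossover(a, b, rng)
            children.append(bit_flip_mutation(a, p_mut, rng))
            children.append(bit_flip_mutation(b, p_mut, rng))
        population = children[:pop_size]
    return best, best_fit


def toy_fitness(bits, informative=(0, 3)):
    """Made-up stand-in for classifier accuracy: reward two 'informative'
    genes and charge a small penalty for every selected gene."""
    hits = sum(bits[j] for j in informative) / len(informative)
    return max(0.0, hits - 0.01 * sum(bits))


best, fit = run_ga(toy_fitness, n_bits=10)
print(best, fit)
```

In the method itself the fitness call would train and cross-validate a classifier on the genes whose bits are 1, which is why keeping N small in step 2 matters for running time.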
The classification model is validated by its 10-fold cross-validation accuracy. Because the genetic algorithm is a stochastic search method, 5 runs are performed on each candidate gene set Ω_k, and the search finds the "gene set" with the highest classification accuracy.
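The 10-fold cross-validation accuracy used to score a gene set can be sketched as follows. This is a minimal illustration rather than WEKA's procedure: the folds are formed by a seeded shuffle without stratification, and `sign_classifier` is a hypothetical placeholder for the SVM that ignores its training data. All function names are introduced here.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split sample indices into k folds of nearly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit_predict, k=10, seed=0):
    """Fraction of samples predicted correctly when each fold is held out once.

    `fit_predict(train_X, train_y, test_X)` must return predicted labels.
    """
    correct = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predictions = fit_predict([X[i] for i in train_idx],
                                  [y[i] for i in train_idx],
                                  [X[i] for i in test_idx])
        correct += sum(p == y[i] for p, i in zip(predictions, test_idx))
    return correct / len(X)


def sign_classifier(train_X, train_y, test_X):
    """Placeholder model: predicts the sign of the first feature."""
    return [1 if x[0] > 0 else -1 for x in test_X]


# Toy one-gene data set in which the sign of the value is the class label.
X = [[float(v)] for v in range(-10, 11) if v != 0]
y = [1 if x[0] > 0 else -1 for x in X]
print(cross_val_accuracy(X, y, sign_classifier))  # prints 1.0
```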
Step 4: determine the optimal gene set
The gene sets obtained in step 3 are compared in terms of the prediction accuracy of each model and the average gene subset size. Where the prediction accuracy is equal, the value of n (the number of top-ranked genes taken in step 2) that gives the smallest average gene subset is chosen as the optimal parameter n. Running the method with this optimal n yields the gene subset with the fewest genes and the highest prediction accuracy, which is defined as the "minimal gene set", that is, the "optimal gene set"; the genes in the "optimal gene set" are regarded as tumor-related genes.
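The selection rule of step 4 amounts to a two-key comparison: highest mean accuracy first, smallest mean subset size as the tie-breaker. The sketch below applies it to the values reported in Tables 2 and 3 of this document; the function name `pick_best_n` and the dictionary layout are assumptions made for illustration.

```python
def pick_best_n(results):
    """Choose n with the highest mean accuracy, breaking ties by the
    smallest mean gene subset size.

    `results` maps n -> (mean accuracy in percent, mean subset size).
    """
    return min(results, key=lambda n: (-results[n][0], results[n][1]))


# Values taken from Table 2 (prostate) and Table 3 (NCI60).
prostate = {10: (100, 5.4), 20: (100, 7.0), 30: (100, 8.0), 50: (100, 13.2)}
nci60 = {10: (65.8, 6), 20: (84.6, 13.8), 30: (94.1, 20), 50: (100, 28)}

print(pick_best_n(prostate))  # prints 10
print(pick_best_n(nci60))     # prints 50
```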
Among the tens of thousands of genes measured with microarray technology there are four types of genes [10]: 1) informative genes, which are important for the molecular classification of cancer and play a significant role in tumor development; 2) redundant genes, which resemble informative genes and may be related to cancer, but contribute little to molecular cancer classification; 3) irrelevant genes, which are unrelated to cancer and do not affect cancer classification; 4) noisy genes, which have a negative effect, and whose presence may reduce cancer classification performance. The purpose of gene selection is therefore to obtain the first type, the informative genes, while removing the other three types.
An advantage of the invention is that step 1 effectively removes irrelevant and noisy genes, so that step 2 can select a small number of genes for the subsequent optimization search. The optimization search of step 3 then effectively removes the redundant genes that step 1 cannot remove. Step 4 finally yields the "optimal gene set" with the highest classification accuracy and the fewest genes, that is, the tumor-related genes. The steps of the invention are simple and easy to carry out, and the amount of computation is small. The invention overcomes two problems: the gene set obtained by the existing SVM-RFE algorithm contains redundancy, and the existing genetic algorithm is computationally expensive, easily trapped in local optima and cannot be applied on its own. By combining the respective advantages of the SVM-RFE algorithm and the genetic algorithm, the method obtains an optimal gene set that is small, has high classification accuracy and is closely related to the tumor, which facilitates later experimental verification.
Experimental verification
1) Data sets
The performance of the SVM-RFE/GA model was verified on one two-class and one multi-class microarray gene expression data set. Table 1 gives the basic characteristics of these data sets.
Table 1. One two-class and one multi-class microarray gene expression data set
The prostate data set is a two-class gene expression data set containing 52 tumor samples and 50 normal prostate tissue samples. It was downloaded from http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi.
The NCI60 data set is a multi-class gene expression data set containing 9 tumor types and 60 samples. It was downloaded from http://www.broadinstitute.org/mpr/NCI60/.
2) Experimental platform
The experiments were carried out on the WEKA platform [16] (http://www.cs.waikato.ac.nz/ml/weka/). The SMO classifier performs the classification task with a polynomial kernel (PolyKernel). The penalty parameter C of the classifier is set to 100, and 10-fold cross-validation is used to evaluate the performance of the SMO classifier. The GA parameters are: crossover probability = 1, mutation probability = 0.02, number of generations = 50, and population size = 30.
The experimental data were preprocessed in two steps: housekeeping genes were removed, which leaves 12533 gene expression values in the prostate data set and 7071 in the NCI60 data set; and the gene expression values were standardized to a mean of 0 and a standard deviation of 1.
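The standardization step can be sketched as a z-score transform. The text does not say whether it is applied per gene or per sample, nor which standard deviation estimator is used; the sketch assumes per-gene standardization with the population standard deviation, and maps constant genes to 0 to avoid division by zero. The function name `standardize_genes` is hypothetical.

```python
from statistics import mean, pstdev


def standardize_genes(X):
    """Scale every gene (column) of X to mean 0 and standard deviation 1.

    X is a list of samples, each a list of expression values.
    Assumes per-gene scaling with the population standard deviation.
    """
    n_genes = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_genes)]
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [[(row[j] - m) / s if s > 0 else 0.0   # constant gene -> 0
             for j, (m, s) in enumerate(stats)]
            for row in X]


# Two samples, two genes; the second gene is constant.
print(standardize_genes([[1.0, 10.0], [3.0, 10.0]]))
# prints [[-1.0, 0.0], [1.0, 0.0]]
```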
3) Experimental results
The SVM-RFE algorithm is first used to generate the ranked gene set, in which the genes are arranged in descending order. Previous studies usually retain a subset of 50-100 genes; the method of the invention retains the top 100 ranked genes. To test the performance of the SVM-RFE algorithm, the number of genes was reduced from 100 to 1 by eliminating the lowest-scoring gene at each step, and the classifier was assessed by 10-fold cross-validation. With the initial 100 genes the classifier achieved 100% prediction accuracy on both data sets. As shown in Fig. 1, the smallest numbers of genes with which the classifier still obtains 100% accuracy are 9 for the prostate data set and 80 for the NCI60 data set. The results show that a two-class data set reaches satisfactory classification with fewer genes than a multi-class data set. On the NCI60 data set the classification accuracy falls below 90% when fewer than 36 genes are used, whereas on the prostate data set only 9 genes are needed to obtain a 10-fold cross-validation accuracy of 100%.
The top n genes ranked by SVM-RFE are taken as the candidate gene set, with n set to 10, 20, 30 and 50. Because the genetic algorithm is a stochastic search method, 5 runs were performed on each candidate gene set and the results were averaged.
On the prostate cancer data set, when the top 10 genes are retained, the genetic algorithm finds the smallest gene subsets while still reaching 100% classification accuracy (see Table 2). The mean gene subset size is 5.4, well below the 9 genes that the SVM-RFE method needs for the same accuracy.
Table 2. 10-fold cross-validation accuracy of the SVM-RFE/GA model on the prostate cancer data set
Top n genes | Mean prediction accuracy (%) | Mean gene subset size |
10 | 100 | 5.4 |
20 | 100 | 7.0 |
30 | 100 | 8.0 |
50 | 100 | 13.2 |
On the NCI60 data set, when the top 50 genes are retained, the genetic algorithm finds the smallest gene subsets that achieve 100% classification accuracy (see Table 3). The mean subset size of 28 genes is far smaller than the 80 genes that the SVM-RFE method needs to reach the same accuracy.
Table 3. 10-fold cross-validation accuracy of the SVM-RFE/GA model on the NCI60 data set
Top n genes | Mean prediction accuracy (%) | Mean gene subset size |
10 | 65.8 | 6 |
20 | 84.6 | 13.8 |
30 | 94.1 | 20 |
50 | 100 | 28 |
As noted above, the choice of n is a key issue for the GA. When n is too small, the classifier cannot reach the highest prediction accuracy; conversely, when n is too large, the GA may become trapped in a local optimum and select more genes than necessary.
The gene subset that achieves the highest prediction accuracy with the fewest genes is defined as the "optimal gene set". On the prostate cancer data set, searching among the top 10 ranked genes yields the gene subset with the fewest genes (n = 5) while achieving 100% prediction accuracy (see Table 4). On the NCI60 cancer data set, searching among the top 50 ranked genes yields the gene subset with the fewest genes (n = 26) while achieving 100% prediction accuracy (see Table 5).
Table 4. Optimal gene set obtained from the prostate cancer data set
Table 5. Optimal gene set obtained from the NCI60 data set
The results of the SVM-RFE/GA model were compared with those of other algorithms in terms of prediction accuracy and number of selected genes. On the prostate cancer data set (see Table 6), only the SVM-RFE/GA model and the SVM-RFE algorithm reach 100% prediction accuracy, and SVM-RFE/GA selects fewer genes. On the NCI60 data set (see Table 7) the advantage of SVM-RFE/GA is more pronounced: at the same 100% prediction accuracy, SVM-RFE/GA uses far fewer genes (n = 26) than the SVM-RFE algorithm (n = 80).
Table 6. Comparison of the SVM-RFE/GA algorithm with other algorithms on the prostate cancer data set
Table 7. Comparison of the SVM-RFE/GA algorithm with other algorithms on the NCI60 data set
Gene selection has long been an important research topic in microarray data analysis. Gene selection methods aim to eliminate noisy, irrelevant and redundant genes, which not only reduces the computational burden of the classifier but also improves its classification accuracy. In addition, an informative gene subset that contains fewer genes is easier to verify in subsequent molecular biology experiments.
In summary, the invention proposes a model that combines the GA with the SVM-RFE algorithm and thereby combines the respective advantages of embedded and wrapper methods. The method was verified on both a two-class and a multi-class microarray gene expression data set. The results show that, compared with other algorithms, the feature gene selection method proposed by the invention reaches the highest classification accuracy with fewer informative genes. Some of the genes in the optimal gene sets obtained in this experiment (Table 4 and Table 5) have been reported in the literature to be closely related to tumor occurrence and development; the remaining genes can be verified further by later molecular biology experiments, with a view to discovering new tumor gene markers.
References:
[1] Chin L, Andersen JN, Futreal PA (2011). Cancer genomics: from discovery science to personalized medicine. Nat Med 17(3):297-303.
[2] Ong FS, Das K, Wang J, Vakil H, Kuo JZ, Blackwell WL, Lim SW, Goodarzi MO, Bernstein KE, Rotter JI, Grody WW (2012). Personalized medicine and pharmacogenetic biomarkers: progress in molecular oncology testing. Expert Rev Mol Diagn 12(6):593-602.
[3] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531-7.
[4] Saeys Y, Inza I, Larranaga P (2007). A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507-17.
[5] Li X, Peng S, Chen J, Lu B, Zhang H, Lai M (2012). SVM-T-RFE: A novel gene selection algorithm for identifying metastasis-related genes in colorectal cancer using gene expression profiles. Biochemical and Biophysical Research Communications 419(2):148-53.
[6] Guyon I, Weston J, Barnhill S, Vapnik V (2002). Gene selection for cancer classification using support vector machines. Machine Learning 46(1-3):389-422.
[7] Duan KB, Rajapakse JC, Wang HY, Azuaje F (2005). Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Transactions on Nanobioscience 4(3):228-34.
[8] Zhang XG, Lu X, Shi Q, Xu XQ, Leung HCE, Harris LN, D Iglehart J, Miron A, Liu JS, Wong WH (2006). Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics 7:-.
[9] Zhou X, Tuck DP (2007). MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data (vol 23, pg 1106, 2007). Bioinformatics 23(15):2029-.
[10] Tang YC, Zhang YQ, Huang Z (2007). Development of two-stage SVM-RFE gene selection strategy for microarray expression data analysis. IEEE-ACM Transactions on Computational Biology and Bioinformatics 4(3):365-81.
[11] Mundra PA, Rajapakse JC (2010). SVM-RFE With MRMR Filter for Gene Selection. IEEE Transactions on Nanobioscience 9(1):31-7.
[12] Le Cun Y, Denker J, Solla S, Touretzky DS. Optimal brain damage. Advances in Neural Information Processing Systems: Morgan Kaufmann; 1990. p.598-605.
[13] Tan F, Fu X, Zhang Y, Bourgeois A (2008). A genetic algorithm-based method for feature subset selection. Soft Computing 12(2):111-20.
[14] Nicoletta D, Barbara P (2009). An evolutionary method for combining different feature selection criteria in microarray data classification. 2009:1-10.
[15] Cannas L, Dessi N, Pes B. A Hybrid Model to Favor the Selection of High Quality Features in High Dimensional Domains. Intelligent Data Engineering and Automated Learning-IDEAL 2011: Springer Berlin Heidelberg; 2011. p.228-35.
[16] Mark H, Eibe F, Geoffrey H, Bernhard P, Peter R, Ian HW (2009). The WEKA data mining software: an update. SIGKDD Explor Newsl 11(1):10-8.
[17] Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2):203-9.
[18] Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, Mesirov JP, Lander ES, Golub TR (2001). Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci U S A 98(19):10787-92.
[19] Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D (2005). Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 21(20):3896-904.
[20] Peng SH, Xu QH, Ling XB, Peng XN, Du W, Chen LB (2003). Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Letters 555(2):358-62.
[21] Ooi CH, Tan P (2003). Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics 19(1):37-44.
Claims (3)
1. a method for mixing and optimizing for tumor-related gene search, is characterized in that, specifically implements according to following steps:
Step 1, support vector machine recursive feature elimination algorithm is utilized to obtain " ranked genes collection "
For a linear SVM classifier, there is an optimum hyperplane, its class interval is defined as:
Margin width=2/ | | w | |, (2)
Wherein, w is the vertical vector of optimal hyperlane;
xiIt is the sample i gene expression vector in training set, i=1,2 ..., k, k are the numbers supporting vector;
ci∈ [-1 ,+1] is the class label of sample i;
Weight αiThen it is calculated from training set, the weight α of most training vectorsiIt is zero, if the weight of this sample training vector
αiFor nonzero value, then for supporting vector, margin width refers to class interval;
SVM-RFE uses the step of backward elimination, repeatedly deletes each gene minimum to SVM classifier contribution, SVM-RFE's
Object function J is defined as:
J=(1/2) | | w | |2, (3)
By approximate second Taylor series expansion J, approach each gene of removal with Optimal Brain Damage algorithm and cause J
Change, then have:
During the optimization of J, its single order Taylor series are left in the basket, and then its second order Taylor series become:
Δ J (i)=(Δ wi)2, (5)
Due to Δ wi=wiWeight change relevant to removing ith feature in grader, therefore (wi)2By as SVM-RFE's
Scoring criteria, has the eigenvalue (w of minimum every timei)2Feature will be eliminated;
Step 2, set up candidate gene collection Ωk
Select before ranking the gene of n as candidate gene collection, parameter n specifically regard microarray gene expression data collection situation and
Fixed;
Step 3, to candidate gene collection Ωk, utilize genetic algorithm to search solution space
Principle based on the survival of the fittest, the development of every generation will produce more more preferable approximate solution, in each generation, often each and every one
Body is evaluated by the fitness function of Problem Areas, and the individuality more adapted to is retained;Then, with intersecting and the heredity behaviour of variation
Make, create new solution collection;Circulation performs this process, until predetermined end condition;
Step 4: determine the optimal gene set
The gene sets obtained in step 3 are compared with respect to the prediction accuracy of each model and the average gene subset size. When the prediction accuracy is the same, the value of n that gives the smallest average gene subset size is selected as the optimal parameter n, and the method is run with this optimal parameter n to obtain the gene subset with the fewest genes and the highest prediction accuracy, i.e. the "optimal gene set"; the genes in the "optimal gene set" are regarded as tumor-related genes.
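Steps 2 and 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `RunResult`, `top_n_candidates` and `choose_optimal_n` are hypothetical, the ranked list is assumed to be ordered best gene first, and the accuracy and subset-size values passed in would come from actually running step 3 for each candidate n.

```python
from typing import List, NamedTuple, Sequence


class RunResult(NamedTuple):
    """Outcome of running steps 2-3 for one candidate value of n (hypothetical record)."""
    n: int                  # candidate-set size used in step 2
    accuracy: float         # prediction accuracy of the resulting model
    avg_subset_size: float  # average size of the gene subsets found in step 3


def top_n_candidates(ranked_genes: Sequence[int], n: int) -> List[int]:
    """Step 2: keep the n best-ranked genes (assumes ranked_genes is ordered best first)."""
    return list(ranked_genes[:n])


def choose_optimal_n(results: Sequence[RunResult]) -> int:
    """Step 4: prefer the highest accuracy; among equal accuracies, the smallest average subset."""
    best = max(results, key=lambda r: (r.accuracy, -r.avg_subset_size))
    return best.n
```

Comparing on the pair (accuracy, negative subset size) expresses the tie-break rule of step 4 directly: subset size only matters when two runs reach the same accuracy.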
The method for mixing and optimizing of tumor-related gene search according to claim 1, characterized in that, in step 1, the specific steps of the SVM-RFE algorithm are:
Input: the initial gene set I = {1, 2, ..., n} and the ranked gene set O = {};
Repeat the following steps 1.1-1.4 until the initial gene set I is empty:
1.1) with the genes in the initial gene set I as input variables, train a linear SVM on the training dataset;
1.2) for every gene in the initial gene set I, compute the ranking score $r_i = (w_i)^2$;
1.3) select the gene with the smallest ranking score: $g = \arg\min_i \{r_i\}$;
1.4) update the ranked gene set O and the initial gene set I: $O = O \cup \{g\}$, $I = I - \{g\}$, that is, gene g is removed from the initial gene set I and added to the ranked gene set O;
Output: the ranked gene set O.
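The loop in steps 1.1-1.4 can be sketched as follows. This is a minimal illustration, not the claimed implementation: `train_linear_svm` is a hypothetical stand-in that fits a linear SVM by batch subgradient descent on the L2-regularised hinge loss, and its parameters `lam`, `eta` and `epochs` are assumptions. Any linear SVM trainer that returns one weight per gene could be used in its place.

```python
from typing import Dict, List, Sequence


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int],
                     features: Sequence[int], lam: float = 0.01,
                     eta: float = 0.1, epochs: int = 200) -> Dict[int, float]:
    """Hypothetical trainer: batch subgradient descent on the regularised hinge loss.

    Labels in y must be +1 or -1. Only the columns listed in `features` are used.
    Returns the weight w_i of every active feature.
    """
    w = {f: 0.0 for f in features}
    b = 0.0
    m = len(X)
    for _ in range(epochs):
        grad_w = {f: lam * w[f] for f in features}
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(w[f] * xi[f] for f in features) + b)
            if margin < 1:  # sample violates the margin, so it contributes to the subgradient
                for f in features:
                    grad_w[f] -= yi * xi[f] / m
                grad_b -= yi / m
        for f in features:
            w[f] -= eta * grad_w[f]
        b -= eta * grad_b
    return w


def svm_rfe(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[int]:
    """Return the ranked gene set O, in order of elimination (least important first)."""
    remaining = list(range(len(X[0])))  # initial gene set I
    ranked: List[int] = []              # ranked gene set O
    while remaining:
        w = train_linear_svm(X, y, remaining)              # step 1.1
        scores = {f: w[f] ** 2 for f in remaining}         # step 1.2: r_i = (w_i)^2
        g = min(remaining, key=lambda f: scores[f])        # step 1.3
        ranked.append(g)                                   # step 1.4: O = O ∪ {g}
        remaining.remove(g)                                #           I = I - {g}
    return ranked
```

Because genes are appended as they are eliminated, the last entries of the returned list are the most important genes; reversing the list gives a best-first ranking.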
The method for mixing and optimizing of tumor-related gene search according to claim 1, characterized in that, in step 3, the specific steps of the GA algorithm are:
3.1) individual representation: each individual is encoded as an N-bit binary vector, where N is the size of the genetic space; a bit value of "1" indicates that the corresponding gene is selected, and a bit value of "0" indicates that the gene is not selected;
3.2) setting the fitness function: each individual is evaluated by a support vector machine classifier, for example the SMO classifier of the WEKA platform, and the objective is to minimize the classification error rate of the classifier;
3.3) setting the genetic operators: the genetic operations are implemented by roulette-wheel selection, single-point crossover and bit-flip mutation.
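Steps 3.1-3.3 can be sketched as below. This is a hedged illustration only: the `fitness` argument is a hypothetical callable that must return a non-negative value to be maximised (for example, 1 minus the classification error rate of an SVM trained on the selected genes), and the defaults for `pop_size`, `generations`, `crossover_rate`, `mutation_rate` and `seed` are assumptions. Keeping track of the best individual seen so far is an addition of this sketch, not something stated in the claim.

```python
import random
from typing import Callable, List, Sequence, Tuple

Individual = List[int]  # N-bit vector: 1 = gene selected, 0 = gene not selected


def roulette_select(population: Sequence[Individual], fitnesses: Sequence[float],
                    rng: random.Random) -> Individual:
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(list(population))
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick < running:
            return individual
    return population[-1]


def single_point_crossover(a: Individual, b: Individual,
                           rng: random.Random) -> Tuple[Individual, Individual]:
    """Swap the tails of two parents at one random cut point (needs length >= 2)."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def bit_flip_mutation(individual: Individual, rate: float,
                      rng: random.Random) -> Individual:
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def run_ga(n_genes: int, fitness: Callable[[Individual], float],
           pop_size: int = 20, generations: int = 30,
           crossover_rate: float = 0.8, mutation_rate: float = 0.02,
           seed: int = 0) -> Individual:
    """Search the space of gene subsets and return the fittest individual seen."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        fitnesses = [fitness(ind) for ind in population]
        next_pop: List[Individual] = []
        while len(next_pop) < pop_size:
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            if rng.random() < crossover_rate:
                c1, c2 = single_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            next_pop.append(bit_flip_mutation(c1, mutation_rate, rng))
            next_pop.append(bit_flip_mutation(c2, mutation_rate, rng))
        population = next_pop[:pop_size]
        best = max(population + [best], key=fitness)
    return best


# Usage with a hypothetical toy fitness that rewards selecting genes 0 and 3:
subset = run_ga(8, lambda ind: 1.0 + ind[0] + ind[3])
```

In a real run the toy lambda would be replaced by a function that trains and validates a classifier on the genes whose bits are set to 1.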
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610555700.6A CN106228034A (en) | 2016-07-12 | 2016-07-12 | A kind of method for mixing and optimizing of tumor-related gene search |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610555700.6A CN106228034A (en) | 2016-07-12 | 2016-07-12 | A kind of method for mixing and optimizing of tumor-related gene search |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106228034A (en) | 2016-12-14 |
Family ID=57520292
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610555700.6A Pending CN106228034A (en) | 2016-07-12 | 2016-07-12 | A kind of method for mixing and optimizing of tumor-related gene search |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106228034A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090175531A1 (en) * | 2004-11-19 | 2009-07-09 | Koninklijke Philips Electronics, N.V. | System and method for false positive reduction in computer-aided detection (cad) using a support vector macnine (svm) |
CN102170130A (en) * | 2011-04-26 | 2011-08-31 | 华北电力大学 | Short-term wind power prediction method |
CN102272764A (en) * | 2009-01-06 | 2011-12-07 | 皇家飞利浦电子股份有限公司 | Evolutionary clustering algorithm |
CN103186717A (en) * | 2013-01-18 | 2013-07-03 | 中国科学院合肥物质科学研究院 | Heuristic breadth-first searching method for cancer-related genes |
Non-Patent Citations (1)
Title |
---|
XIAOBO LI: "Gene selection for cancer classification using the combination of SVM-RFE and GA", 《COMPUTER MODELLING & NEW TECHNOLOGIES》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709267A (en) * | 2017-01-25 | 2017-05-24 | 武汉贝纳科技服务有限公司 | Data acquisition method and device |
CN108615555A (en) * | 2018-04-26 | 2018-10-02 | 山东师范大学 | Colorectal cancer prediction technique and device based on marker gene and mixed kernel function SVM |
CN112729411A (en) * | 2021-01-14 | 2021-04-30 | 金陵科技学院 | Distributed drug warehouse environment monitoring method based on GA-RNN |
CN112729411B (en) * | 2021-01-14 | 2022-09-13 | 金陵科技学院 | Distributed drug warehouse environment monitoring method based on GA-RNN |
CN113901999A (en) * | 2021-09-29 | 2022-01-07 | 国网四川省电力公司电力科学研究院 | Fault diagnosis method and system for high-voltage shunt reactor |
CN113901999B (en) * | 2021-09-29 | 2023-09-29 | 国网四川省电力公司电力科学研究院 | Fault diagnosis method and system for high-voltage shunt reactor |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sayed et al. | A nested genetic algorithm for feature selection in high-dimensional cancer microarray datasets | |
Algamal et al. | Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification | |
Alshamlan et al. | mRMR‐ABC: a hybrid gene selection algorithm for cancer classification using microarray gene expression profiling | |
Algamal et al. | Regularized logistic regression with adjusted adaptive elastic net for gene selection in high dimensional cancer classification | |
Thakur et al. | [Retracted] Gene Expression‐Assisted Cancer Prediction Techniques | |
Chuang et al. | A hybrid feature selection method for DNA microarray data | |
Abdi et al. | A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification | |
Hameed et al. | Filter-Wrapper Combination and Embedded Feature Selection for Gene Expression Data. | |
Jörnsten | Clustering and classification based on the L1 data depth | |
JP2020501240A (en) | Methods and systems for predicting DNA accessibility in pan-cancer genomes | |
CN112201346B (en) | Cancer lifetime prediction method, device, computing equipment and computer readable storage medium | |
Abdulla et al. | G-Forest: An ensemble method for cost-sensitive feature selection in gene expression microarrays | |
CN106228034A (en) | A kind of method for mixing and optimizing of tumor-related gene search | |
Dhillon et al. | eBreCaP: extreme learning‐based model for breast cancer survival prediction | |
Luque-Baena et al. | Application of genetic algorithms and constructive neural networks for the analysis of microarray cancer data | |
Kumar et al. | An amalgam method efficient for finding of cancer gene using CSC from micro array data | |
Thakur et al. | Machine learning techniques with ANOVA for the prediction of breast cancer | |
Ghorai et al. | Multicategory cancer classification from gene expression data by multiclass NPPC ensemble | |
Ray et al. | Transforming Breast Cancer Identification: An In-Depth Examination of Advanced Machine Learning Models Applied to Histopathological Images | |
Yang et al. | Feature selection using memetic algorithms | |
Huang et al. | Classifying breast cancer subtypes on multi-omics data via sparse canonical correlation analysis and deep learning | |
Shi et al. | Integration of Cancer Genomics Data for Tree‐based Dimensionality Reduction and Cancer Outcome Prediction | |
Bustamam et al. | Lung cancer classification based on support vector machine-recursive feature elimination and artificial bee colony | |
Jia et al. | DCCAFN: deep convolution cascade attention fusion network based on imaging genomics for prediction survival analysis of lung cancer | |
Palmal et al. | Integrative prognostic modeling for breast cancer: Unveiling optimal multimodal combinations using graph convolutional networks and calibrated random forest |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161214 |