CN106971091A - Tumor recognition method based on deterministic particle swarm optimization and support vector machine - Google Patents

Tumor recognition method based on deterministic particle swarm optimization and support vector machine

Info

Publication number
CN106971091A
Authority
CN
China
Prior art keywords
particle
gene
certainty
value
sigma
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710122492.5A
Other languages
Chinese (zh)
Other versions
CN106971091B (en)
Inventor
韩飞
李佳玲
凌青华
周从华
崔宝祥
宋余庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN201710122492.5A priority Critical patent/CN106971091B/en
Publication of CN106971091A publication Critical patent/CN106971091A/en
Application granted granted Critical
Publication of CN106971091B publication Critical patent/CN106971091B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/004 - Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 - Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 - ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

The invention discloses a tumor recognition method based on deterministic particle swarm optimization and a support vector machine. The method comprises: preprocessing tumor gene expression profile data; performing a primary selection of informative genes on the training set with a classification information index method, and then removing redundant genes with a pairwise redundancy method to obtain a candidate gene pool; further obtaining a key gene subset on the training set with the classification information index method; optimizing the parameters of the support vector machine on the training set with a deterministic particle swarm optimization algorithm; and then identifying the tumor gene expression profile data to be recognized. The invention makes full use of the suitability of the support vector machine for small-sample data and optimizes it with deterministic particle swarm optimization, which further improves the performance of the support vector machine and thereby improves the accuracy of tumor recognition.

Description

Tumor recognition method based on deterministic particle swarm optimization and support vector machine
Technical field
The invention belongs to the field of computer-based analysis of tumor gene expression profile data, and in particular relates to a tumor recognition method based on deterministic particle swarm optimization and a support vector machine.
Background technology
DNA microarray technology has brought enormous opportunities to biology, but the large volume of complex microarray data it produces poses great challenges to researchers in related fields, for four main reasons. First, microarray data contain much noise and many outliers: noise and outliers arise during experiments, and data processing can introduce further errors or mislabeled sample classes, so robust processing methods are needed. Second, gene expression profile data are large in scale, and handling large datasets is itself one of the difficulties to be solved; efficient algorithms with low computational and space complexity are therefore very valuable. Third, microarray data are high-dimensional with few samples, and the scale of the classification task on gene expression profile datasets grows exponentially with the number of genes, so coping with the curse of dimensionality is another difficulty. Fourth, microarray data exhibit nonlinear behavior and conceal a large amount of useful information; classical statistical methods must therefore be extended into nonlinear analysis methods for nonlinear datasets, and using such methods to mine and derive this latent biological information is extremely important.
Since Golub et al. pioneered the use of gene expression profiles for tumor classification in 1999, many classification methods based on gene expression profiles have been proposed, several of which are in common use. Different classification algorithms yield different classifiers, such as Bayesian classifiers, support vector machines and artificial neural networks, which learn from samples with known class labels to extract the information needed for classification. Experiments with these classifiers in tumor classification show that different classifiers have different classification ability on the same dataset; that is, no single classifier performs well on all datasets. The advantages of the SVM are that it handles high-dimensional sample data, achieves high classification accuracy, is robust to noise, and does not require tuning a large number of input parameters. In addition it has a sparseness property: the number of support vectors after training is generally small, which is very effective for gene expression profiles whose dimensionality keeps growing. Although the SVM is well suited to small-sample data, its parameter selection is time-consuming, and there is as yet no effective theory to guide the choice of SVM parameters, which limits the classification performance of the SVM.
Particle swarm optimization (PSO) has good global search ability. Compared with genetic algorithms, PSO needs no complex genetic operators, has few adjustable parameters and is easy to implement, so it has been widely applied in recent years. However, because of the randomness of the particle search, traditional PSO performs many blind search steps, which makes the search time long; its search performance leaves room for improvement. The present invention therefore introduces gradient-based deterministic search into the particle swarm optimization algorithm, combining random search with deterministic search to improve the search speed and precision of the swarm.
Content of the invention
Object of the invention: to optimize the parameters of the support vector machine with an improved particle swarm optimization algorithm (IGPSO), thereby improving the search performance of the support vector machine, and to apply the result to the recognition of tumor expression profile data so as to improve the accuracy of tumor recognition. Compared with traditional tumor expression profile recognition methods, this method effectively improves the tumor recognition accuracy.
Technical scheme: a tumor recognition method based on deterministic particle swarm optimization and a support vector machine, comprising gene subset screening based on a classification information index and a pairwise redundancy method, and optimization of the support vector machine with a deterministic particle swarm optimization algorithm (improved particle swarm optimization based on gradient search, IGPSO) to recognize tumor gene expression profile data, and comprising the following steps:
Step 1: preprocess the tumor gene expression profile dataset; first divide the dataset into a training set and a test set, then normalize the data and obtain the final key gene subset;
Step 2: propose the deterministic particle swarm optimization algorithm (IGPSO) and, on the training set, use it to optimize the support vector machine (SVM);
Step 3: on the test set, use the SVM model optimized in step 2 to identify the tumor gene expression profile dataset;
Further, step 1 comprises the following steps:
Step 1.1: divide the tumor gene expression profile dataset into a training set and a test set;
Step 1.2: compute the "classification information index" of each gene in the training set according to formula (1):

$$d(g) = \frac{1}{2}\,\frac{|\mu_g^{+}-\mu_g^{-}|}{\sigma_g^{+}+\sigma_g^{-}} + \frac{1}{2}\ln\!\left(\frac{(\sigma_g^{+})^{2}+(\sigma_g^{-})^{2}}{2\,\sigma_g^{+}\sigma_g^{-}}\right) \qquad (1)$$

where $d(g)$ is the classification information index of gene $g$, $\mu_g^{+}$ and $\mu_g^{-}$ are the means of the expression level of gene $g$ in the two classes of positive and negative samples, and $\sigma_g^{+}$ and $\sigma_g^{-}$ are the corresponding standard deviations.
Step 1.3: select all genes whose classification information index exceeds a threshold as the preliminarily filtered gene set.
Step 1.4: after the preliminary filtering with the classification information index method, compute the pairwise Pearson correlation coefficients between gene expression levels and remove the redundant gene of any pair whose correlation exceeds a chosen value, further reducing the size of the candidate gene pool.
Step 1.5: apply the classification information index method again within the candidate gene pool and select all genes above a threshold as the final key gene subset.
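The three-stage screening of steps 1.2-1.5 can be sketched roughly as follows. Python is used purely for illustration; the function names, the default thresholds and the rule of dropping the lower-scoring gene of a highly correlated pair are assumptions of this sketch, not details fixed by the patent.

```python
import numpy as np

def classification_information_index(x_pos, x_neg):
    """Classification information index d(g) of one gene, formula (1):
    a signal-to-noise term plus a variance-ratio term."""
    mu_p, mu_n = x_pos.mean(), x_neg.mean()
    sd_p, sd_n = x_pos.std(ddof=1), x_neg.std(ddof=1)
    snr = 0.5 * abs(mu_p - mu_n) / (sd_p + sd_n)
    var_term = 0.5 * np.log((sd_p ** 2 + sd_n ** 2) / (2.0 * sd_p * sd_n))
    return snr + var_term

def select_key_genes(X, y, d_thresh1=0.5, corr_thresh=0.9, d_thresh2=0.5):
    """Steps 1.2-1.5: index filter -> pairwise redundancy removal -> index filter.
    X is a (samples x genes) expression matrix, y holds labels in {+1, -1}."""
    pos, neg = X[y == 1], X[y == -1]
    d = np.array([classification_information_index(pos[:, g], neg[:, g])
                  for g in range(X.shape[1])])
    candidates = np.where(d > d_thresh1)[0]            # step 1.3: primary filter
    kept = []                                          # step 1.4: pairwise redundancy removal
    for g in sorted(candidates, key=lambda g: -d[g]):  # keep the higher-scoring gene of a pair
        if all(abs(np.corrcoef(X[:, g], X[:, h])[0, 1]) < corr_thresh for h in kept):
            kept.append(g)
    kept = np.array(kept, dtype=int)
    return kept[d[kept] > d_thresh2]                   # step 1.5: final key-gene subset
```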
Further, the deterministic particle swarm optimization algorithm proposed in step 2 comprises the following steps:
Step 2.1: randomly initialize the positions (x) and velocities (v) of the swarm within the initial range, together with the population diversity threshold (σ) for each function;
Step 2.2: compute the fitness of each particle and the gradient of the fitness function at its position;
Step 2.3: for each particle, compare its fitness with the fitness of the best position it has experienced; if the current position is better, take it as the particle's current best position;
Step 2.4: for each particle, compare its fitness with the fitness of the best position the swarm has experienced; if it is better, take it as the swarm's best position;
Step 2.5: when the population diversity is greater than the set threshold, update each particle's velocity according to formula (2), otherwise according to formula (5), and then update the particle's position;
The deterministic particle swarm optimization algorithm is divided into two phases. The first phase is the mutual-attraction process of the particles, which consists of two steps. First, while the population diversity is greater than an appropriate threshold, each particle moves along the negative gradient of the fitness function at its position and gathers towards the globally best particle. Then, once the neighborhood of an optimum has been found, a step-decreasing strategy is adopted: the particle's velocity is repeatedly reduced to perform a line search. The two steps of this phase are described by formulas (2) and (3), respectively.

$$v_{ij}(t+1) = w \cdot gra(i,j) + c_2 \cdot rand() \cdot \bigl(p_g - x_{ij}(t)\bigr) \qquad (2)$$

$$v_{ij}(t+1) = k \cdot v_{ij}(t) \qquad (3)$$

where $V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$ is the current velocity of particle $i$, $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the current position of particle $i$, $w$ is the inertia weight, $p_g$ is the global best position, and $k$ is a constant in $(0,1)$; for the fitness function $f(x)$, the corresponding negative gradient $gra(i,j)$ is defined by formula (4).
The second phase is the mutual-repulsion process of the particles. When the population diversity falls below the preset threshold, the particles are repelled adaptively to restore diversity, while each particle still searches along the gradient direction and approaches other local best points. Clearly, the larger the population diversity, the smaller the scattering speed, and the smaller the diversity, the larger the scattering speed. The particle velocity update formula is:

$$v_{ij}(t+1) = w\,v_{ij}(t) + c_1\,rand()\,gra(i,j) - c_2\,rand()\,\frac{1}{diversity}\,\bigl(p_g - x_{ij}(t)\bigr) \qquad (5)$$

where diversity is the population diversity computed by formula (6):

$$diversity(S) = \frac{1}{|S|\cdot|L|}\sum_{i=1}^{|S|}\sqrt{\sum_{j=1}^{N}\bigl(p_{ij}-\bar{p}_{j}\bigr)^{2}} \qquad (6)$$

where $S$ is the swarm, $|S|$ is the number of particles in the swarm, $|L|$ is the maximum radius of the search space, $N$ is the dimensionality of the problem, and $p_{ij}$ is the $j$-th component of the $i$-th particle.
Step 2.6: if the termination condition is not reached, go to step 2.2; otherwise output the fitness value.
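A minimal sketch of one IGPSO iteration follows, under several stated assumptions: the gradient gra(i, j) is estimated by central differences because formula (4) is not reproduced in this text, the step-decreasing rule of formula (3) is omitted, and the coefficient values w, c1, c2 are placeholders rather than values fixed by the patent.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-4):
    """Central-difference estimate of the fitness gradient at x; a stand-in
    for formula (4), whose exact form is not reproduced in this text."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

def diversity(P, L):
    """Population diversity of formula (6): mean distance of the particles
    from the swarm centroid, normalised by the search-space radius |L|."""
    centroid = P.mean(axis=0)
    return np.sqrt(((P - centroid) ** 2).sum(axis=1)).sum() / (len(P) * L)

def igpso_velocity_step(f, X, V, p_g, L, sigma, w=0.6, c1=1.5, c2=1.5):
    """One velocity/position update of the whole swarm (steps 2.2 and 2.5).
    Attraction phase (diversity > sigma): formula (2), following the negative
    gradient and the global best; repulsion phase: formula (5), which pushes
    particles away from p_g with a 1/diversity factor."""
    div = diversity(X, L)
    for i in range(len(X)):
        gra = -numerical_gradient(f, X[i])       # negative gradient, as in the patent text
        if div > sigma:                          # attraction, formula (2)
            V[i] = w * gra + c2 * np.random.rand() * (p_g - X[i])
        else:                                    # repulsion, formula (5)
            V[i] = (w * V[i] + c1 * np.random.rand() * gra
                    - c2 * np.random.rand() * (1.0 / div) * (p_g - X[i]))
    X += V                                       # position update
    return X, V
```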
Further, the optimization of the support vector machine with the deterministic particle swarm optimization algorithm comprises the following steps:
Step 3.1: set the search space of the SVM parameters C and σ, $x_{i,\min} \le x_i \le x_{i,\max}$, where C is the penalty factor, σ is the kernel parameter, $x_i$ is a parameter value and $i$ indexes the parameters (there are 2 here); when the algorithm starts, a parameter value x is chosen at random from the search space;
The SVM classification rule is given by formula (7):

$$f(x) = \sum_{i=1}^{r}\alpha_i y_i K(x_i, x) + b \qquad (7)$$

for the training set $T = \{(x_i, y_i);\ x_i \in R^n;\ y_i = \pm 1;\ i = 1, 2, \ldots, r\}$, where $x_i$ is a training sample, $x$ is the sample to be judged, $b$ is the bias, $\alpha_i$ are Lagrange multipliers, and $K(x_i, x)$ is the kernel function.
The optimization problem solved by the support vector machine and the constructed classification decision function are as follows:

$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{r}\sum_{j=1}^{r}y_i y_j\alpha_i\alpha_j K(x_i, x_j) - \sum_{j=1}^{r}\alpha_j \qquad (8)$$

$$\text{s.t.}\quad \sum_{i=1}^{r}y_i\alpha_i = 0,\qquad 0\le\alpha_i\le C,\qquad i = 1, 2, \ldots, r$$

$$f(x) = \operatorname{sgn}\!\left(\sum_{i=1}^{r}\alpha_i y_i K(x, x_i) + b\right) \qquad (9)$$

where $K(x, x_i)$ is the kernel function, $x_i$ is a training sample, $b$ is the bias, and $\alpha_i$ are Lagrange multipliers; the kernel maps the feature space to a higher-dimensional space. In practical applications the number of characteristic genes is small, so an SVM classifier with an RBF kernel is used to classify the tumor samples; the RBF kernel is:

$$K(x, x_i) = \exp\!\left(-\frac{\|x - x_i\|^{2}}{2\sigma^{2}}\right) \qquad (10)$$

Step 3.2: set the swarm size to N, the required classification accuracy to F, the expansion factor to Ex and the local window size to $w = [w_1, w_2]$; the maximum number of retries is $T_{\max}$, and the retry counter t and the expansion factor start at 0;
Step 3.3: starting from the search space $p = [p_1, p_2]$ set when the algorithm is initialized, enlarge the search range by the expansion factor Ex and compute the local positions so that the local region falls within the search space $p + 0.6\,Ex \cdot w$;
Step 3.4: compute the classification performance function $f_p$ corresponding to x;
Step 3.5: search for the optimum with the IGPSO algorithm and obtain the classification performance function $f_c$ corresponding to the optimum;
Step 3.6: if a better classification rate is found ($f_p < f_c$), set t and Ex to 0; otherwise t = t + 1;
Step 3.7: if $t \ge T_{\max}$, set t to 0 and Ex = Ex + 1; the search may now be trapped in a local optimum, so the search range is enlarged to escape the current local region;
Step 3.8: if the classification accuracy requirement is reached ($f_p \le F$), output the values of {C, σ} and the classification accuracy and terminate; otherwise go to step 3.3.
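As a rough illustration of how steps 3.1-3.8 could be wired together (not the patent's reference implementation), the classification performance of a candidate (C, σ) can be taken as the cross-validated accuracy of an RBF-kernel SVM. scikit-learn's SVC is used with gamma = 1/(2σ²) so that it matches the RBF kernel of formula (10); the window-growth rule is a loose reading of step 3.3, the termination test is written as accuracy ≥ F, and igpso_run is an assumed interface standing in for an IGPSO driver built on the velocity update sketched above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def svm_fitness(params, X_train, y_train):
    """Classification performance for a candidate particle position (C, sigma)."""
    C, sigma = params
    if C <= 0 or sigma <= 0:                      # outside the valid parameter region
        return 0.0
    clf = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))  # RBF kernel, formula (10)
    return cross_val_score(clf, X_train, y_train, cv=5).mean()

def optimise_svm_parameters(X_train, y_train, igpso_run,
                            start=(8.0, 3.0), widths=(0.3, 0.1),
                            F=0.95, T_max=10, max_rounds=100):
    """Outer loop of steps 3.2-3.8: run IGPSO in a local window and, if no
    better classification rate is found T_max times in a row, grow the window
    by the expansion factor Ex to escape a suspected local optimum.
    igpso_run(fitness, centre, half_width) is an assumed IGPSO driver."""
    def fitness(p):
        return svm_fitness(p, X_train, y_train)

    best = np.asarray(start, dtype=float)
    f_best = fitness(best)                        # f_p of the current point (step 3.4)
    t, Ex = 0, 0
    w = np.asarray(widths, dtype=float)
    for _ in range(max_rounds):                   # safety cap, not part of the patent
        if f_best >= F:                           # step 3.8: required accuracy reached
            break
        half_width = 0.6 * (Ex + 1) * w           # step 3.3: local window grown with Ex
        cand, f_cand = igpso_run(fitness, best, half_width)   # step 3.5
        if f_cand > f_best:                       # step 3.6: better classification rate
            best, f_best, t, Ex = np.asarray(cand, dtype=float), f_cand, 0, 0
        else:
            t += 1
        if t >= T_max:                            # step 3.7: possible local optimum
            t, Ex = 0, Ex + 1
    return tuple(best), f_best                    # optimised (C, sigma) and its accuracy
```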
Beneficial effects: tumor gene expression profile datasets are high-dimensional with few samples; the support vector machine generalizes well on data of this kind and has long been used to classify it. However, the classification performance of the support vector machine depends on its parameter selection, a problem that has never been solved satisfactorily and that greatly limits the application of the SVM. The deterministic-search particle swarm optimization algorithm of the present invention performs local search with gradient information: when a particle searches into the neighborhood of an optimum, its velocity is repeatedly reduced so that the step length of the line search does not become too large. The global behavior is controlled with the population diversity characteristic and the attraction and repulsion principle: when the swarm falls into a local optimum, the particles are repelled adaptively to preserve diversity. The algorithm can therefore converge quickly to a solution of high precision. Accordingly, the SVM is optimized with the particle swarm optimization algorithm based on deterministic search, the RBF kernel parameter and the penalty factor are tuned, and the classification performance of the SVM is improved, which in turn helps improve the recognition accuracy for tumor gene expression profile data.
Brief description of the drawings
Fig. 1 is the structural block diagram of the present invention;
Fig. 2 is the flow chart of the deterministic particle swarm optimization algorithm in the present invention.
Embodiment
A tumor recognition method based on deterministic particle swarm optimization and a support vector machine, comprising gene screening based on a classification information index and a pairwise redundancy method, and optimization of the support vector machine with the deterministic particle swarm optimization algorithm (IGPSO) for tumor gene recognition, comprises the following steps:
Step 1: preprocess the tumor gene expression profile dataset; first divide the dataset into a training set and a test set, then normalize the data and obtain the final key gene subset;
Step 2: propose the deterministic particle swarm optimization algorithm (IGPSO);
Step 3: on the training set, optimize the support vector machine (SVM) with the deterministic particle swarm optimization algorithm;
Step 4: on the test set, use the SVM model optimized in step 3 to identify the tumor gene expression profile dataset;
Further, step 1 comprises the following steps:
Step 1.1: divide the tumor gene expression profile dataset into a training set and a test set;
Step 1.2: compute the "classification information index" of each gene in the training set according to formula (1), where $d(g)$ is the classification information index of gene $g$, $\mu_g^{+}$ and $\mu_g^{-}$ are the means of the expression level of gene $g$ in the two classes of positive and negative samples, and $\sigma_g^{+}$ and $\sigma_g^{-}$ are the corresponding standard deviations.
Step 1.3: select all genes whose classification information index exceeds a threshold as the preliminarily filtered gene set.
Step 1.4: after the preliminary filtering with the classification information index method, compute the pairwise Pearson correlation coefficients between gene expression levels and remove the redundant gene of any pair whose correlation exceeds a chosen value, further reducing the size of the candidate gene pool.
Step 1.5: apply the classification information index method again within the candidate gene pool and select all genes above a threshold as the final key gene subset.
Further, step 2 comprises the following steps:
Step 2.1: randomly initialize the positions (x) and velocities (v) of the swarm within the initial range, together with the population diversity threshold (σ) for each function;
Step 2.2: compute the fitness of each particle and the gradient of the fitness function at its position;
Step 2.3: for each particle, compare its fitness with the fitness of the best position it has experienced; if the current position is better, take it as the particle's current best position;
Step 2.4: for each particle, compare its fitness with the fitness of the best position the swarm has experienced; if it is better, take it as the swarm's best position;
Step 2.5: when the population diversity is greater than the set threshold, update each particle's velocity according to formula (2), otherwise according to formula (5), and then update the particle's position;
The deterministic particle swarm optimization algorithm is divided into two phases. The first phase is the mutual-attraction process of the particles, consisting of two steps: first, while the population diversity is greater than an appropriate threshold, each particle moves along the negative gradient of the fitness function at its position and gathers towards the globally best particle; once the neighborhood of an optimum has been found, a step-decreasing strategy repeatedly reduces the particle's velocity to perform a line search. The two steps of this phase are described by formulas (2) and (3), respectively.

$$v_{ij}(t+1) = w \cdot gra(i,j) + c_2 \cdot rand() \cdot \bigl(p_g - x_{ij}(t)\bigr) \qquad (2)$$

$$v_{ij}(t+1) = k \cdot v_{ij}(t) \qquad (3)$$

where $V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$ is the current velocity of particle $i$, $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the current position of particle $i$, $w$ is the inertia weight, $p_g$ is the global best position, and $k$ is a constant in $(0,1)$; for the fitness function $f(x)$, the corresponding negative gradient $gra(i,j)$ is defined by formula (4).
The second phase is the mutual-repulsion process of the particles. When the population diversity falls below the preset threshold, the particles are repelled adaptively to restore diversity, while each particle still searches along the gradient direction and approaches other local best points. The larger the population diversity, the smaller the scattering speed, and the smaller the diversity, the larger the scattering speed. The particle velocity is then updated by formula (5), where diversity is the population diversity of formula (6); here $S$ is the swarm, $|S|$ is the number of particles in the swarm, $|L|$ is the maximum radius of the search space, $N$ is the dimensionality of the problem, and $p_{ij}$ is the $j$-th component of the $i$-th particle.
Step 2.6: if the termination condition is not reached, go to step 2.2; otherwise output the fitness value.
Further, step 3 comprises the following steps:
Step 3.1: set the search space of the SVM parameters C and σ, $x_{i,\min} \le x_i \le x_{i,\max}$, where C is the penalty factor, σ is the kernel parameter, $x_i$ is a parameter value and $i$ indexes the parameters (there are 2 here); when the algorithm starts, a parameter value x is chosen at random from the search space;
The SVM classification rule is given by formula (7) for the training set $T = \{(x_i, y_i);\ x_i \in R^n;\ y_i = \pm 1;\ i = 1, 2, \ldots, r\}$, where $x_i$ is a training sample, $x$ is the sample to be judged, $b$ is the bias, $\alpha_i$ are Lagrange multipliers, and $K(x_i, x)$ is the kernel function.
The optimization problem solved by the support vector machine and the constructed classification decision function are given by formulas (8) and (9), where $K(x, x_i)$ is the kernel function, $x_i$ is a training sample, $b$ is the bias, and $\alpha_i$ are Lagrange multipliers; the kernel maps the feature space to a higher-dimensional space. In practical applications the number of characteristic genes is small, so an SVM classifier with the RBF kernel of formula (10) is used to classify the tumor samples.
Step 3.2: set the swarm size to N, the required classification accuracy to F, the expansion factor to Ex and the local window size to $w = [w_1, w_2]$; the maximum number of retries is $T_{\max}$, and the retry counter t and the expansion factor start at 0;
Step 3.3: starting from the search space $p = [p_1, p_2]$ set when the algorithm is initialized, enlarge the search range by the expansion factor Ex and compute the local positions so that the local region falls within the search space $p + 0.6\,Ex \cdot w$;
Step 3.4: compute the classification performance function $f_p$ corresponding to x;
Step 3.5: search for the optimum with the IGPSO algorithm and obtain the classification performance function $f_c$ corresponding to the optimum;
Step 3.6: if a better classification rate is found ($f_p < f_c$), set t and Ex to 0; otherwise t = t + 1;
Step 3.7: if $t \ge T_{\max}$, set t to 0 and Ex = Ex + 1; the search may now be trapped in a local optimum, so the search range is enlarged to escape the current local region;
Step 3.8: if the classification accuracy requirement is reached ($f_p \le F$), output the values of {C, σ} and the classification accuracy and terminate; otherwise go to step 3.3.
The implementation of the present invention is now briefly illustrated with a tumor gene expression profile dataset as an example. This example uses the colon cancer tumor expression profile dataset, which contains 62 samples in total, each represented by the expression values of 2000 genes; the 62 samples comprise 22 normal samples and 40 tumor samples. On this dataset, the specific execution steps of the invention are as follows:
As shown in Fig. 1, the tumor recognition method based on deterministic particle swarm optimization and a support vector machine comprises gene screening based on the classification information index and the pairwise redundancy method, and optimization of the support vector machine with the deterministic particle swarm optimization algorithm (IGPSO) for tumor gene recognition, and includes the following steps:
(1) The dataset is divided into a training set and a test set. On the training set, the improved signal-to-noise ratio formula of the classification information index method is evaluated for each gene. The larger a gene's information index, the more sample classification information it carries and the stronger its ability to discriminate between the classes. Table 1 shows the distribution of the classification information index on the colon cancer dataset. The 173 genes with an information index greater than 0.5 are selected as the gene subset for the analysis below.
(2) Redundant genes are excluded by computing the Pearson correlation coefficient between the expression levels of two genes. For the colon cancer data, the 173 genes selected by the above method are analyzed; after the pairwise redundancy calculation and comparison, 59 genes are finally obtained.
(3) The gene subset obtained above is evaluated again with the improved signal-to-noise ratio formula of the classification information index, and 11 genes of the colon cancer dataset are selected as the final key gene subset. Table 2 shows the key gene subset finally screened for the colon cancer data.
(4) The two parameters of the support vector machine are initialized with the search range 0 < C < 16, 0 < σ < 6; the maximum number of retries is set to 10, the expansion step of C is 0.3, and the expansion step of σ is 0.1. Combining the 11 key genes obtained above, the two parameters of the support vector machine are optimized with the IGPSO algorithm. The IGPSO algorithm searches locally along the gradient of the classification rate (the performance function) with respect to the parameters {C, σ}; if the maximum number of retries is reached without finding a better classification rate, the search range is enlarged.
(5) On the test set, the data are classified with the SVM optimized by the IGPSO algorithm. Table 3 shows the classification results on the colon cancer samples. A sketch of how these steps could be chained in code is given below.
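For orientation only, the pieces above might be chained on a dataset of this shape roughly as follows; load_colon_data and igpso_run are placeholders, and select_key_genes and optimise_svm_parameters refer to the earlier sketches rather than to any code disclosed in the patent.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# X: 62 x 2000 expression matrix, y holds labels in {+1, -1}.
X, y = load_colon_data()                                  # placeholder loader
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                           test_size=0.3, random_state=0)

scaler = MinMaxScaler().fit(X_tr)                         # step 1: normalisation
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

genes = select_key_genes(X_tr, y_tr)                      # steps (1)-(3): key gene subset
(C, sigma), acc = optimise_svm_parameters(X_tr[:, genes], y_tr, igpso_run)  # step (4)

clf = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X_tr[:, genes], y_tr)
print("test-set accuracy:", clf.score(X_te[:, genes], y_te))                # step (5)
```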
Table 1 gives the distribution of the colon cancer classification information index.
Table 1. Distribution of the colon cancer classification information index in the present invention

Gene information index    Number of genes    Proportion of the 2000 genes (%)
0.0~0.3                   1524               76.2
0.3~0.5                   303                15.15
0.5~1.897                 173                8.65
Table 2 gives the colon cancer gene subset used for classification.
Table 2. Colon cancer gene subset in the present invention
Table 3 gives the classification results on the colon cancer samples in the present invention. When the penalty factor C is small, the misclassification rate is high; as C increases, the misclassification rate drops sharply, i.e., the classification performance improves rapidly; continuing to increase C changes the classification performance little, and once C exceeds a certain value the performance no longer changes with C, i.e., the SVM is insensitive to C over a wide range. In the experiments, the classification accuracy is higher for C in the range (6, 15), that is, the SVM is insensitive to C in this region. The experiments also show that, at the optimal classification setting, appropriately reducing the value of σ, i.e., applying a suitable correction to it, noticeably improves the classification accuracy. The final experimental results show that σ values between 0.9 and 1.88 give good classification performance.
Table 3. Classification of the present invention on the colon cancer samples
Table 4 gives a comparison of the method proposed by the present invention with related SVM methods.
Table 4. Comparison of the proposed method with related SVM methods
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an illustrative example", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic references to these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and purpose of the present invention; the scope of the invention is defined by the claims and their equivalents.

Claims (4)

1. A tumor recognition method based on deterministic particle swarm optimization and a support vector machine, characterized in that it comprises the following steps:
Step 1: preprocess the tumor gene expression profile dataset; first divide the dataset into a training set and a test set, then normalize the data and obtain the final key gene subset; Step 2: propose the deterministic particle swarm optimization algorithm IGPSO and, on the training set, use it to optimize the support vector machine SVM; Step 3: on the test set, use the SVM model optimized in step 2 to identify the tumor gene expression profile dataset.
2. The tumor recognition method based on deterministic particle swarm optimization and a support vector machine according to claim 1, characterized in that step 1 comprises the following steps:
Step 1.1: divide the tumor gene expression profile dataset into a training set and a test set;
Step 1.2: compute the "classification information index" of each gene in the training set according to formula (1);

$$d(g) = \frac{1}{2}\,\frac{|\mu_g^{+}-\mu_g^{-}|}{\sigma_g^{+}+\sigma_g^{-}} + \frac{1}{2}\ln\!\left(\frac{(\sigma_g^{+})^{2}+(\sigma_g^{-})^{2}}{2\,\sigma_g^{+}\sigma_g^{-}}\right) \qquad (1)$$

where $d(g)$ is the classification information index of gene $g$, $\mu_g^{+}$ and $\mu_g^{-}$ are the means of the expression level of gene $g$ in the two classes of positive and negative samples, and $\sigma_g^{+}$ and $\sigma_g^{-}$ are the corresponding standard deviations;
Step 1.3: select all genes whose classification information index exceeds a threshold as the preliminarily filtered gene set;
Step 1.4: after the preliminary filtering with the classification information index method, compute the pairwise Pearson correlation coefficients between gene expression levels and remove the redundant gene of any pair whose correlation exceeds a chosen value, again reducing the size of the candidate gene pool;
Step 1.5: to narrow the key gene set further, apply the classification information index method again within the candidate gene pool and select all genes above a threshold as the final key gene subset.
3. The tumor recognition method based on deterministic particle swarm optimization and a support vector machine according to claim 1, characterized in that the deterministic particle swarm optimization algorithm IGPSO proposed in step 2 comprises the following steps:
Step 2.1: randomly initialize the positions x and velocities v of the swarm within the initial range, together with the population diversity threshold σ for each function;
Step 2.2: compute the fitness of each particle and the gradient of the fitness function at its position;
Step 2.3: for each particle, compare its fitness with the fitness of the best position it has experienced; if the current position is better, take it as the particle's current best position;
Step 2.4: for each particle, compare its fitness with the fitness of the best position the swarm has experienced; if it is better, take it as the swarm's best position;
Step 2.5: when the population diversity is greater than the set threshold, update each particle's velocity according to formula (2), otherwise according to formula (5), and then update the particle's position;
The deterministic particle swarm optimization algorithm is divided into two phases. The first phase is the mutual-attraction process of the particles, consisting of two steps: first, while the population diversity is greater than an appropriate threshold, each particle moves along the negative gradient of the fitness function at its position and gathers towards the globally best particle; once the neighborhood of an optimum has been found, a step-decreasing strategy repeatedly reduces the particle's velocity to perform a line search; the two steps of this phase are described by formulas (2) and (3), respectively;

$$v_{ij}(t+1) = w \cdot gra(i,j) + c_2 \cdot rand() \cdot \bigl(p_g - x_{ij}(t)\bigr) \qquad (2)$$

$$v_{ij}(t+1) = k \cdot v_{ij}(t) \qquad (3)$$

where $V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$ is the current velocity of particle $i$, $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the current position of particle $i$, $w$ is the inertia weight, $p_g$ is the global best position, and $k$ is a constant in $(0,1)$; for the fitness function $f(x)$, the corresponding negative gradient $gra(i,j)$ is defined by formula (4);
The second phase is the mutual-repulsion process of the particles; when the population diversity falls below the preset threshold, the particles are repelled adaptively to restore diversity, while each particle still searches along the gradient direction and approaches other local best points; the larger the population diversity, the smaller the scattering speed, and the smaller the diversity, the larger the scattering speed; the particle velocity update formula is:

$$v_{ij}(t+1) = w\,v_{ij}(t) + c_1\,rand()\,gra(i,j) - c_2\,rand()\,\frac{1}{diversity}\,\bigl(p_g - x_{ij}(t)\bigr) \qquad (5)$$

where diversity is the population diversity computed by formula (6);

$$diversity(S) = \frac{1}{|S|\cdot|L|}\sum_{i=1}^{|S|}\sqrt{\sum_{j=1}^{N}\bigl(p_{ij}-\bar{p}_{j}\bigr)^{2}} \qquad (6)$$

where $S$ is the swarm, $|S|$ is the number of particles in the swarm, $|L|$ is the maximum radius of the search space, $N$ is the dimensionality of the problem, and $p_{ij}$ is the $j$-th component of the $i$-th particle;
Step 2.6: if the termination condition is not reached, go to step 2.2; otherwise output the fitness value.
4. The tumor recognition method based on deterministic particle swarm optimization and a support vector machine according to claim 1, characterized in that optimizing the support vector machine SVM with the deterministic particle swarm optimization algorithm in step 2 comprises the following steps:
Step 3.1: set the search space of the SVM parameters C and σ, $x_{i,\min} \le x_i \le x_{i,\max}$, where C is the penalty factor, σ is the kernel parameter, $x_i$ is a parameter value and $i$ indexes the parameters (there are 2 here); when the algorithm starts, a parameter value x is chosen at random from the search space; the SVM classification rule is given by formula (7):

$$f(x) = \sum_{i=1}^{r}\alpha_i y_i K(x_i, x) + b \qquad (7)$$

for the training set $T = \{(x_i, y_i);\ x_i \in R^n;\ y_i = \pm 1;\ i = 1, 2, \ldots, r\}$, where $x_i$ is a training sample, $x$ is the sample to be judged, $b$ is the bias, $\alpha_i$ are Lagrange multipliers, and $K(x_i, x)$ is the kernel function;
The optimization problem solved by the support vector machine and the constructed classification decision function are as follows:

$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{r}\sum_{j=1}^{r}y_i y_j\alpha_i\alpha_j K(x_i, x_j) - \sum_{j=1}^{r}\alpha_j \qquad (8)$$

$$\text{s.t.}\quad \sum_{i=1}^{r}y_i\alpha_i = 0,\qquad 0\le\alpha_i\le C,\qquad i = 1, 2, \ldots, r;$$

$$f(x) = \operatorname{sgn}\!\left(\sum_{i=1}^{r}\alpha_i y_i K(x, x_i) + b\right) \qquad (9)$$

where $K(x, x_i)$ is the kernel function, $x_i$ is a training sample, $b$ is the bias, and $\alpha_i$ are Lagrange multipliers; the kernel maps the feature space to a higher-dimensional space; in practical applications the number of characteristic genes is small, so an SVM classifier with an RBF kernel is used to classify the tumor samples, the RBF kernel being:

$$K(x, x_i) = \exp\!\left(-\frac{\|x - x_i\|^{2}}{2\sigma^{2}}\right) \qquad (10)$$

Step 3.2: set the swarm size to N, the required classification accuracy to F, the expansion factor to Ex and the local window size to $w = [w_1, w_2]$; the maximum number of retries is $T_{\max}$, and the retry counter t and the expansion factor start at 0;
Step 3.3: starting from the search space $p = [p_1, p_2]$ set when the algorithm is initialized, enlarge the search range by the expansion factor Ex and compute the local positions so that the local region falls within the search space $p + 0.6\,Ex \cdot w$;
Step 3.4: compute the classification performance function $f_p$ corresponding to x;
Step 3.5: search for the optimum with the IGPSO algorithm and obtain the classification performance function $f_c$ corresponding to the optimum;
Step 3.6: if a better classification rate is found ($f_p < f_c$), set t and Ex to 0; otherwise t = t + 1;
Step 3.7: if $t \ge T_{\max}$, set t to 0 and Ex = Ex + 1; the search may now be trapped in a local optimum, so the search range is enlarged to escape the current local region;
Step 3.8: if the classification accuracy requirement is reached ($f_p \le F$), output the values of {C, σ} and the classification accuracy and terminate; otherwise go to step 3.3.
CN201710122492.5A 2017-03-03 2017-03-03 Tumor identification method based on deterministic particle swarm optimization and support vector machine Active CN106971091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710122492.5A CN106971091B (en) 2017-03-03 2017-03-03 Tumor identification method based on deterministic particle swarm optimization and support vector machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710122492.5A CN106971091B (en) 2017-03-03 2017-03-03 Tumor identification method based on deterministic particle swarm optimization and support vector machine

Publications (2)

Publication Number Publication Date
CN106971091A true CN106971091A (en) 2017-07-21
CN106971091B CN106971091B (en) 2020-08-28

Family

ID=59328372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710122492.5A Active CN106971091B (en) 2017-03-03 2017-03-03 Tumor identification method based on deterministic particle swarm optimization and support vector machine

Country Status (1)

Country Link
CN (1) CN106971091B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107643948A (en) * 2017-09-30 2018-01-30 广东欧珀移动通信有限公司 Application program management-control method, device, medium and electronic equipment
CN108629158A (en) * 2018-05-14 2018-10-09 浙江大学 A kind of intelligent Lung Cancer cancer cell detector
CN108875305A (en) * 2018-05-14 2018-11-23 浙江大学 A kind of leukaemia cancer cell detector of colony intelligence optimizing
CN109947941A (en) * 2019-03-05 2019-06-28 永大电梯设备(中国)有限公司 A kind of method and system based on elevator customer service text classification
CN110060740A (en) * 2019-04-16 2019-07-26 中国科学院深圳先进技术研究院 A kind of nonredundancy gene set clustering method, system and electronic equipment
CN111383710A (en) * 2020-03-13 2020-07-07 闽江学院 Gene splice site recognition model construction method based on particle swarm optimization gemini support vector machine
CN111582370A (en) * 2020-05-08 2020-08-25 重庆工贸职业技术学院 Brain metastasis tumor prognostic index reduction and classification method based on rough set optimization
CN113707216A (en) * 2021-08-05 2021-11-26 北京科技大学 Infiltration immune cell proportion counting method
CN113808659A (en) * 2021-08-26 2021-12-17 四川大学 Feedback phase regulation and control method based on gene gradient particle swarm optimization

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258244A (en) * 2013-04-28 2013-08-21 西北师范大学 Method for predicting inhibiting concentration of pyridazine HCV NS5B polymerase inhibitor based on particle swarm optimization support vector machine
CN105372202A (en) * 2015-10-27 2016-03-02 九江学院 Genetically modified cotton variety recognition method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258244A (en) * 2013-04-28 2013-08-21 西北师范大学 Method for predicting inhibiting concentration of pyridazine HCV NS5B polymerase inhibitor based on particle swarm optimization support vector machine
CN105372202A (en) * 2015-10-27 2016-03-02 九江学院 Genetically modified cotton variety recognition method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
童燕 et al.: "An improved SVM training algorithm based on particle swarm optimization", Computer Engineering and Applications (《计算机工程与应用》) *
郝爱丽: "Research on tumor gene identification with a support vector machine classification model", China Master's Theses Full-text Database, Information Science and Technology series (《中国优秀硕士学位论文全文数据库 信息科技辑》) *
韩飞 et al.: "An improved particle swarm optimization algorithm based on gradient search", Journal of Nanjing University (Natural Sciences) (《南京大学学报(自然科学)》) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107643948A (en) * 2017-09-30 2018-01-30 广东欧珀移动通信有限公司 Application program management-control method, device, medium and electronic equipment
CN107643948B (en) * 2017-09-30 2020-06-02 Oppo广东移动通信有限公司 Application program control method, device, medium and electronic equipment
CN108629158A (en) * 2018-05-14 2018-10-09 浙江大学 A kind of intelligent Lung Cancer cancer cell detector
CN108875305A (en) * 2018-05-14 2018-11-23 浙江大学 A kind of leukaemia cancer cell detector of colony intelligence optimizing
CN109947941A (en) * 2019-03-05 2019-06-28 永大电梯设备(中国)有限公司 A kind of method and system based on elevator customer service text classification
CN110060740A (en) * 2019-04-16 2019-07-26 中国科学院深圳先进技术研究院 A kind of nonredundancy gene set clustering method, system and electronic equipment
CN111383710A (en) * 2020-03-13 2020-07-07 闽江学院 Gene splice site recognition model construction method based on particle swarm optimization gemini support vector machine
CN111582370A (en) * 2020-05-08 2020-08-25 重庆工贸职业技术学院 Brain metastasis tumor prognostic index reduction and classification method based on rough set optimization
CN113707216A (en) * 2021-08-05 2021-11-26 北京科技大学 Infiltration immune cell proportion counting method
CN113808659A (en) * 2021-08-26 2021-12-17 四川大学 Feedback phase regulation and control method based on gene gradient particle swarm optimization
CN113808659B (en) * 2021-08-26 2023-06-13 四川大学 Feedback phase regulation and control method based on gene gradient particle swarm algorithm

Also Published As

Publication number Publication date
CN106971091B (en) 2020-08-28

Similar Documents

Publication Publication Date Title
CN106971091A (en) A kind of tumour recognition methods based on certainty particle group optimizing and SVMs
Ghareb et al. Hybrid feature selection based on enhanced genetic algorithm for text categorization
Song et al. Feature selection using bare-bones particle swarm optimization with mutual information
CN105426426B (en) A kind of KNN file classification methods based on improved K-Medoids
Li et al. An ant colony optimization based dimension reduction method for high-dimensional datasets
Huang et al. Using glowworm swarm optimization algorithm for clustering analysis
CN108363810A (en) A kind of file classification method and device
Alomari et al. A hybrid filter-wrapper gene selection method for cancer classification
CN106778853A (en) Unbalanced data sorting technique based on weight cluster and sub- sampling
CN104750844A (en) Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts
CN111062425B (en) Unbalanced data set processing method based on C-K-SMOTE algorithm
CN110909158B (en) Text classification method based on improved firefly algorithm and K nearest neighbor
CN105045913B (en) File classification method based on WordNet and latent semantic analysis
CN106548041A (en) A kind of tumour key gene recognition methods based on prior information and parallel binary particle swarm optimization
Arowolo et al. A survey of dimension reduction and classification methods for RNA-Seq data on malaria vector
CN108171012A (en) A kind of gene sorting method and device
CN113436684A (en) Cancer classification and characteristic gene selection method
CN106951728B (en) Tumor key gene identification method based on particle swarm optimization and scoring criterion
Das et al. Group incremental adaptive clustering based on neural network and rough set theory for crime report categorization
Chen et al. A new particle swarm feature selection method for classification
CN105512675A (en) Memory multi-point crossover gravitational search-based feature selection method
Wang et al. Bayesian penalized method for streaming feature selection
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
Afif et al. Genetic algorithm rule based categorization method for textual data mining
Sahu Multi filter ensemble method for cancer prognosis and Diagnosis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant