CN106971091A

CN106971091A - A tumor recognition method based on deterministic particle swarm optimization and a support vector machine

Info

Publication number: CN106971091A
Application number: CN201710122492.5A
Authority: CN
Inventors: 韩飞; 李佳玲; 凌青华; 周从华; 崔宝祥; 宋余庆
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2017-03-03
Filing date: 2017-03-03
Publication date: 2017-07-21
Anticipated expiration: 2037-03-03
Also published as: CN106971091B

Abstract

The invention discloses a tumor recognition method based on deterministic particle swarm optimization and a support vector machine. The method comprises preprocessing the tumor gene expression profile data: informative genes are preliminarily selected on the training set with the classification information index method, and redundant genes are then removed with a pairwise redundancy method to obtain a candidate gene pool; the key gene subset is then obtained by applying the classification information index method again on the training set. The parameters of the support vector machine are optimized on the training set with a deterministic particle swarm optimization algorithm, and the tumor gene expression profile data to be identified are then recognized. Building on the suitability of support vector machines for recognizing small-sample data, the invention optimizes the support vector machine with deterministic particle swarm optimization, further improving its performance and thereby the accuracy of tumor recognition.

Description

A tumor recognition method based on deterministic particle swarm optimization and a support vector machine

Technical field

The invention belongs to the field of computer analysis of tumor gene expression profile data, and in particular relates to a tumor recognition method based on deterministic particle swarm optimization and a support vector machine.

Background technology

DNA microarray technology has brought great opportunities to biology, but the large volume of complex microarray data it produces poses great challenges to researchers in related fields, for four main reasons. First, microarray data contain a large amount of noise and outliers. Noise and outliers commonly arise during experiments, and data processing can also introduce errors or mislabeled sample classes, so robust processing methods are needed. Second, gene expression profile data sets are very large, and handling large-scale data sets is itself a difficulty; designing efficient algorithms with low computational and space complexity is therefore very worthwhile. Third, microarray data are high-dimensional with few samples. For a gene expression profile data set, the scale of the classification computation grows exponentially with the number of genes, so coping with the curse of dimensionality is another difficulty. Fourth, microarray data exhibit nonlinear behavior and conceal a large amount of useful information. It is therefore important to adapt classical statistical analysis methods into nonlinear analysis methods that can handle nonlinear data sets, and to use these methods to mine and derive the latent biological information.

Since Golub et al. pioneered the use of gene expression profiles for tumor classification in 1999, researchers have proposed many classification methods based on gene expression profiles, some of which are in common use. Different classification algorithms yield different classifiers, including classical classifiers such as Bayesian classifiers, support vector machines (SVM) and artificial neural networks, which learn from known sample class information in order to extract the information needed for sample classification. Experimental results in tumor classification show that different classifiers differ in their ability to classify the same data set; that is, even a good classifier can hardly achieve high classification performance on all data sets. The advantages of SVM are that it handles high-dimensional sample data, has high classification accuracy and strong noise resistance, and does not require many parameters to be tuned and input. In addition, it is scalable: the number of support vectors after training is generally small, which is very useful for gene expression profiles whose matrix dimension keeps growing. Although SVM is suitable for recognizing small-sample data, selecting its parameters is time-consuming, and there is currently no effective theory to guide parameter selection in SVM, which affects its classification performance.

Particle swarm optimization (PSO) has good global search ability. Compared with genetic algorithms, PSO requires no complex genetic operations, has few adjustable parameters and is easy to implement, so it has been widely applied in recent years. In traditional PSO, however, the randomness of the particle search leads to many blind search steps, so the search takes long and search performance leaves room for improvement. Deterministic search based on gradient search is therefore introduced into the particle swarm optimization algorithm, and combining random search with deterministic search improves the search speed and precision of the swarm.

Summary of the invention

Object of the invention: to optimize the parameters of the support vector machine with an improved particle swarm optimization algorithm (IGPSO), thereby improving the performance of the support vector machine, and to apply it to the recognition of tumor expression profile data so as to improve the accuracy of tumor recognition. Compared with traditional tumor expression profile recognition methods, this method effectively improves tumor recognition accuracy.

Technical solution: a tumor recognition method based on deterministic particle swarm optimization and a support vector machine, comprising gene subset screening based on the classification information index and the pairwise redundancy method, and optimization of the support vector machine with a deterministic particle swarm optimization algorithm (improved particle swarm optimization based on gradient search, IGPSO) to recognize tumor gene expression profile data. The method comprises the following steps:

Step 1: Preprocess the tumor gene expression profile data set. The data set is first divided into a training set and a test set and is then normalized, and the final key gene subset is obtained;

Step 2: Propose the deterministic particle swarm optimization algorithm (IGPSO) and, on the training set, use it to optimize the support vector machine (SVM);

Step 3: On the test set, use the support vector machine obtained by the optimization in Step 2 to recognize the tumor gene expression profile data set.

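Further, Step 1 comprises the following steps:

Step 1.1: Divide the tumor gene expression profile data set into a training set and a test set;

Step 1.2: Calculate the "classification information index" of each gene in the training set according to formula (1):

d(g) = \frac{1}{2} \frac{|\mu_{g^{+}} - \mu_{g^{-}}|}{\sigma_{g^{+}} + \sigma_{g^{-}}} + \frac{1}{2} \ln\left(\frac{\sigma_{g^{+}}^{2} + \sigma_{g^{-}}^{2}}{2 \sigma_{g^{+}} \sigma_{g^{-}}}\right) \quad (1)

where d(g) is the classification information index of gene g, \mu_{g^{+}} and \mu_{g^{-}} are the mean expression levels of gene g in the positive and negative sample classes, and \sigma_{g^{+}} and \sigma_{g^{-}} are the standard deviations of the expression level of gene g in the positive and negative sample classes.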

Step 1.3: Select all genes whose classification information index exceeds a certain threshold as the gene set after preliminary filtering.

Step 1.4: After the preliminary filtering with the classification information index method, calculate the Pearson correlation coefficient between the expression levels of each pair of genes and select the gene set exceeding a certain value, thereby reducing the size of the candidate gene pool again.

Step 1.5: Apply the classification information index method again within the candidate gene pool, and select all genes exceeding a certain threshold as the final key gene subset.
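By way of illustration only, Steps 1.2 to 1.4 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function names (`class_info_index`, `pearson`, `select_genes`), the use of the population standard deviation, and the reading of Step 1.4 as "of two highly correlated genes, keep the one with the larger index" are all assumptions introduced here. Standard deviations are assumed to be nonzero. Step 1.5 would apply `class_info_index` again with a second threshold.

```python
import math
from statistics import mean, pstdev


def class_info_index(pos, neg):
    """Formula (1): classification information index d(g) of one gene.

    pos / neg are the gene's expression levels in the two sample classes.
    Assumption: population standard deviation, and both deviations nonzero.
    """
    mu_p, mu_n = mean(pos), mean(neg)
    sd_p, sd_n = pstdev(pos), pstdev(neg)
    term1 = 0.5 * abs(mu_p - mu_n) / (sd_p + sd_n)
    term2 = 0.5 * math.log((sd_p ** 2 + sd_n ** 2) / (2 * sd_p * sd_n))
    return term1 + term2


def pearson(a, b):
    """Pearson correlation coefficient between two expression vectors."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def select_genes(expr, labels, d_threshold, r_threshold):
    """expr maps gene name -> expression values per sample; labels are +1 / -1."""
    scores = {}
    for gene, values in expr.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == -1]
        scores[gene] = class_info_index(pos, neg)

    # Step 1.3: keep genes whose index exceeds the threshold, best first.
    candidates = sorted(
        (g for g in scores if scores[g] > d_threshold),
        key=scores.get,
        reverse=True,
    )

    # Step 1.4 (assumed interpretation): drop a gene when it is highly
    # correlated with a better-scoring gene that has already been kept.
    kept = []
    for g in candidates:
        if all(abs(pearson(expr[g], expr[h])) <= r_threshold for h in kept):
            kept.append(g)
    return kept, scores
```

Processing the candidates in descending order of d(g) means that, under this assumed interpretation, the more informative gene of each redundant pair is the one retained.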

Further, the deterministic particle swarm optimization algorithm proposed in Step 2 comprises the following steps:

Step 2.1: Within the initial range, randomly initialize the positions (x) and velocities (v) of the swarm and the population diversity threshold (σ) of each function;

Step 2.2: Calculate the fitness value of each particle and the gradient of the fitness function with respect to the particle's position;

Step 2.3: For each particle, compare its fitness value with the fitness value of the best position it has experienced; if it is better, take it as the current best position;

Step 2.4: For each particle, compare its fitness value with the fitness value of the best position experienced by the swarm; if it is better, take it as the swarm's best position;

Step 2.5: When the population diversity value is greater than the set threshold, update the velocity of each particle according to formula (2); otherwise update it according to formula (5); then update the position of the particle;

The deterministic particle swarm optimization algorithm is divided into two stages. The first stage is the mutual attraction of the particles and can be divided into two steps. First, when the population diversity value is greater than an appropriate threshold, the particles move along the negative gradient direction of the fitness function with respect to their positions and gather toward the global best particle. Then, when the search reaches the neighborhood of some optimum, a gradual descent strategy is used, continually reducing the particle velocity to perform a line search. These two steps are described by formula (2) and formula (3), respectively.

v_{ij}(t+1) = w \cdot gra(i, j) + c_2 \cdot rand() \cdot (p_g - x_{ij}(t)) \quad (2)

v_{ij}(t+1) = k \cdot v_{ij}(t) \quad (3)

where V_i = (v_i1, v_i2, ..., v_in) is the current velocity of particle i, X_i = (x_i1, x_i2, ..., x_in) is the current position of particle i, w is the inertia weight, p_g is the global best position, and k is a constant in (0, 1). For the fitness function f(x), the corresponding negative gradient gra(i, j) is as follows:

The second stage is the mutual repulsion of the particles. When the population diversity value is less than the predetermined threshold, the particles are adaptively repelled to increase population diversity, while each particle searches along the gradient direction and approaches other local optima. Clearly, the larger the population diversity, the smaller the scattering velocity, and the smaller the population diversity, the larger the scattering velocity. The particle velocity update formula is as follows:

v_{ij}(t+1) = w \cdot v_{ij}(t) + c_1 \cdot rand() \cdot gra(i, j) - c_2 \cdot rand() \cdot \frac{1}{diversity} \cdot (p_g - x_{ij}(t)) \quad (5)

where diversity is the population diversity calculated by formula (6):

diversity(S) = \frac{1}{|S| \cdot |L|} \sum_{i=1}^{|S|} \sqrt{\sum_{j=1}^{N} (p_{ij} - \bar{p}_{j})^{2}} \quad (6)

where S is the swarm, |S| is the number of particles in the swarm, |L| is the maximum radius of the search space, N is the dimension of the problem, p_ij is the j-th component of the i-th particle, and \bar{p}_{j} is the j-th component of the mean position of the swarm.

Step 2.6: If the termination condition has not been reached, go to Step 2.2; otherwise output the fitness value.
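By way of illustration only, the diversity measure of formula (6) and the velocity updates of formulas (2), (3) and (5) might be sketched in Python as follows. This is a sketch under stated assumptions rather than the implementation of the invention: the function names, the default values of `w`, `c1` and `c2`, and the `rng` parameter are hypothetical, the fitness function `f` is assumed to be minimized, and because the expression for gra(i, j) is not reproduced in this text, a central-difference estimate of the negative gradient is used as a stand-in.

```python
import math
import random


def diversity(swarm, radius):
    """Formula (6): mean distance of the particles from the swarm's mean
    position, scaled by the maximum radius |L| of the search space."""
    n = len(swarm[0])
    centroid = [sum(p[j] for p in swarm) / len(swarm) for j in range(n)]
    total = sum(
        math.sqrt(sum((p[j] - centroid[j]) ** 2 for j in range(n)))
        for p in swarm
    )
    return total / (len(swarm) * radius)


def neg_gradient(f, x, h=1e-6):
    """Stand-in for gra(i, j): central-difference estimate of -df/dx_j.
    This numerical estimate is an assumption made for the sketch."""
    grad = []
    for j in range(len(x)):
        up = x[:j] + [x[j] + h] + x[j + 1:]
        down = x[:j] + [x[j] - h] + x[j + 1:]
        grad.append(-(f(up) - f(down)) / (2 * h))
    return grad


def update_velocity(f, x, v, p_g, div, threshold,
                    w=0.7, c1=1.5, c2=1.5, rng=random):
    """Attraction (formula (2)) when diversity exceeds the threshold,
    repulsion (formula (5)) otherwise. div must be nonzero for formula (5)."""
    gra = neg_gradient(f, x)
    if div > threshold:
        return [w * gra[j] + c2 * rng.random() * (p_g[j] - x[j])
                for j in range(len(x))]
    return [w * v[j]
            + c1 * rng.random() * gra[j]
            - c2 * rng.random() * (1.0 / div) * (p_g[j] - x[j])
            for j in range(len(x))]


def decay_velocity(v, k):
    """Formula (3): shrink the velocity by a constant k in (0, 1)."""
    return [k * vj for vj in v]
```

In the repulsion branch the factor 1/diversity grows as the swarm collapses, which is how the sketch reflects the statement that a smaller population diversity gives a larger scattering velocity.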

Further, optimizing the support vector machine with the deterministic particle swarm optimization algorithm in Step 3 comprises the following steps:

Step 3.1: Set the search space of the SVM parameters C and σ, x_{i,min} ≤ x_i ≤ x_{i,max}, where C is the penalty factor, σ is the kernel function parameter, x_i is a parameter value, and i indexes the parameters, of which there are 2 here. When the algorithm starts, a parameter value x is chosen at random in the search space;

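The SVM classification rule is given by formula (7):

f(x) = \sum_{i=1}^{r} \alpha_{i} y_{i} K(x_{i}, x) + b \quad (7)

The training set is T = {(x_i, y_i); x_i ∈ Rⁿ; y_i = ±1; i = 1, 2, ..., r}, where x_i is a training sample, x is the sample to be classified, b is the threshold, α_i is a Lagrange multiplier, and K(x_i, x) is the kernel function;

The optimization problem solved by the support vector machine and the resulting classification decision function are as follows:

\min_{\alpha} \frac{1}{2} \sum_{i=1}^{r} \sum_{j=1}^{r} y_{i} y_{j} \alpha_{i} \alpha_{j} K(x_{i}, x_{j}) - \sum_{j=1}^{r} \alpha_{j} \quad (8)

s.t. \sum_{i=1}^{r} y_{i} \alpha_{i} = 0, \quad 0 \leq \alpha_{i} \leq C, \quad i = 1, 2, \ldots, r

f(x) = sgn\left(\sum_{i=1}^{r} \alpha_{i} y_{i} K(x, x_{i}) + b\right) \quad (9)

where K(x, x_i) is the kernel function, whose role is to map the feature space to a higher-dimensional space, x_i is a training sample, b is the threshold, and α_i is a Lagrange multiplier. In practical applications the number of feature genes is small, so an SVM classifier based on the radial basis function (RBF) is used to classify the tumor samples. The RBF is expressed as follows:

K(x, x_{i}) = \exp\left(-\frac{\|x - x_{i}\|^{2}}{2 \sigma^{2}}\right) \quad (10)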

Step 3.2: Set the swarm size to N, the classification accuracy requirement to F, the expansion factor to Ex, the local region size to w = [w₁, w₂] and the maximum number of retries to T_max; the retry count t and the expansion factor both start at 0;

Step 3.3: Starting from the search space p = [p₁, p₂] initialized when the algorithm starts, expand the search space by the expansion factor Ex, and calculate the local position as follows so that the local region falls within this search space: p + 0.6Ex*w;

Step 3.4: Calculate the classification performance function f_p corresponding to x;

Step 3.5: Search for the optimum with the IGPSO algorithm and obtain the classification performance function f_c corresponding to the optimum;

Step 3.6: If a better classification rate is found (f_p < f_c), set t and Ex to 0; otherwise set t = t + 1;

Step 3.7: If t ≥ T_max, set t to 0 and Ex = Ex + 1; at this point the search may be trapped in a local optimum, so the search range is enlarged to escape the current local region;

Step 3.8: If the classification accuracy requirement is met (f_p ≤ F), output the values of {C, σ} and the classification accuracy, and the algorithm ends; otherwise go to Step 3.3.
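By way of illustration only, the RBF kernel and the classification decision function described in Step 3.1 might be evaluated as in the following Python sketch. It assumes that the Lagrange multipliers, the threshold b and the support vectors have already been obtained by solving the SVM optimization problem, which the sketch does not do. The function names `rbf_kernel` and `decide` are hypothetical, and mapping a decision value of exactly 0 to the class +1 is an assumption.

```python
import math


def rbf_kernel(x, xi, sigma):
    """RBF kernel: exp(-||x - xi||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def decide(x, support, alphas, labels, b, sigma):
    """Decision function: sign of sum_i alpha_i * y_i * K(x, x_i) + b.

    support, alphas and labels describe an already trained SVM.
    Assumption: a value of exactly 0 is assigned to class +1.
    """
    s = sum(a * y * rbf_kernel(x, xi, sigma)
            for a, y, xi in zip(alphas, labels, support)) + b
    return 1 if s >= 0 else -1
```

A smaller σ makes the kernel decay faster with distance, so each support vector influences a narrower neighborhood; the penalty factor C does not appear here because it only constrains the multipliers during training.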

Beneficial effects: High-dimensional, small-sample tumor gene expression profile data sets contain a large amount of noisy data. The support vector machine generalizes well and has long been used for data classification. However, its classification performance depends on parameter selection, a problem that has never been solved satisfactorily and that greatly limits the application of SVM. The particle swarm optimization algorithm with deterministic search in the present invention performs local search using gradient information. When a particle reaches the neighborhood of some optimum, its velocity is continually reduced so that the step size in the line search does not become too large. The diversity measure together with the attraction and repulsion principle controls the global search: when the swarm enters a local optimum, the particles are adaptively repelled to preserve diversity, so that the algorithm can finally converge quickly to a more precise solution. Optimizing the RBF kernel function parameter and the penalty factor of the SVM with the particle swarm optimization algorithm based on deterministic search therefore improves the classification performance of the SVM, which helps to improve the recognition accuracy of tumor gene expression profile data.

Brief description of the drawings

Fig. 1 is the structural block diagram of the present invention;

Fig. 2 is the flow chart of the deterministic particle swarm optimization algorithm in the present invention.

Detailed description of the embodiments

A tumor recognition method based on deterministic particle swarm optimization and a support vector machine, comprising screening of genes based on the classification information index and the pairwise redundancy method, and tumor gene recognition with a support vector machine optimized by the deterministic particle swarm optimization algorithm (IGPSO), comprises the following steps:

Step 2: Propose the deterministic particle swarm optimization algorithm (IGPSO);

Step 3: On the training set, optimize the support vector machine (SVM) with the deterministic particle swarm optimization algorithm;

Step 4: On the test set, use the support vector machine obtained by the optimization in Step 3 to recognize the tumor gene expression profile data set.


The execution of the present invention is briefly illustrated below, taking tumor gene expression profile data as an example. This example uses the colon cancer tumor expression profile data set, which contains 62 samples in total, each represented by the expression values of 2000 genes. The 62 samples comprise 22 normal samples and 40 tumor samples. On this data set, the specific steps of the invention are as follows:

As shown in Fig. 1, a tumor recognition method based on deterministic particle swarm optimization and a support vector machine, comprising screening of genes based on the classification information index and the pairwise redundancy method, and tumor gene recognition with a support vector machine optimized by the deterministic particle swarm optimization algorithm (IGPSO), comprises the following steps:

(1) The data set is divided into a training set and a test set, and on the training set each gene is scored with the improved signal-to-noise ratio formula of the classification information index method. The larger a gene's information index, the more sample classification information it contains and the stronger its ability to classify samples. Table 1 shows the distribution of the classification information index for the colon cancer data set. The 173 genes whose information index exceeds 0.5 are selected from the colon cancer data set as the gene subset for the subsequent analysis.

(2) Redundant genes are excluded by calculating the Pearson correlation coefficient between the expression levels of two genes. The colon cancer data are analyzed using the 173 genes selected above. After the "pairwise redundancy" calculation and comparison, 59 genes are finally obtained.

(3) The gene subset obtained above is scored again with the improved signal-to-noise ratio formula of the classification information index, and 11 genes of the colon cancer data set are selected by information index as the final key gene subset. Table 2 shows the key gene subset finally screened for colon cancer.

(4) The two parameters of the support vector machine are initialized with the search range {0 < C < 16, 0 < σ < 6}; the maximum number of retries is set to 10, the expansion step of C is 0.3 and the expansion step of σ is 0.1. Using the 11 key genes obtained above, the two parameters of the support vector machine are optimized with the IGPSO algorithm. Within the local region, the particles of the IGPSO algorithm search along the gradient direction of the classification rate (the performance function) with respect to the parameters {C, σ}; if the maximum number of retries is reached without finding a better classification rate, the search range is expanded.

(5) On the test set, the samples are classified with the SVM optimized by the IGPSO algorithm. Table 3 shows the classification results on the colon cancer samples.

Table 1 gives the distribution of the colon cancer classification information index.

Table 1: Distribution of the colon cancer classification information index in the present invention

Gene information index	Number of genes	Percentage of the 2000 genes (%)
0.0~0.3	1524	76.2
0.3~0.5	303	15.15
0.5~1.897	173	8.65
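As a quick consistency check on Table 1, the percentages can be recomputed from the gene counts. The short Python sketch below is illustrative only; the variable names are introduced here, and the counts are copied from the table.

```python
# Gene counts per classification information index range, copied from Table 1.
counts = {"0.0~0.3": 1524, "0.3~0.5": 303, "0.5~1.897": 173}

total = sum(counts.values())  # expected to equal the 2000 genes per sample
shares = {rng: 100 * n / total for rng, n in counts.items()}

for rng, pct in shares.items():
    print(f"{rng}: {counts[rng]} genes, {pct:.2f}% of {total}")
```

The three counts add up to the 2000 genes of the data set, and the last row corresponds to the 173 genes with an information index above 0.5 that are carried forward in step (1).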

Table 2 gives the colon cancer gene subset to be classified.

Table 2: Colon cancer gene subset in the present invention

Table 3 gives the classification results on the colon cancer samples in the present invention. When the penalty factor C is small, the classification error rate is high; as C increases, the error rate drops sharply, that is, classification performance improves rapidly. As C continues to increase, the change in classification performance is not significant, and once C reaches a certain value the performance no longer changes with C; in other words, SVM is insensitive to C over a fairly wide range. In the experiments, classification accuracy was high for C in the range (6, 15), that is, SVM is insensitive to C in this region. The experiments also show that, at the point of best classification, appropriately reducing the value of σ, that is, making a suitable correction to it, can markedly improve classification accuracy. The final experimental results show that σ values in (0.9, 1.88) give good classification.

Table 3: Classification results of the present invention on the colon cancer samples

Table 4 compares the method proposed by the present invention with related SVM methods.

Table 4: Comparison of the method of the present invention with related SVM methods

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "illustrative example", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and spirit of the present invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A tumor recognition method based on deterministic particle swarm optimization and a support vector machine, characterized by comprising the following steps:

Step 1: Preprocess the tumor gene expression profile data set: first divide the tumor gene expression profile data set into a training set and a test set, then normalize the data set, and obtain the final key gene subset; Step 2: Propose the deterministic particle swarm optimization algorithm IGPSO and, on the training set, optimize the support vector machine SVM with the deterministic particle swarm optimization algorithm; Step 3: On the test set, use the support vector machine SVM obtained by the optimization in Step 2 to recognize the tumor gene expression profile data set.

2. The tumor recognition method based on deterministic particle swarm optimization and a support vector machine according to claim 1, characterized in that Step 1 comprises the following steps:

Step 1.2: Calculate the "classification information index" of each gene in the training set according to formula (1);

d(g) = \frac{1}{2} \frac{|\mu_{g^{+}} - \mu_{g^{-}}|}{\sigma_{g^{+}} + \sigma_{g^{-}}} + \frac{1}{2} \ln\left(\frac{\sigma_{g^{+}}^{2} + \sigma_{g^{-}}^{2}}{2 \sigma_{g^{+}} \sigma_{g^{-}}}\right) \quad (1)

where d(g) is the classification information index of gene g, \mu_{g^{+}} and \mu_{g^{-}} are the mean expression levels of gene g in the positive and negative sample classes, and \sigma_{g^{+}} and \sigma_{g^{-}} are the standard deviations of the expression level of gene g in the positive and negative sample classes;

Step 1.3: Select all genes exceeding a certain classification information index threshold as the gene set after preliminary filtering;

Step 1.4: After the preliminary filtering with the classification information index method, calculate the Pearson correlation coefficient between the expression levels of each pair of genes and select the gene set exceeding a certain value, thereby reducing the size of the candidate gene pool again;

Step 1.5: To further narrow the key gene set, apply the classification information index method again within the candidate gene pool, and select all genes exceeding a certain threshold as the final key gene subset.

3. The tumor recognition method based on deterministic particle swarm optimization and a support vector machine according to claim 1, characterized in that the deterministic particle swarm optimization algorithm IGPSO proposed in Step 2 comprises the following steps:

Step 2.1: Within the initial range, randomly initialize the positions x and velocities v of the swarm and the population diversity threshold σ of each function;

Step 2.3: For each particle, compare its fitness value with the fitness value of the best position it has experienced; if it is better, take it as the current best position;

Step 2.4: For each particle, compare its fitness value with the fitness value of the best position experienced by the swarm; if it is better, take it as the swarm's best position;

Step 2.5: When the population diversity value is greater than the set threshold, update the velocity of each particle according to formula (2); otherwise update it according to formula (5); then update the position of the particle;

The deterministic particle swarm optimization algorithm is divided into two stages; the first stage is the mutual attraction of the particles and is divided into two steps: first, when the population diversity value is greater than an appropriate threshold, the particles move along the negative gradient direction of the fitness function with respect to their positions and gather toward the global best particle; then, when the search reaches the neighborhood of some optimum, a gradual descent strategy is used, continually reducing the particle velocity to perform a line search; these two steps are described by formula (2) and formula (3), respectively;

v_{ij}(t+1) = w \cdot gra(i, j) + c_2 \cdot rand() \cdot (p_g - x_{ij}(t)) \quad (2)

v_{ij}(t+1) = k \cdot v_{ij}(t) \quad (3)

where V_i = (v_i1, v_i2, ..., v_in) is the current velocity of particle i, X_i = (x_i1, x_i2, ..., x_in) is the current position of particle i, w is the inertia weight, p_g is the global best position, and k is a constant in (0, 1); for the fitness function f(x), the corresponding negative gradient gra(i, j) is as follows:

The second stage is the mutual repulsion of the particles; when the population diversity value is less than the predetermined threshold, the particles are adaptively repelled to increase population diversity, while each particle searches along the gradient direction and approaches other local optima; the larger the population diversity, the smaller the scattering velocity, and the smaller the population diversity, the larger the scattering velocity; the particle velocity update formula is as follows:

v_{ij}(t+1) = w \cdot v_{ij}(t) + c_1 \cdot rand() \cdot gra(i, j) - c_2 \cdot rand() \cdot \frac{1}{diversity} \cdot (p_g - x_{ij}(t)) \quad (5)

where diversity is the population diversity calculated by formula (6);

diversity(S) = \frac{1}{|S| \cdot |L|} \sum_{i=1}^{|S|} \sqrt{\sum_{j=1}^{N} (p_{ij} - \bar{p}_{j})^{2}} \quad (6)

where S is the swarm, |S| is the number of particles in the swarm, |L| is the maximum radius of the search space, N is the dimension of the problem, and p_ij is the j-th component of the i-th particle;

4. The tumor recognition method based on deterministic particle swarm optimization and a support vector machine according to claim 1, characterized in that optimizing the support vector machine SVM with the deterministic particle swarm optimization algorithm in Step 2 comprises the following steps:

Step 3.1: Set the search space of the SVM parameters C and σ, x_{i,min} ≤ x_i ≤ x_{i,max}, where C is the penalty factor, σ is the kernel function parameter, x_i is a parameter value, and i indexes the parameters, of which there are 2 here; when the algorithm starts, a parameter value x is chosen at random in the search space; the SVM classification rule is given by formula (7):

f(x) = \sum_{i=1}^{r} \alpha_{i} y_{i} K(x_{i}, x) + b \quad (7)

The training set is T = {(x_i, y_i); x_i ∈ Rⁿ; y_i = ±1; i = 1, 2, ..., r}, where x_i is a training sample, x is the sample to be classified, b is the threshold, α_i is a Lagrange multiplier, and K(x_i, x) is the kernel function;

\min_{\alpha} \frac{1}{2} \sum_{i=1}^{r} \sum_{j=1}^{r} y_{i} y_{j} \alpha_{i} \alpha_{j} K(x_{i}, x_{j}) - \sum_{j=1}^{r} \alpha_{j} \quad (8)

s.t. \sum_{i=1}^{r} y_{i} \alpha_{i} = 0, \quad 0 \leq \alpha_{i} \leq C, \quad i = 1, 2, \ldots, r;

f(x) = sgn\left(\sum_{i=1}^{r} \alpha_{i} y_{i} K(x, x_{i}) + b\right) \quad (9)

where K(x, x_i) is the kernel function, whose role is to map the feature space to a higher-dimensional space, x_i is a training sample, b is the threshold, and α_i is a Lagrange multiplier; in practical applications the number of feature genes is small, so an SVM classifier based on the radial basis function (RBF) is used to classify the tumor samples, and the RBF is expressed as follows:

K(x, x_{i}) = \exp\left(-\frac{\|x - x_{i}\|^{2}}{2 \sigma^{2}}\right) \quad (10)

Step 3.2: Set the swarm size to N, the classification accuracy requirement to F, the expansion factor to Ex, the local region size to w = [w₁, w₂] and the maximum number of retries to T_max; the retry count t and the expansion factor both start at 0;

Step 3.3: Starting from the search space p = [p₁, p₂] initialized when the algorithm starts, expand the search space by the expansion factor Ex, and calculate the local position as follows so that the local region falls within this search space: p + 0.6Ex*w;

Step 3.7: If t ≥ T_max, set t to 0 and Ex = Ex + 1; at this point the search may be trapped in a local optimum, so the search range is enlarged to escape the current local region;

Step 3.8: If the classification accuracy requirement is met (f_p ≤ F), output the values of {C, σ} and the classification accuracy, and the algorithm ends; otherwise go to Step 3.3.