CN106971091B - Tumor identification method based on deterministic particle swarm optimization and support vector machine - Google Patents


Info

Publication number: CN106971091B (granted publication of application CN201710122492.5A)
Authority: CN (China)
Prior art keywords: value, support vector machine, classification, particle swarm
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN106971091A (first publication)
Inventors: Han Fei (韩飞), Li Jialing (李佳玲), Ling Qinghua (凌青华), Zhou Conghua (周从华), Cui Baoxiang (崔宝祥), Song Yuqing (宋余庆)
Current assignee: Jiangsu University
Original assignee: Jiangsu University
Application filed by Jiangsu University; priority to CN201710122492.5A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/006 Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G16B BIOINFORMATICS, i.e. ICT specially adapted for genetic or protein-related data processing in computational molecular biology
    • G16B 25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

The invention discloses a tumor identification method based on deterministic particle swarm optimization and a support vector machine. The method preprocesses tumor gene expression profile data, performs a primary selection of informative genes on the training set with a classification information index method, and then removes redundant genes with a pairwise redundancy method to obtain a candidate gene library; the classification information index method is applied again on the training set to obtain a key gene subset. The parameters of a support vector machine are then optimized on the training set with a deterministic particle swarm optimization algorithm, after which the tumor gene expression profile data to be identified are classified. Building on the suitability of the support vector machine for small-sample data identification, the deterministic particle swarm optimization further improves the performance of the support vector machine and thereby the tumor identification accuracy.

Description

Tumor identification method based on deterministic particle swarm optimization and support vector machine
Technical Field
The invention belongs to the field of computer-aided analysis of tumor gene expression profile data, and particularly relates to a tumor identification method based on deterministic particle swarm optimization and a support vector machine.
Background
DNA microarray technology has brought enormous opportunities to biology, but the volume and complexity of the microarray data it generates pose major challenges to scholars in the field, for four main reasons. First, microarray data contain a large amount of noise and outliers: noise and abnormal values arise during experiments, and errors also occur during data processing and sample class labeling, so processing methods with strong robustness are desirable. Second, gene expression profile data are large in scale, and handling large-scale data sets is itself one of the difficulties to be solved, which makes efficient algorithms with low computational and space complexity very valuable. Third, microarray data have high dimensionality but a low sample size; the scale of classification over a gene expression profile data set grows exponentially with the number of genes, so coping with the curse of dimensionality is another difficulty. Fourth, microarray data are nonlinear and hide a large amount of practically useful information, so it is important to extend classical statistical analysis into nonlinear analysis methods capable of processing nonlinear data sets, and to use these methods to mine and infer this latent biological information.
Since Golub et al. opened the field of tumor classification from gene expression profiles in 1999, scholars have proposed many classification methods based on gene expression profiles, some of which are now in common use. Different classifiers can be designed from different classification algorithms, such as Bayes classifiers, support vector machines, and artificial neural networks, which learn from known sample class information to extract sample classification rules. Experimental results with these classifiers in tumor classification show that different classifiers have different classification capabilities on the same data set; that is, no single classifier performs well on all data sets. The SVM is well suited to high-dimensional small-sample data, achieves high classification accuracy, is robust to noise, and does not require tuning a large number of input parameters. It also scales well: the number of support vectors after training is generally small, which is very effective for gene expression profiles of ever-growing dimension. However, although the SVM is suited to small-sample data identification, its parameter selection is time-consuming and currently lacks effective theoretical support, which limits its classification performance.
Particle swarm optimization (PSO) has good global search capability. Compared with genetic algorithms, PSO requires no complex genetic operators, has few tunable parameters, and is easy to implement, so it has been widely applied in recent years. In conventional PSO, however, the randomness of the particle search produces a large amount of blind searching, the search time is long, and the search performance leaves room for improvement. Therefore, deterministic gradient-based search is introduced into the particle swarm optimization algorithm, combining random and deterministic search to improve the search speed and accuracy of the population.
Disclosure of Invention
The invention aims to optimize the parameters of a support vector machine with an improved particle swarm optimization algorithm (IGPSO), improving its search performance, and to apply the optimized classifier to tumor expression profile data so as to raise the tumor identification accuracy. Compared with traditional tumor expression profile identification methods, the proposed method effectively improves the tumor identification accuracy.
The technical scheme is as follows: a tumor identification method based on deterministic particle swarm optimization and a support vector machine, comprising screening a gene subset with a classification information index and a pairwise redundancy method, and identifying tumor gene expression profile data with a support vector machine optimized by a deterministic particle swarm optimization algorithm (IGPSO), in the following steps:
step 1, preprocessing a tumor gene expression profile data set, namely dividing the tumor gene expression profile data set into a training set and a testing set, and then carrying out normalization processing on the data set to obtain a final key gene subset;
step 2, providing a deterministic particle swarm optimization algorithm (IGPSO), and optimizing a Support Vector Machine (SVM) by using the deterministic particle swarm optimization algorithm on a training set;
step 3, on the test set, using the support vector machine SVM obtained by optimization in the step 2 to identify the tumor gene expression profile data set;
further, the step 1 comprises the following steps:
step 1.1, dividing a tumor gene expression profile data set into a training set and a testing set;
step 1.2, the classification information index of each gene in the training set is calculated according to equation (1):

d(g) = |μ1(g) − μ2(g)| / (σ1(g) + σ2(g))    (1)

where d(g) is the classification information index of gene g, μ1(g) and μ2(g) are the mean expression levels of gene g in the positive and negative sample classes respectively, and σ1(g) and σ2(g) are the standard deviations of the expression level of gene g in the two classes.
Step 1.3, select all genes whose classification information index exceeds a given threshold as the initially filtered gene set.
Step 1.4, after the preliminary filtering with the classification information index method, calculate the Pearson correlation coefficient between the expression levels of each pair of selected genes and remove one gene of every pair whose coefficient exceeds a given value, further reducing the size of the candidate gene library.
Step 1.5, apply the classification information index method once more to the candidate gene library and select all genes above a given threshold as the final key gene subset.
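As an illustration, the three-stage screening of step 1 (classification information index, pairwise redundancy removal, index again) can be sketched as follows. The thresholds (0.5 for the index, 0.8 for the correlation) and the toy data are illustrative stand-ins, not the patent's prescribed values:

```python
import numpy as np

# Sketch of the step-1 gene screening: a signal-to-noise style classification
# information index, then pairwise-redundancy removal via Pearson correlation.
def classification_info_index(X, y):
    # d(g) = |mu+ - mu-| / (sd+ + sd-) per gene; X: (samples, genes), y in {+1, -1}
    pos, neg = X[y == 1], X[y == -1]
    num = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    den = pos.std(axis=0) + neg.std(axis=0) + 1e-12   # guard against zero spread
    return num / den

def remove_redundant(X, idx, corr_thresh=0.8):
    # Keep a gene only if it is not strongly correlated with an already-kept gene
    keep = []
    for g in idx:
        if all(abs(np.corrcoef(X[:, g], X[:, h])[0, 1]) < corr_thresh for h in keep):
            keep.append(g)
    return keep

rng = np.random.default_rng(0)
y = np.repeat([1, -1], 10)                  # 20 toy samples
X = rng.normal(size=(20, 50))               # 50 toy genes
X[y == 1, :5] += 2.0                        # make the first 5 genes informative

d = classification_info_index(X, y)
primary = [g for g in np.argsort(d)[::-1] if d[g] > 0.5]   # primary selection
final = remove_redundant(X, primary)                        # candidate gene library
```

On real expression profiles the same two functions would run on the training split only, with the thresholds chosen as in the embodiment below.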
Further, the deterministic particle swarm optimization algorithm proposed in the step 2 comprises the following steps:
step 2.1, randomly initialize the position x and velocity v of each particle of the swarm within the initial range, and set the population diversity threshold σ;
step 2.2 calculate the fitness value of each particle and the gradient of the fitness function at its location;
step 2.3, for each particle, compare its fitness value with that of the best position the individual has experienced; if the current value is better, take the current position as its new personal best;
step 2.4, for each particle, compare its fitness value with that of the best position experienced by the swarm; if the current value is better, take the current position as the new swarm best;
step 2.5, when the population diversity value is larger than the set threshold, update the velocity of each particle according to equation (2), otherwise according to equation (5), and then update the particle positions;
The deterministic particle swarm optimization algorithm is divided into two stages. The first stage is the mutual attraction of particles and itself comprises two steps: first, while the population diversity value is above a suitable threshold, the particles move toward the globally best particle along the negative gradient direction of the fitness function at their positions; second, once the neighborhood of an optimum is reached, a gradual deceleration strategy is adopted, continuously reducing the particle velocity to perform a line search. These two steps are described by equations (2) and (3) respectively.
v_ij(t+1) = w*gra(i, j) + c2*rand()*(p_g − x_ij(t))    (2)
v_ij(t+1) = k*v_ij(t)    (3)
where V_i = (v_i1, v_i2, ..., v_in) is the current flight velocity of particle i, X_i = (x_i1, x_i2, ..., x_in) is the current position of particle i, w is the inertia weight, p_g is the global best position, and k is a constant in (0, 1); for the fitness function f(x), the corresponding negative gradient gra(i, j) is:
gra(i, j) = −∂f(X_i)/∂x_ij    (4)
The second stage is the mutual repulsion of the particles. When the population diversity value falls below the predetermined threshold, the particles adaptively repel one another to increase the population diversity, while still searching along the gradient direction toward other local optima. Clearly, the larger the population diversity, the smaller the dispersion speed; the smaller the diversity, the larger the dispersion speed. The particle velocity update formula is as follows:
v_ij(t+1) = −[w*gra(i, j) + c2*rand()*(p_g − x_ij(t))] / diversity(S)    (5)
where diversity is the population diversity, calculated by equation (6):

diversity(S) = (1 / (|S|·|L|)) · Σ_{i=1..|S|} sqrt( Σ_{j=1..N} (p_ij − p̄_j)² )    (6)

where S is the population, |S| is the number of particles it contains, |L| is the longest radius of the search space, N is the dimension of the problem, p_ij is the jth component of the ith particle, and p̄_j is the average of the jth component over all particles.
Step 2.6, if the termination condition is not met, go to step 2.2; otherwise output the best fitness value.
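The update loop of steps 2.1 to 2.6 can be sketched on a toy quadratic fitness whose gradient is known in closed form. The repulsion-phase scaling by the reciprocal of the diversity and the velocity clipping are our reading of the description, not the patent's exact formulas:

```python
import numpy as np

# Toy sketch of the two-phase IGPSO update: the attraction phase follows the
# negative gradient plus a pull toward the global best (eq. (2)); the repulsion
# phase disperses particles faster the lower the diversity (an assumption).
def fitness(x):
    return float(np.sum(x * x))          # f(x) = ||x||^2, minimum at the origin

def neg_gradient(x):
    return -2.0 * x                      # gra(i, j) = -df/dx_ij

def diversity(P, L):
    # Eq. (6): mean distance of particles to the swarm centroid, scaled by |S|*|L|
    centroid = P.mean(axis=0)
    return float(np.sqrt(((P - centroid) ** 2).sum(axis=1)).sum() / (len(P) * L))

rng = np.random.default_rng(1)
n_particles, dim, L = 10, 2, 10.0
P = rng.uniform(-5, 5, size=(n_particles, dim))   # positions x
V = np.zeros_like(P)                              # velocities v
w, c2, sigma_thresh = 0.5, 1.5, 1e-3              # inertia, social weight, diversity threshold
p_g = P[np.argmin([fitness(p) for p in P])].copy()
initial_best = fitness(p_g)

for _ in range(100):
    div = diversity(P, L)
    for i in range(n_particles):
        if div > sigma_thresh:   # attraction: negative-gradient step toward p_g
            V[i] = w * neg_gradient(P[i]) + c2 * rng.random() * (p_g - P[i])
        else:                    # repulsion: disperse, faster when diversity is low
            V[i] = -(w * neg_gradient(P[i]) + c2 * rng.random() * (p_g - P[i])) / max(div, 1e-9)
        V[i] = np.clip(V[i], -L, L)   # step-size control to keep the sketch stable
        P[i] = P[i] + V[i]
    cand = P[np.argmin([fitness(p) for p in P])]
    if fitness(cand) < fitness(p_g):  # keep the best-so-far swarm position
        p_g = cand.copy()
```

The best-so-far fitness is non-increasing by construction, which mirrors the role of the global best position p_g in the description.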
Further, the optimizing the support vector machine by using the deterministic particle swarm optimization algorithm in the step 3 comprises the following steps:
step 3.1: setting the C, sigma parameter search space, x, of the SVMi,min≤xi≤xi,maxWhere C is a penalty factor, σ is a kernel parameter, xiI represents the number of parameters, and is set to be 2, and a parameter value x is randomly selected on a search space at the beginning of the algorithm;
The classification rule of the SVM is given by equation (7):

f(x) = sgn( Σ_{i=1..r} α_i·y_i·K(x_i, x) + b )    (7)

for the training set T = {(x_i, y_i); x_i ∈ R^n; y_i ∈ {+1, −1}; i = 1, 2, ..., r}, where x_i are the training samples, x is the sample to be judged, b is the threshold, α_i are the Lagrange multipliers, and K(x_i, x) is the kernel function.
The optimization problem solved by the support vector machine and the classification decision function constructed from it are as follows:

max_α W(α) = Σ_{i=1..r} α_i − (1/2)·Σ_{i=1..r} Σ_{j=1..r} α_i·α_j·y_i·y_j·K(x_i, x_j)    (8)
s.t. Σ_{i=1..r} α_i·y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, ..., r    (9)
f(x) = sgn( Σ_{i=1..r} α_i·y_i·K(x, x_i) + b )    (10)

where K(x, x_i) is the kernel function, whose role is to map the feature space into a high-dimensional space, x_i are the training samples, b is the threshold, and α_i are the Lagrange multipliers. In practical applications the number of characteristic genes is small, so an RBF-based SVM classifier is adopted to classify the tumor samples; the RBF kernel is:

K(x, x_i) = exp( −‖x − x_i‖² / (2σ²) )
Step 3.2: set the particle swarm size to N, the required classification accuracy to F, the expansion factor to Ex, the local area size to w = [w1, w2], and the maximum retry number to T_max; the retry counter t and the expansion factor Ex start at 0;
Step 3.3: starting from the initialized search space p = [p1, p2], expand the search space according to the expansion factor Ex and compute the local position so that it falls within the search space p + 0.6·Ex·w;
Step 3.4: compute the classification performance function f_p corresponding to x;
Step 3.5: search for an optimal value with the IGPSO algorithm and obtain the classification performance function f_c corresponding to that optimum;
Step 3.6: if a better classification rate is found (f_p < f_c), set t = 0 and Ex = 0; otherwise set t = t + 1;
Step 3.7: if t ≥ T_max, set t = 0 and Ex = Ex + 1; in this case the search may be trapped in a local optimum, so the search range is enlarged in order to jump out of the current local area;
Step 3.8: if the classification accuracy requirement is met (f_p ≥ F), output the values of {C, σ} and the classification accuracy and end the algorithm; otherwise go to step 3.3.
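The retry-and-expand control flow of steps 3.2 to 3.8 can be sketched with the SVM cross-validation replaced by a hypothetical stand-in score surface over (C, σ); the surface, bounds, step sizes and thresholds below are illustrative only, not the patent's values:

```python
import random

# Sketch of the retry/expansion loop: accept improving candidates, count retries,
# and widen the local search area when stuck. score() is a made-up stand-in for
# the SVM classification rate, peaking near C = 8, sigma = 1.4 (hypothetical).
def score(C, sigma):
    return 1.0 / (1.0 + 0.02 * (C - 8.0) ** 2 + 0.5 * (sigma - 1.4) ** 2)

random.seed(0)
bounds = [(0.0, 16.0), (0.0, 6.0)]   # search space for C and sigma
w = [0.3, 0.1]                       # local-area sizes (expansion step lengths)
F, T_max = 0.95, 10                  # required accuracy F, maximum retry number T_max
t, Ex = 0, 0                         # retry counter and expansion factor
C, sigma = 8.5, 2.0                  # initial point p = [p1, p2]
start_score = score(C, sigma)
f_p = start_score

for _ in range(500):
    if f_p >= F:                     # step 3.8: accuracy requirement met
        break
    # step 3.3: sample a candidate inside the (expanded) local area around p
    span = [0.6 * (Ex + 1) * wi for wi in w]
    cand = [min(max(v + random.uniform(-s, s), lo), hi)
            for v, s, (lo, hi) in zip((C, sigma), span, bounds)]
    f_c = score(*cand)               # steps 3.4/3.5: evaluate (IGPSO stand-in)
    if f_p < f_c:                    # step 3.6: improvement found, reset counters
        (C, sigma), f_p, t, Ex = cand, f_c, 0, 0
    else:
        t += 1
    if t >= T_max:                   # step 3.7: likely stuck, widen the search area
        t, Ex = 0, Ex + 1
```

In the patent the candidate evaluation is the IGPSO-driven SVM training; here only the surrounding retry/expansion bookkeeping is shown.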
Beneficial effects: the high-dimensional, small-sample tumor gene expression profile data set contains much useless data, while the support vector machine generalizes well and is widely used for data classification. However, the classification performance of the support vector machine depends on parameter selection, a problem that has never been solved well and that greatly limits the application of the SVM. The deterministic-search particle swarm optimization algorithm performs local search by means of gradient information: when a particle reaches the neighborhood of an optimum, its velocity is continuously reduced so that the step length of the line search does not become too large. Globally, the algorithm applies the diversity measure and the attraction-repulsion principle, adaptively repelling particles to restore diversity when a local optimum is reached, and finally converges quickly to a high-precision solution.
Drawings
FIG. 1 is a block diagram of the present invention;
FIG. 2 is a flow chart of a deterministic particle swarm optimization algorithm in the present invention;
Detailed Description
A tumor identification method based on deterministic particle swarm optimization and a support vector machine comprises screening genes with a classification information index and a pairwise redundancy method, and performing tumor gene identification with a support vector machine optimized by a deterministic particle swarm optimization algorithm (IGPSO), in the following steps:
step 1, preprocessing a tumor gene expression profile data set, namely dividing the tumor gene expression profile data set into a training set and a testing set, and then carrying out normalization processing on the data set to obtain a final key gene subset;
step 2, providing a deterministic particle swarm optimization algorithm (IGPSO);
step 3, optimizing a Support Vector Machine (SVM) by using a deterministic particle swarm optimization algorithm on a training set;
step 4, on the test set, using the support vector machine SVM obtained by optimization in the step 3 to identify a tumor gene expression profile data set;
further, the step 1 comprises the following steps:
step 1.1, dividing a tumor gene expression profile data set into a training set and a testing set;
step 1.2, the classification information index of each gene in the training set is calculated according to equation (1):

d(g) = |μ1(g) − μ2(g)| / (σ1(g) + σ2(g))    (1)

where d(g) is the classification information index of gene g, μ1(g) and μ2(g) are the mean expression levels of gene g in the positive and negative sample classes respectively, and σ1(g) and σ2(g) are the standard deviations of the expression level of gene g in the two classes.
Step 1.3, select all genes whose classification information index exceeds a given threshold as the initially filtered gene set.
Step 1.4, after the preliminary filtering with the classification information index method, calculate the Pearson correlation coefficient between the expression levels of each pair of selected genes and remove one gene of every pair whose coefficient exceeds a given value, further reducing the size of the candidate gene library.
Step 1.5, apply the classification information index method once more to the candidate gene library and select all genes above a given threshold as the final key gene subset.
Further, the step 2 comprises the following steps:
step 2.1, randomly initialize the position x and velocity v of each particle of the swarm within the initial range, and set the population diversity threshold σ;
step 2.2 calculate the fitness value of each particle and the gradient of the fitness function at its location;
step 2.3, for each particle, compare its fitness value with that of the best position the individual has experienced; if the current value is better, take the current position as its new personal best;
step 2.4, for each particle, compare its fitness value with that of the best position experienced by the swarm; if the current value is better, take the current position as the new swarm best;
step 2.5, when the population diversity value is larger than the set threshold, update the velocity of each particle according to equation (2), otherwise according to equation (5), and then update the particle positions;
The deterministic particle swarm optimization algorithm is divided into two stages. The first stage is the mutual attraction of particles and itself comprises two steps: first, while the population diversity value is above a suitable threshold, the particles move toward the globally best particle along the negative gradient direction of the fitness function at their positions; second, once the neighborhood of an optimum is reached, a gradual deceleration strategy is adopted, continuously reducing the particle velocity to perform a line search. These two steps are described by equations (2) and (3) respectively.
v_ij(t+1) = w*gra(i, j) + c2*rand()*(p_g − x_ij(t))    (2)
v_ij(t+1) = k*v_ij(t)    (3)
where V_i = (v_i1, v_i2, ..., v_in) is the current flight velocity of particle i, X_i = (x_i1, x_i2, ..., x_in) is the current position of particle i, w is the inertia weight, p_g is the global best position, and k is a constant in (0, 1); for the fitness function f(x), the corresponding negative gradient gra(i, j) is:
gra(i, j) = −∂f(X_i)/∂x_ij    (4)
The second stage is the mutual repulsion of the particles. When the population diversity value falls below the predetermined threshold, the particles adaptively repel one another to increase the population diversity, while still searching along the gradient direction toward other local optima. Clearly, the larger the population diversity, the smaller the dispersion speed; the smaller the diversity, the larger the dispersion speed. The particle velocity update formula is as follows:
v_ij(t+1) = −[w*gra(i, j) + c2*rand()*(p_g − x_ij(t))] / diversity(S)    (5)
where diversity is the population diversity, calculated by equation (6):

diversity(S) = (1 / (|S|·|L|)) · Σ_{i=1..|S|} sqrt( Σ_{j=1..N} (p_ij − p̄_j)² )    (6)

where S is the population, |S| is the number of particles it contains, |L| is the longest radius of the search space, N is the dimension of the problem, p_ij is the jth component of the ith particle, and p̄_j is the average of the jth component over all particles.
Step 2.6, if the termination condition is not met, go to step 2.2; otherwise output the best fitness value.
Further, the step 3 comprises the following steps:
Step 3.1: set the search space of the SVM parameters C and σ as x_{i,min} ≤ x_i ≤ x_{i,max}, where C is the penalty factor, σ is the kernel parameter, x_i is the ith parameter, and the number of parameters is set to 2; at the start of the algorithm, a parameter value x is randomly selected in the search space;
The classification rule of the SVM is given by equation (7):

f(x) = sgn( Σ_{i=1..r} α_i·y_i·K(x_i, x) + b )    (7)

for the training set T = {(x_i, y_i); x_i ∈ R^n; y_i ∈ {+1, −1}; i = 1, 2, ..., r}, where x_i are the training samples, x is the sample to be judged, b is the threshold, α_i are the Lagrange multipliers, and K(x_i, x) is the kernel function.
The optimization problem solved by the support vector machine and the classification decision function constructed from it are as follows:

max_α W(α) = Σ_{i=1..r} α_i − (1/2)·Σ_{i=1..r} Σ_{j=1..r} α_i·α_j·y_i·y_j·K(x_i, x_j)    (8)
s.t. Σ_{i=1..r} α_i·y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, ..., r    (9)
f(x) = sgn( Σ_{i=1..r} α_i·y_i·K(x, x_i) + b )    (10)

where K(x, x_i) is the kernel function, whose role is to map the feature space into a high-dimensional space, x_i are the training samples, b is the threshold, and α_i are the Lagrange multipliers. In practical applications the number of characteristic genes is small, so an RBF-based SVM classifier is adopted to classify the tumor samples; the RBF kernel is:

K(x, x_i) = exp( −‖x − x_i‖² / (2σ²) )
Step 3.2: set the particle swarm size to N, the required classification accuracy to F, the expansion factor to Ex, the local area size to w = [w1, w2], and the maximum retry number to T_max; the retry counter t and the expansion factor Ex start at 0;
Step 3.3: starting from the initialized search space p = [p1, p2], expand the search space according to the expansion factor Ex and compute the local position so that it falls within the search space p + 0.6·Ex·w;
Step 3.4: compute the classification performance function f_p corresponding to x;
Step 3.5: search for an optimal value with the IGPSO algorithm and obtain the classification performance function f_c corresponding to that optimum;
Step 3.6: if a better classification rate is found (f_p < f_c), set t = 0 and Ex = 0; otherwise set t = t + 1;
Step 3.7: if t ≥ T_max, set t = 0 and Ex = Ex + 1; in this case the search may be trapped in a local optimum, so the search range is enlarged in order to jump out of the current local area;
Step 3.8: if the classification accuracy requirement is met (f_p ≥ F), output the values of {C, σ} and the classification accuracy and end the algorithm; otherwise go to step 3.3.
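The RBF decision rule used in these steps, f(x) = sgn(Σ α_i·y_i·K(x_i, x) + b) with K(x, x_i) = exp(−‖x − x_i‖²/(2σ²)), can be evaluated by hand on made-up support vectors and multipliers; every numeric value below is hypothetical, chosen only to make the two phases of the sign visible:

```python
import math

# Hand evaluation of the RBF-SVM decision rule on fabricated support vectors.
def rbf(x, xi, sigma):
    # K(x, xi) = exp(-||x - xi||^2 / (2 sigma^2))
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2 * sigma ** 2))

def decide(x, svs, alphas, ys, b, sigma):
    # f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b )
    s = sum(a * y * rbf(x, xi, sigma) for a, y, xi in zip(alphas, ys, svs)) + b
    return 1 if s >= 0 else -1

svs    = [(0.0, 0.0), (2.0, 2.0)]   # hypothetical support vectors
alphas = [1.0, 1.0]                 # hypothetical Lagrange multipliers
ys     = [1, -1]                    # their class labels
b, sigma = 0.0, 1.0                 # hypothetical threshold and kernel width

label_near_pos = decide((0.1, -0.1), svs, alphas, ys, b, sigma)  # near the +1 vector
label_near_neg = decide((1.9, 2.1), svs, alphas, ys, b, sigma)   # near the -1 vector
```

A query point near a support vector is dominated by that vector's kernel term, so the sign of the sum follows its label, which is the geometric intuition behind equation (7).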
The implementation of the invention is briefly illustrated below, taking tumor gene expression profile data as an example. This example uses a colon cancer tumor expression profile data set comprising 62 samples, each represented by the expression level values of 2000 genes; the 62 samples consist of 22 normal samples and 40 tumor samples. On this data set, the specific implementation steps of the invention are as follows:
As shown in fig. 1, the tumor identification method based on deterministic particle swarm optimization and a support vector machine comprises screening genes with a classification information index and a pairwise redundancy method, and performing tumor gene identification with a support vector machine optimized by deterministic particle swarm optimization (IGPSO), in the following steps:
(1) The data set is divided into a training set and a testing set, and the improved signal-to-noise ratio formula of the classification information index method is computed for each gene on the training set. The larger a gene's information index, the more sample classification information it carries and, correspondingly, the stronger its ability to discriminate the sample classes. Table 1 shows the classification information distribution of the colon cancer data set; the 173 genes with an information index greater than 0.5 were selected as the gene subset analyzed below.
(2) Redundant genes were excluded by calculating the Pearson correlation coefficient between the expression levels of each pair of genes. The colon cancer data were analyzed with the 173 genes selected above; pairwise redundancy calculation and comparison finally yielded 59 genes.
(3) The resulting gene subset is evaluated again with the improved signal-to-noise ratio formula of the classification information index, and the 11 genes of the colon cancer data set with the largest information indexes are selected as the final key gene subset. Table 2 shows the key gene subset screened for colon cancer.
(4) The two parameters of the support vector machine are initialized with the search range {0 < C < 16, 0 < σ < 6}; the maximum retry number is set to 10, the expansion step length of C to 0.3, and that of σ to 0.1. The IGPSO algorithm searches along the gradient direction of the parameters {C, σ} within a local area according to the classification rate (the performance function) of the particles; if the maximum retry number is reached without a better classification rate being found, the search range is expanded.
(5) On the test set, classification is performed with the SVM optimized by the IGPSO algorithm. Table 3 shows the classification of the colon cancer samples.
Table 1 shows the colon cancer classification information index distribution.
TABLE 1 Colon cancer Classification information index distribution in the present invention
Gene information index | Number of genes | Share of the 2000 genes (%)
0.0 - 0.3              | 1524            | 76.2
0.3 - 0.5              | 303             | 15.15
0.5 - 1.897            | 173             | 8.65
Table 2 shows the subset of colon cancer genes to be classified
TABLE 2 Colon cancer gene sets of the present invention
Table 3 shows the classification results on the colon cancer samples. When the penalty factor C is small, the classification error rate is high; as C increases, the error rate drops sharply, i.e. the classification performance rises quickly. As C continues to increase, the change in performance becomes insignificant, and once C exceeds a certain value the performance no longer varies with C; that is, the SVM is insensitive to C over a large range. In the experiments, the classification accuracy is high for C in the range (6, 15), i.e. the SVM is insensitive to C in this region. The experiments also show that, in the state of best classification effect, appropriately reducing (correcting) the value of σ noticeably improves the classification accuracy; the final results show that σ between 0.9 and 1.88 gives a good classification effect.
TABLE 3 Classification of the invention on Colon cancer samples
[Table 3 is provided as an image (BDA0001237452020000102) in the original publication.]
Table 4 shows a comparison of the proposed method of the present invention with the SVM related method.
TABLE 4 comparison of the method of the invention with methods related to SVM
[Table 4 is provided as images (BDA0001237452020000103, BDA0001237452020000111) in the original publication.]
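The C-insensitivity behaviour described above for Table 3 can be reproduced in spirit on synthetic data. The sketch below (scikit-learn; the dataset, the σ value, and the gamma = 1/(2σ²) mapping are illustrative assumptions) sweeps C and reports cross-validated accuracy, which typically rises sharply for small C and then flattens:

```python
# Hedged sketch of the C-sensitivity observation: accuracy improves as C
# grows, then plateaus. Synthetic data stands in for the colon-cancer set.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, flip_y=0.05,
                           random_state=1)
sigma = 1.2
accs = {}
for C in (0.01, 0.1, 1.0, 8.0, 15.0):
    clf = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
    accs[C] = cross_val_score(clf, X, y, cv=5).mean()
print(accs)  # accuracy per C value
```

The plateau for larger C is what makes a coarse search range such as (6, 15) workable before the finer σ correction.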
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (3)

1. A tumor identification method based on deterministic particle swarm optimization and a support vector machine is characterized by comprising the following steps:
step 1, preprocessing a tumor gene expression profile dataset: dividing it into a training set and a test set, then normalizing the data to obtain the final key gene subset; step 2, proposing a deterministic particle swarm optimization algorithm (IGPSO) and using it to optimize a support vector machine (SVM) on the training set; step 3, on the test set, identifying the tumor gene expression profile data with the SVM optimized in step 2;
the step 2 of optimizing the support vector machine SVM by using the deterministic particle swarm optimization algorithm comprises the following steps:
step 3.1: setting the search space of the two SVM parameters {C, σ}, m_(i,min) ≤ m_i ≤ m_(i,max), where C is the penalty factor, σ is the kernel parameter, m_i denotes the value of the ith parameter, and the number of parameters is set to 2; at the start of the algorithm a parameter value m is randomly selected from the search space; the classification rule equation of the SVM is as follows (7):
f(x) = sgn( sum_{i=1..r} α_i*y_i*K(x_i, x) + b )    (7)
for the training set T = {(x_i, y_i); x_i ∈ R^n; y_i = ±1; i = 1, 2, ..., r}, where x_i are the training samples, x is the sample to be judged, b is the threshold, α_i is the Lagrange multiplier, and K(x_i, x) is the kernel function;
the optimization problem solved by the support vector machine and the constructed classification decision function are as follows:
max_α sum_{i=1..r} α_i − (1/2)*sum_{i=1..r} sum_{j=1..r} α_i*α_j*y_i*y_j*K(x_i, x_j)
s.t. sum_{i=1..r} α_i*y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, ..., r
f(x) = sgn( sum_{i=1..r} α_i*y_i*K(x, x_i) + b )
where K(x, x_i) is the kernel function, x_i are the training samples, b is the threshold, and α_i is the Lagrange multiplier; the kernel function maps the feature space into a high-dimensional space; in practical application the number of characteristic genes is small, so an RBF-based SVM classifier is adopted to classify the tumor samples, and the RBF kernel is expressed as:
K(x, x_i) = exp(−‖x − x_i‖² / (2σ²))
step 3.2: setting the particle swarm size to N, the classification accuracy requirement to F, the expansion factor to Ex, the local region size to w = [w1, w2], and the maximum retry number to T_max; the retry number t and the expansion factor are initialized to 0;
step 3.3: the algorithm starts from the initialized search space p = [p1, p2], expands the search space according to the expansion factor Ex, and calculates the local position according to steps 3.4 to 3.7 so that it falls within the search space p + 0.6·Ex·w;
step 3.4: calculating the classification performance function f_p corresponding to x;
step 3.5: searching for the optimal value with the IGPSO algorithm and obtaining the classification performance function f_c corresponding to the optimal value;
step 3.6: if a better classification rate is found, i.e., f_p < f_c, setting t = 0 and Ex = 0; otherwise t = t + 1;
step 3.7: if t ≥ T_max, setting t = 0 and Ex = Ex + 1; the search may be trapped in a local optimum, so the search range is increased to jump out of the current local region;
step 3.8: if the classification accuracy requirement is met, i.e., f_p ≤ F, outputting the value of {C, σ} and the classification accuracy and ending the algorithm; otherwise going to step 3.3;
the two parameters of the support vector machine are initialized with the search range {0 < C < 16, 0 < σ < 6}; the maximum retry number is set to 10, the expansion step length of C is 0.3, and the expansion step length of σ is 0.1; the two parameters of the support vector machine are optimized by the IGPSO algorithm in combination with the final key gene subset; the IGPSO algorithm searches along the gradient direction of the parameters {C, σ} according to the classification rate of the performance function in a local region, and if the maximum retry number is reached without finding a better classification rate, the search range is expanded.
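The outer parameter-search loop of steps 3.2 to 3.8 can be sketched as follows. This is a minimal illustration under stated assumptions: the objective is a toy stand-in for "1 − cross-validated accuracy" (lower is better), and a random local search stands in for the inner IGPSO optimizer; all names here are illustrative, not the patent's code.

```python
# Hedged sketch of steps 3.2-3.8: keep a retry counter t and an expansion
# factor Ex; reset both when a better value is found, otherwise widen the
# local search region after T_max failed retries.
import random

def objective(C, sigma):
    # toy stand-in for "1 - classification accuracy", minimised near (8, 1.2)
    return (C - 8.0) ** 2 / 100.0 + (sigma - 1.2) ** 2

def outer_search(T_max=10, F=0.05, w=(0.3, 0.1), iters=2000, seed=0):
    rng = random.Random(seed)
    p = [rng.uniform(0.0, 16.0), rng.uniform(0.0, 6.0)]  # initial {C, sigma}
    f_p = objective(*p)
    t, Ex = 0, 0                       # retry counter, expansion factor
    for _ in range(iters):
        # candidate in the local region, widened by the expansion factor
        span = [wi * (1 + Ex) for wi in w]
        cand = [max(1e-6, p[i] + rng.uniform(-span[i], span[i]))
                for i in range(2)]
        f_c = objective(*cand)
        if f_c < f_p:                  # better value found: reset t and Ex
            p, f_p, t, Ex = cand, f_c, 0, 0
        else:
            t += 1
        if t >= T_max:                 # possibly trapped: enlarge the region
            t, Ex = 0, Ex + 1
        if f_p <= F:                   # accuracy requirement met
            break
    return p, f_p

best, f_best = outer_search()
print(best, f_best)
```

The reset-on-improvement / expand-on-stagnation pattern is the claim's mechanism for escaping local optima without restarting the whole search.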
2. The method for tumor identification based on deterministic particle swarm optimization and support vector machine according to claim 1, wherein the step 1 comprises the following steps:
step 1.1, dividing a tumor gene expression profile data set into a training set and a testing set;
step 1.2: according to formula (1), calculating the classification information index of each gene in the training set:
d(g) = |μ⁺(g) − μ⁻(g)| / (σ⁺(g) + σ⁻(g))    (1)
where d(g) is the classification information index of gene g, μ⁺(g) and μ⁻(g) are the mean expression levels of gene g in the positive and negative sample classes respectively, and σ⁺(g) and σ⁻(g) are the standard deviations of the expression level of gene g in the positive and negative sample classes respectively;
step 1.3: selecting all genes whose classification information index exceeds a given threshold as the preliminarily filtered gene set;
step 1.4: after the preliminary filtering with the classification information index method, calculating the Pearson correlation coefficient between the expression levels of each pair of genes and selecting the gene set according to a given value, reducing the size of the candidate gene library again;
step 1.5: to further narrow the key gene set, applying the classification information index method again to the candidate gene library and selecting all genes above a given threshold as the final key gene subset.
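The preselection pipeline of claim 2 can be sketched as follows. The expression matrix, the thresholds, and the interpretation of the Pearson step (dropping the lower-ranked gene of any highly correlated pair, which is one common reading of redundancy reduction) are illustrative assumptions, not the patent's exact procedure.

```python
# Hedged sketch of claim 2: rank genes by the classification information
# index d(g) = |mu+ - mu-| / (sd+ + sd-), then prune correlated genes.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(62, 200))   # toy expression matrix: samples x genes
y = rng.integers(0, 2, size=62)  # 1 = positive class, 0 = negative class

def info_index(X, y):
    # per-gene (per-column) classification information index, formula (1)
    mu_p, mu_n = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    sd_p, sd_n = X[y == 1].std(axis=0), X[y == 0].std(axis=0)
    return np.abs(mu_p - mu_n) / (sd_p + sd_n)

d = info_index(X, y)
keep = np.where(d > 0.1)[0]      # step 1.3: threshold filter (toy threshold)

# step 1.4 (one reading): drop the lower-ranked gene of any pair whose
# expression levels have |Pearson r| above a cutoff
corr = np.corrcoef(X[:, keep], rowvar=False)
redundant = set()
for a in range(len(keep)):
    for b in range(a + 1, len(keep)):
        if abs(corr[a, b]) > 0.9:
            redundant.add(keep[b] if d[keep[a]] >= d[keep[b]] else keep[a])
final = [g for g in keep if g not in redundant]
print(len(final))
```

Step 1.5 would simply re-apply `info_index` on the surviving columns with a higher threshold.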
3. The method for tumor identification based on deterministic particle swarm optimization and support vector machine according to claim 1, wherein the step 2 of proposing the deterministic particle swarm optimization algorithm IGPSO comprises the following steps:
step 2.1: randomly initializing the position X and velocity v of the particle swarm within the initial range, together with a population diversity threshold for each fitness function;
step 2.2: calculating the fitness value of each particle and the gradient of the fitness function at its position;
step 2.3: for each particle, comparing its fitness value with that of the best position it has experienced; if better, taking the current position as the particle's best position;
step 2.4: for each particle, comparing its fitness value with that of the best position experienced by the swarm; if better, taking the current position as the swarm's best position;
step 2.5: when the population diversity value is larger than the set threshold, updating the velocity of each particle according to formula (2), otherwise according to formula (5), and then updating the particle positions;
the deterministic particle swarm optimization algorithm is divided into two stages; the first stage is the mutual attraction of particles and consists of two steps: first, when the population diversity value is larger than a suitable threshold, the particles gather toward the globally optimal particle along the direction of the negative gradient of the fitness function at their positions; second, once the neighborhood of an optimal point is reached, a linear search is performed with a gradually descending strategy that continuously reduces the particle velocity; the two steps of this stage are described by formula (2) and formula (3) respectively:
v_ij(t+1) = w*gra(i, j) + c2*rand()*(p_g − x_ij(t))    (2)
v_ij(t+1) = k*v_ij(t)    (3)
where V_i = (v_i1, v_i2, ..., v_in) is the current flight velocity of particle i, X_i = (x_i1, x_i2, ..., x_in) is the current position of particle i, w is the inertia weight, p_g is the global best position, and k is a constant in (0, 1); for the fitness function f(x), the corresponding negative gradient gra(i, j) is as follows:
gra(i, j) = −∂f(X_i)/∂x_ij    (4)
the second stage is the mutual repulsion of particles; when the population diversity value is smaller than the preset threshold, the particles are adaptively repelled to improve population diversity while still searching along the gradient direction and approaching other local optima; the larger the population diversity, the smaller the dispersion velocity, and the smaller the population diversity, the larger the dispersion velocity; the particle velocity update formula is as follows:
[Formula (5), the repulsion-phase velocity update, is provided as an image (FDA0002479102700000041) in the original publication.]
where diversity is the population diversity calculated by formula (6):
diversity(S) = (1/(|S|·|L|)) · sum_{i=1..|S|} sqrt( sum_{j=1..N} (p_ij − p̄_j)² )    (6)
where S is the population, |S| is the number of particles in the population, |L| is the longest radius of the search space, N is the dimension of the problem, p_ij is the jth component of the ith particle, and p̄_j is the mean of the jth components of all particles;
step 2.6: if the termination condition is not met, going to step 2.2; otherwise outputting the fitness value.
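The two-phase velocity rule of claim 3 can be sketched as follows. When diversity exceeds the threshold, particles follow the negative gradient toward the global best per formula (2); because formula (5) appears only as an image in the original, the repulsion step below (pushing away from the centroid, faster when diversity is low) is an illustrative stand-in, and all other names are assumptions for the sketch.

```python
# Hedged sketch of the IGPSO two-phase velocity update on a toy problem.
import numpy as np

rng = np.random.default_rng(0)

def diversity(P, L):
    # formula (6): summed distance of particles from the swarm centroid,
    # normalised by swarm size |S| and the longest search-space radius |L|
    centroid = P.mean(axis=0)
    return np.sqrt(((P - centroid) ** 2).sum(axis=1)).sum() / (len(P) * L)

def velocity_update(P, grad, p_g, div, threshold, w=0.7, c2=1.5):
    if div > threshold:
        # attraction phase, formula (2): v = w*gra + c2*rand()*(p_g - x),
        # where gra is the negative gradient of the fitness function
        return w * (-grad) + c2 * rng.random(P.shape) * (p_g - P)
    # repulsion phase: illustrative stand-in for formula (5); disperse
    # away from the centroid, faster when diversity is low
    centroid = P.mean(axis=0)
    return 0.01 * (P - centroid) / (div + 1e-9)

# toy swarm minimising f(x) = x1^2 + x2^2 (gradient 2x) on [-5, 5]^2
P = rng.uniform(-5.0, 5.0, size=(20, 2))
p_g = P[np.argmin((P ** 2).sum(axis=1))]      # global best position
div = diversity(P, L=np.sqrt(2) * 10.0)       # |L|: diagonal of the box
V = velocity_update(P, grad=2.0 * P, p_g=p_g, div=div, threshold=0.05)
P = P + V
print(P.shape)
```

Iterating this update while recomputing `div` each step alternates the swarm between exploitation (attraction) and exploration (repulsion), which is the claimed mechanism for avoiding premature convergence.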
CN201710122492.5A 2017-03-03 2017-03-03 Tumor identification method based on deterministic particle swarm optimization and support vector machine Active CN106971091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710122492.5A CN106971091B (en) 2017-03-03 2017-03-03 Tumor identification method based on deterministic particle swarm optimization and support vector machine


Publications (2)

Publication Number Publication Date
CN106971091A CN106971091A (en) 2017-07-21
CN106971091B true CN106971091B (en) 2020-08-28

Family

ID=59328372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710122492.5A Active CN106971091B (en) 2017-03-03 2017-03-03 Tumor identification method based on deterministic particle swarm optimization and support vector machine

Country Status (1)

Country Link
CN (1) CN106971091B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107643948B (en) * 2017-09-30 2020-06-02 Oppo广东移动通信有限公司 Application program control method, device, medium and electronic equipment
CN108629158A (en) * 2018-05-14 2018-10-09 浙江大学 A kind of intelligent Lung Cancer cancer cell detector
CN108875305A (en) * 2018-05-14 2018-11-23 浙江大学 A kind of leukaemia cancer cell detector of colony intelligence optimizing
CN109947941A (en) * 2019-03-05 2019-06-28 永大电梯设备(中国)有限公司 A kind of method and system based on elevator customer service text classification
CN110060740A (en) * 2019-04-16 2019-07-26 中国科学院深圳先进技术研究院 A kind of nonredundancy gene set clustering method, system and electronic equipment
CN111383710A (en) * 2020-03-13 2020-07-07 闽江学院 Gene splice site recognition model construction method based on particle swarm optimization gemini support vector machine
CN111582370B (en) * 2020-05-08 2023-04-07 重庆工贸职业技术学院 Brain metastasis tumor prognostic index reduction and classification method based on rough set optimization
CN113707216A (en) * 2021-08-05 2021-11-26 北京科技大学 Infiltration immune cell proportion counting method
CN113808659B (en) * 2021-08-26 2023-06-13 四川大学 Feedback phase regulation and control method based on gene gradient particle swarm algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258244A (en) * 2013-04-28 2013-08-21 西北师范大学 Method for predicting inhibiting concentration of pyridazine HCV NS5B polymerase inhibitor based on particle swarm optimization support vector machine
CN105372202A (en) * 2015-10-27 2016-03-02 九江学院 Genetically modified cotton variety recognition method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
An improved particle swarm optimization algorithm based on gradient search; Han Fei et al.; Journal of Nanjing University (Natural Science); March 2013; Vol. 49, No. 2; pp. 196-201 *
An improved SVM training algorithm based on particle swarm optimization; Tong Yan et al.; Computer Engineering and Applications; December 2008; Vol. 44, No. 20; pp. 138-141 *
Research on tumor gene identification under support vector machine classification models; Hao Aili; China Master's Theses Full-text Database, Information Science and Technology; August 15, 2013; I138-528 *

Also Published As

Publication number Publication date
CN106971091A (en) 2017-07-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant