CN106971091B - Tumor identification method based on deterministic particle swarm optimization and support vector machine - Google Patents


Info

Publication number: CN106971091B (granted publication of application CN201710122492.5A)
Authority: CN (China)
Prior art keywords: value, support vector machine, classification, particle swarm
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN106971091A (first publication)
Inventors: Han Fei (韩飞), Li Jialing (李佳玲), Ling Qinghua (凌青华), Zhou Conghua (周从华), Cui Baoxiang (崔宝祥), Song Yuqing (宋余庆)
Current assignee: Jiangsu University
Original assignee: Jiangsu University
Application filed by Jiangsu University; priority to CN201710122492.5A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/006 Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G16B BIOINFORMATICS, i.e. ICT specially adapted for genetic or protein-related data processing in computational molecular biology
    • G16B 25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

The invention discloses a tumor identification method based on deterministic particle swarm optimization and a support vector machine. The method preprocesses tumor gene expression profile data, performs a primary selection of informative genes on the training set with a classification information index method, and then removes redundant genes with a pairwise redundancy method to obtain a candidate gene library; the classification information index method is applied again on the training set to obtain a key gene subset. The parameters of a support vector machine are then optimized on the training set with a deterministic particle swarm optimization algorithm, after which the tumor gene expression profile data to be identified are classified. Building on the suitability of the support vector machine for small-sample data identification, the deterministic particle swarm optimization further improves the performance of the support vector machine and thereby the tumor identification accuracy.

Description

Tumor identification method based on deterministic particle swarm optimization and support vector machine
Technical Field
The invention belongs to the field of computer-aided analysis of tumor gene expression profile data, and particularly relates to a tumor identification method based on deterministic particle swarm optimization and a support vector machine.
Background
DNA microarray technology has brought enormous opportunities to biology, but the volume and complexity of the microarray data it generates pose major challenges to scholars in the field, for four main reasons. First, microarray data contain a large amount of noise and outliers: noise and abnormal values arise during experiments, and errors also occur during data processing and sample class labeling, so processing methods with strong robustness are desirable. Second, gene expression profile data are large in scale, and handling large-scale data sets is itself one of the difficulties to be solved, which makes efficient algorithms with low computational and space complexity very valuable. Third, microarray data have high dimensionality but a low sample size; the scale of classification over a gene expression profile data set grows exponentially with the number of genes, so coping with the curse of dimensionality is another difficulty. Fourth, microarray data are nonlinear and hide a large amount of practically useful information, so it is important to extend classical statistical analysis into nonlinear analysis methods capable of processing nonlinear data sets, and to use these methods to mine and infer this latent biological information.
Since Golub et al. opened the field of tumor classification from gene expression profiles in 1999, scholars have proposed many classification methods based on gene expression profiles, some of which are now in common use. Different classifiers can be designed from different classification algorithms, such as Bayes classifiers, support vector machines, and artificial neural networks, which learn from known sample class information to extract sample classification rules. Experimental results with these classifiers in tumor classification show that different classifiers have different classification capabilities on the same data set; that is, no single classifier performs well on all data sets. The SVM is well suited to high-dimensional small-sample data, achieves high classification accuracy, is robust to noise, and does not require tuning a large number of input parameters. It also scales well: the number of support vectors after training is generally small, which is very effective for gene expression profiles of ever-growing dimension. However, although the SVM is suited to small-sample data identification, its parameter selection is time-consuming and currently lacks effective theoretical support, which limits its classification performance.
Particle swarm optimization (PSO) has good global search capability. Compared with genetic algorithms, PSO requires no complex genetic operators, has few tunable parameters, and is easy to implement, so it has been widely applied in recent years. In conventional PSO, however, the randomness of the particle search produces a large amount of blind searching, the search time is long, and the search performance leaves room for improvement. Therefore, deterministic gradient-based search is introduced into the particle swarm optimization algorithm, combining random and deterministic search to improve the search speed and accuracy of the population.
Disclosure of Invention
The invention aims to optimize the parameters of a support vector machine with an improved particle swarm optimization algorithm (IGPSO), improving its search performance, and to apply the optimized classifier to tumor expression profile data so as to raise the tumor identification accuracy. Compared with traditional tumor expression profile identification methods, the proposed method effectively improves the tumor identification accuracy.
The technical scheme is as follows: a tumor identification method based on deterministic particle swarm optimization and a support vector machine, comprising screening a gene subset with a classification information index and a pairwise redundancy method, and identifying tumor gene expression profile data with a support vector machine optimized by a deterministic particle swarm optimization algorithm (IGPSO), in the following steps:
step 1, preprocessing a tumor gene expression profile data set, namely dividing the tumor gene expression profile data set into a training set and a testing set, and then carrying out normalization processing on the data set to obtain a final key gene subset;
step 2, providing a deterministic particle swarm optimization algorithm (IGPSO), and optimizing a Support Vector Machine (SVM) by using the deterministic particle swarm optimization algorithm on a training set;
step 3, on the test set, using the support vector machine SVM obtained by optimization in the step 2 to identify the tumor gene expression profile data set;
further, the step 1 comprises the following steps:
step 1.1, dividing a tumor gene expression profile data set into a training set and a testing set;
step 1.2, the classification information index of each gene in the training set is calculated according to equation (1):

d(g) = |μ1(g) − μ2(g)| / (σ1(g) + σ2(g))    (1)

where d(g) is the classification information index of gene g, μ1(g) and μ2(g) are the mean expression levels of gene g in the positive and negative sample classes respectively, and σ1(g) and σ2(g) are the standard deviations of the expression level of gene g in the two classes.
Step 1.3, select all genes whose classification information index exceeds a given threshold as the initially filtered gene set.
Step 1.4, after the preliminary filtering with the classification information index method, calculate the Pearson correlation coefficient between the expression levels of each pair of selected genes and remove one gene of every pair whose coefficient exceeds a given value, further reducing the size of the candidate gene library.
Step 1.5, apply the classification information index method once more to the candidate gene library and select all genes above a given threshold as the final key gene subset.
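As an illustration, the three-stage screening of step 1 (classification information index, pairwise redundancy removal, index again) can be sketched as follows. The thresholds (0.5 for the index, 0.8 for the correlation) and the toy data are illustrative stand-ins, not the patent's prescribed values:

```python
import numpy as np

# Sketch of the step-1 gene screening: a signal-to-noise style classification
# information index, then pairwise-redundancy removal via Pearson correlation.
def classification_info_index(X, y):
    # d(g) = |mu+ - mu-| / (sd+ + sd-) per gene; X: (samples, genes), y in {+1, -1}
    pos, neg = X[y == 1], X[y == -1]
    num = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    den = pos.std(axis=0) + neg.std(axis=0) + 1e-12   # guard against zero spread
    return num / den

def remove_redundant(X, idx, corr_thresh=0.8):
    # Keep a gene only if it is not strongly correlated with an already-kept gene
    keep = []
    for g in idx:
        if all(abs(np.corrcoef(X[:, g], X[:, h])[0, 1]) < corr_thresh for h in keep):
            keep.append(g)
    return keep

rng = np.random.default_rng(0)
y = np.repeat([1, -1], 10)                  # 20 toy samples
X = rng.normal(size=(20, 50))               # 50 toy genes
X[y == 1, :5] += 2.0                        # make the first 5 genes informative

d = classification_info_index(X, y)
primary = [g for g in np.argsort(d)[::-1] if d[g] > 0.5]   # primary selection
final = remove_redundant(X, primary)                        # candidate gene library
```

On real expression profiles the same two functions would run on the training split only, with the thresholds chosen as in the embodiment below.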
Further, the deterministic particle swarm optimization algorithm proposed in the step 2 comprises the following steps:
step 2.1, randomly initialize the position x and velocity v of each particle of the swarm within the initial range, and set the population diversity threshold σ;
step 2.2 calculate the fitness value of each particle and the gradient of the fitness function at its location;
step 2.3, for each particle, compare its fitness value with that of the best position the individual has experienced; if the current value is better, take the current position as its new personal best;
step 2.4, for each particle, compare its fitness value with that of the best position experienced by the swarm; if the current value is better, take the current position as the new swarm best;
step 2.5, when the population diversity value is larger than the set threshold, update the velocity of each particle according to equation (2), otherwise according to equation (5), and then update the particle positions;
The deterministic particle swarm optimization algorithm is divided into two stages. The first stage is the mutual attraction of particles and itself comprises two steps: first, while the population diversity value is above a suitable threshold, the particles move toward the globally best particle along the negative gradient direction of the fitness function at their positions; second, once the neighborhood of an optimum is reached, a gradual deceleration strategy is adopted, continuously reducing the particle velocity to perform a line search. These two steps are described by equations (2) and (3) respectively.
v_ij(t+1) = w*gra(i, j) + c2*rand()*(p_g − x_ij(t))    (2)
v_ij(t+1) = k*v_ij(t)    (3)
where V_i = (v_i1, v_i2, ..., v_in) is the current flight velocity of particle i, X_i = (x_i1, x_i2, ..., x_in) is the current position of particle i, w is the inertia weight, p_g is the global best position, and k is a constant in (0, 1); for the fitness function f(x), the corresponding negative gradient gra(i, j) is:
gra(i, j) = −∂f(X_i)/∂x_ij    (4)
The second stage is the mutual repulsion of the particles. When the population diversity value falls below the predetermined threshold, the particles adaptively repel one another to increase the population diversity, while still searching along the gradient direction toward other local optima. Clearly, the larger the population diversity, the smaller the dispersion speed; the smaller the diversity, the larger the dispersion speed. The particle velocity update formula is as follows:
v_ij(t+1) = −[w*gra(i, j) + c2*rand()*(p_g − x_ij(t))] / diversity(S)    (5)
where diversity is the population diversity, calculated by equation (6):

diversity(S) = (1 / (|S|·|L|)) · Σ_{i=1..|S|} sqrt( Σ_{j=1..N} (p_ij − p̄_j)² )    (6)

where S is the population, |S| is the number of particles it contains, |L| is the longest radius of the search space, N is the dimension of the problem, p_ij is the jth component of the ith particle, and p̄_j is the average of the jth component over all particles.
Step 2.6, if the termination condition is not met, go to step 2.2; otherwise output the best fitness value.
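The update loop of steps 2.1 to 2.6 can be sketched on a toy quadratic fitness whose gradient is known in closed form. The repulsion-phase scaling by the reciprocal of the diversity and the velocity clipping are our reading of the description, not the patent's exact formulas:

```python
import numpy as np

# Toy sketch of the two-phase IGPSO update: the attraction phase follows the
# negative gradient plus a pull toward the global best (eq. (2)); the repulsion
# phase disperses particles faster the lower the diversity (an assumption).
def fitness(x):
    return float(np.sum(x * x))          # f(x) = ||x||^2, minimum at the origin

def neg_gradient(x):
    return -2.0 * x                      # gra(i, j) = -df/dx_ij

def diversity(P, L):
    # Eq. (6): mean distance of particles to the swarm centroid, scaled by |S|*|L|
    centroid = P.mean(axis=0)
    return float(np.sqrt(((P - centroid) ** 2).sum(axis=1)).sum() / (len(P) * L))

rng = np.random.default_rng(1)
n_particles, dim, L = 10, 2, 10.0
P = rng.uniform(-5, 5, size=(n_particles, dim))   # positions x
V = np.zeros_like(P)                              # velocities v
w, c2, sigma_thresh = 0.5, 1.5, 1e-3              # inertia, social weight, diversity threshold
p_g = P[np.argmin([fitness(p) for p in P])].copy()
initial_best = fitness(p_g)

for _ in range(100):
    div = diversity(P, L)
    for i in range(n_particles):
        if div > sigma_thresh:   # attraction: negative-gradient step toward p_g
            V[i] = w * neg_gradient(P[i]) + c2 * rng.random() * (p_g - P[i])
        else:                    # repulsion: disperse, faster when diversity is low
            V[i] = -(w * neg_gradient(P[i]) + c2 * rng.random() * (p_g - P[i])) / max(div, 1e-9)
        V[i] = np.clip(V[i], -L, L)   # step-size control to keep the sketch stable
        P[i] = P[i] + V[i]
    cand = P[np.argmin([fitness(p) for p in P])]
    if fitness(cand) < fitness(p_g):  # keep the best-so-far swarm position
        p_g = cand.copy()
```

The best-so-far fitness is non-increasing by construction, which mirrors the role of the global best position p_g in the description.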
Further, the optimizing the support vector machine by using the deterministic particle swarm optimization algorithm in the step 3 comprises the following steps:
step 3.1: setting the C, sigma parameter search space, x, of the SVMi,min≤xi≤xi,maxWhere C is a penalty factor, σ is a kernel parameter, xiI represents the number of parameters, and is set to be 2, and a parameter value x is randomly selected on a search space at the beginning of the algorithm;
The classification rule of the SVM is given by equation (7):

f(x) = sgn( Σ_{i=1..r} α_i·y_i·K(x_i, x) + b )    (7)

for the training set T = {(x_i, y_i); x_i ∈ R^n; y_i ∈ {+1, −1}; i = 1, 2, ..., r}, where x_i are the training samples, x is the sample to be judged, b is the threshold, α_i are the Lagrange multipliers, and K(x_i, x) is the kernel function.
The optimization problem solved by the support vector machine and the classification decision function constructed from it are as follows:

max_α W(α) = Σ_{i=1..r} α_i − (1/2)·Σ_{i=1..r} Σ_{j=1..r} α_i·α_j·y_i·y_j·K(x_i, x_j)    (8)
s.t. Σ_{i=1..r} α_i·y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, ..., r    (9)
f(x) = sgn( Σ_{i=1..r} α_i·y_i·K(x, x_i) + b )    (10)

where K(x, x_i) is the kernel function, whose role is to map the feature space into a high-dimensional space, x_i are the training samples, b is the threshold, and α_i are the Lagrange multipliers. In practical applications the number of characteristic genes is small, so an RBF-based SVM classifier is adopted to classify the tumor samples; the RBF kernel is:

K(x, x_i) = exp( −‖x − x_i‖² / (2σ²) )
Step 3.2: set the particle swarm size to N, the required classification accuracy to F, the expansion factor to Ex, the local area size to w = [w1, w2], and the maximum retry number to T_max; the retry counter t and the expansion factor Ex start at 0;
Step 3.3: starting from the initialized search space p = [p1, p2], expand the search space according to the expansion factor Ex and compute the local position so that it falls within the search space p + 0.6·Ex·w;
Step 3.4: compute the classification performance function f_p corresponding to x;
Step 3.5: search for an optimal value with the IGPSO algorithm and obtain the classification performance function f_c corresponding to that optimum;
Step 3.6: if a better classification rate is found (f_p < f_c), set t = 0 and Ex = 0; otherwise set t = t + 1;
Step 3.7: if t ≥ T_max, set t = 0 and Ex = Ex + 1; in this case the search may be trapped in a local optimum, so the search range is enlarged in order to jump out of the current local area;
Step 3.8: if the classification accuracy requirement is met (f_p ≥ F), output the values of {C, σ} and the classification accuracy and end the algorithm; otherwise go to step 3.3.
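The retry-and-expand control flow of steps 3.2 to 3.8 can be sketched with the SVM cross-validation replaced by a hypothetical stand-in score surface over (C, σ); the surface, bounds, step sizes and thresholds below are illustrative only, not the patent's values:

```python
import random

# Sketch of the retry/expansion loop: accept improving candidates, count retries,
# and widen the local search area when stuck. score() is a made-up stand-in for
# the SVM classification rate, peaking near C = 8, sigma = 1.4 (hypothetical).
def score(C, sigma):
    return 1.0 / (1.0 + 0.02 * (C - 8.0) ** 2 + 0.5 * (sigma - 1.4) ** 2)

random.seed(0)
bounds = [(0.0, 16.0), (0.0, 6.0)]   # search space for C and sigma
w = [0.3, 0.1]                       # local-area sizes (expansion step lengths)
F, T_max = 0.95, 10                  # required accuracy F, maximum retry number T_max
t, Ex = 0, 0                         # retry counter and expansion factor
C, sigma = 8.5, 2.0                  # initial point p = [p1, p2]
start_score = score(C, sigma)
f_p = start_score

for _ in range(500):
    if f_p >= F:                     # step 3.8: accuracy requirement met
        break
    # step 3.3: sample a candidate inside the (expanded) local area around p
    span = [0.6 * (Ex + 1) * wi for wi in w]
    cand = [min(max(v + random.uniform(-s, s), lo), hi)
            for v, s, (lo, hi) in zip((C, sigma), span, bounds)]
    f_c = score(*cand)               # steps 3.4/3.5: evaluate (IGPSO stand-in)
    if f_p < f_c:                    # step 3.6: improvement found, reset counters
        (C, sigma), f_p, t, Ex = cand, f_c, 0, 0
    else:
        t += 1
    if t >= T_max:                   # step 3.7: likely stuck, widen the search area
        t, Ex = 0, Ex + 1
```

In the patent the candidate evaluation is the IGPSO-driven SVM training; here only the surrounding retry/expansion bookkeeping is shown.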
Beneficial effects: the high-dimensional, small-sample tumor gene expression profile data set contains much useless data, while the support vector machine generalizes well and is widely used for data classification. However, the classification performance of the support vector machine depends on parameter selection, a problem that has never been solved well and that greatly limits the application of the SVM. The deterministic-search particle swarm optimization algorithm performs local search by means of gradient information: when a particle reaches the neighborhood of an optimum, its velocity is continuously reduced so that the step length of the line search does not become too large. Globally, the algorithm applies the diversity measure and the attraction-repulsion principle, adaptively repelling particles to restore diversity when a local optimum is reached, and finally converges quickly to a high-precision solution.
Drawings
FIG. 1 is a block diagram of the present invention;
FIG. 2 is a flow chart of a deterministic particle swarm optimization algorithm in the present invention;
Detailed Description
A tumor identification method based on deterministic particle swarm optimization and a support vector machine comprises screening genes with a classification information index and a pairwise redundancy method, and performing tumor gene identification with a support vector machine optimized by a deterministic particle swarm optimization algorithm (IGPSO), in the following steps:
step 1, preprocessing a tumor gene expression profile data set, namely dividing the tumor gene expression profile data set into a training set and a testing set, and then carrying out normalization processing on the data set to obtain a final key gene subset;
step 2, providing a deterministic particle swarm optimization algorithm (IGPSO);
step 3, optimizing a Support Vector Machine (SVM) by using a deterministic particle swarm optimization algorithm on a training set;
step 4, on the test set, using the support vector machine SVM obtained by optimization in the step 3 to identify a tumor gene expression profile data set;
further, the step 1 comprises the following steps:
step 1.1, dividing a tumor gene expression profile data set into a training set and a testing set;
step 1.2, the classification information index of each gene in the training set is calculated according to equation (1):

d(g) = |μ1(g) − μ2(g)| / (σ1(g) + σ2(g))    (1)

where d(g) is the classification information index of gene g, μ1(g) and μ2(g) are the mean expression levels of gene g in the positive and negative sample classes respectively, and σ1(g) and σ2(g) are the standard deviations of the expression level of gene g in the two classes.
Step 1.3, select all genes whose classification information index exceeds a given threshold as the initially filtered gene set.
Step 1.4, after the preliminary filtering with the classification information index method, calculate the Pearson correlation coefficient between the expression levels of each pair of selected genes and remove one gene of every pair whose coefficient exceeds a given value, further reducing the size of the candidate gene library.
Step 1.5, apply the classification information index method once more to the candidate gene library and select all genes above a given threshold as the final key gene subset.
Further, the step 2 comprises the following steps:
step 2.1, randomly initialize the position x and velocity v of each particle of the swarm within the initial range, and set the population diversity threshold σ;
step 2.2 calculate the fitness value of each particle and the gradient of the fitness function at its location;
step 2.3, for each particle, compare its fitness value with that of the best position the individual has experienced; if the current value is better, take the current position as its new personal best;
step 2.4, for each particle, compare its fitness value with that of the best position experienced by the swarm; if the current value is better, take the current position as the new swarm best;
step 2.5, when the population diversity value is larger than the set threshold, update the velocity of each particle according to equation (2), otherwise according to equation (5), and then update the particle positions;
The deterministic particle swarm optimization algorithm is divided into two stages. The first stage is the mutual attraction of particles and itself comprises two steps: first, while the population diversity value is above a suitable threshold, the particles move toward the globally best particle along the negative gradient direction of the fitness function at their positions; second, once the neighborhood of an optimum is reached, a gradual deceleration strategy is adopted, continuously reducing the particle velocity to perform a line search. These two steps are described by equations (2) and (3) respectively.
v_ij(t+1) = w*gra(i, j) + c2*rand()*(p_g − x_ij(t))    (2)
v_ij(t+1) = k*v_ij(t)    (3)
where V_i = (v_i1, v_i2, ..., v_in) is the current flight velocity of particle i, X_i = (x_i1, x_i2, ..., x_in) is the current position of particle i, w is the inertia weight, p_g is the global best position, and k is a constant in (0, 1); for the fitness function f(x), the corresponding negative gradient gra(i, j) is:
gra(i, j) = −∂f(X_i)/∂x_ij    (4)
The second stage is the mutual repulsion of the particles. When the population diversity value falls below the predetermined threshold, the particles adaptively repel one another to increase the population diversity, while still searching along the gradient direction toward other local optima. Clearly, the larger the population diversity, the smaller the dispersion speed; the smaller the diversity, the larger the dispersion speed. The particle velocity update formula is as follows:
v_ij(t+1) = −[w*gra(i, j) + c2*rand()*(p_g − x_ij(t))] / diversity(S)    (5)
where diversity is the population diversity, calculated by equation (6):

diversity(S) = (1 / (|S|·|L|)) · Σ_{i=1..|S|} sqrt( Σ_{j=1..N} (p_ij − p̄_j)² )    (6)

where S is the population, |S| is the number of particles it contains, |L| is the longest radius of the search space, N is the dimension of the problem, p_ij is the jth component of the ith particle, and p̄_j is the average of the jth component over all particles.
Step 2.6, if the termination condition is not met, go to step 2.2; otherwise output the best fitness value.
Further, the step 3 comprises the following steps:
Step 3.1: set the search space of the SVM parameters C and σ as x_{i,min} ≤ x_i ≤ x_{i,max}, where C is the penalty factor, σ is the kernel parameter, x_i is the ith parameter, and the number of parameters is set to 2; at the start of the algorithm, a parameter value x is randomly selected in the search space;
The classification rule of the SVM is given by equation (7):

f(x) = sgn( Σ_{i=1..r} α_i·y_i·K(x_i, x) + b )    (7)

for the training set T = {(x_i, y_i); x_i ∈ R^n; y_i ∈ {+1, −1}; i = 1, 2, ..., r}, where x_i are the training samples, x is the sample to be judged, b is the threshold, α_i are the Lagrange multipliers, and K(x_i, x) is the kernel function.
The optimization problem solved by the support vector machine and the classification decision function constructed from it are as follows:

max_α W(α) = Σ_{i=1..r} α_i − (1/2)·Σ_{i=1..r} Σ_{j=1..r} α_i·α_j·y_i·y_j·K(x_i, x_j)    (8)
s.t. Σ_{i=1..r} α_i·y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, ..., r    (9)
f(x) = sgn( Σ_{i=1..r} α_i·y_i·K(x, x_i) + b )    (10)

where K(x, x_i) is the kernel function, whose role is to map the feature space into a high-dimensional space, x_i are the training samples, b is the threshold, and α_i are the Lagrange multipliers. In practical applications the number of characteristic genes is small, so an RBF-based SVM classifier is adopted to classify the tumor samples; the RBF kernel is:

K(x, x_i) = exp( −‖x − x_i‖² / (2σ²) )
Step 3.2: set the particle swarm size to N, the required classification accuracy to F, the expansion factor to Ex, the local area size to w = [w1, w2], and the maximum retry number to T_max; the retry counter t and the expansion factor Ex start at 0;
Step 3.3: starting from the initialized search space p = [p1, p2], expand the search space according to the expansion factor Ex and compute the local position so that it falls within the search space p + 0.6·Ex·w;
Step 3.4: compute the classification performance function f_p corresponding to x;
Step 3.5: search for an optimal value with the IGPSO algorithm and obtain the classification performance function f_c corresponding to that optimum;
Step 3.6: if a better classification rate is found (f_p < f_c), set t = 0 and Ex = 0; otherwise set t = t + 1;
Step 3.7: if t ≥ T_max, set t = 0 and Ex = Ex + 1; in this case the search may be trapped in a local optimum, so the search range is enlarged in order to jump out of the current local area;
Step 3.8: if the classification accuracy requirement is met (f_p ≥ F), output the values of {C, σ} and the classification accuracy and end the algorithm; otherwise go to step 3.3.
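The RBF decision rule used in these steps, f(x) = sgn(Σ α_i·y_i·K(x_i, x) + b) with K(x, x_i) = exp(−‖x − x_i‖²/(2σ²)), can be evaluated by hand on made-up support vectors and multipliers; every numeric value below is hypothetical, chosen only to make the two phases of the sign visible:

```python
import math

# Hand evaluation of the RBF-SVM decision rule on fabricated support vectors.
def rbf(x, xi, sigma):
    # K(x, xi) = exp(-||x - xi||^2 / (2 sigma^2))
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2 * sigma ** 2))

def decide(x, svs, alphas, ys, b, sigma):
    # f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b )
    s = sum(a * y * rbf(x, xi, sigma) for a, y, xi in zip(alphas, ys, svs)) + b
    return 1 if s >= 0 else -1

svs    = [(0.0, 0.0), (2.0, 2.0)]   # hypothetical support vectors
alphas = [1.0, 1.0]                 # hypothetical Lagrange multipliers
ys     = [1, -1]                    # their class labels
b, sigma = 0.0, 1.0                 # hypothetical threshold and kernel width

label_near_pos = decide((0.1, -0.1), svs, alphas, ys, b, sigma)  # near the +1 vector
label_near_neg = decide((1.9, 2.1), svs, alphas, ys, b, sigma)   # near the -1 vector
```

A query point near a support vector is dominated by that vector's kernel term, so the sign of the sum follows its label, which is the geometric intuition behind equation (7).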
The implementation of the invention is briefly illustrated below, taking tumor gene expression profile data as an example. This example uses a colon cancer tumor expression profile data set comprising 62 samples, each represented by the expression level values of 2000 genes; the 62 samples consist of 22 normal samples and 40 tumor samples. On this data set, the specific implementation steps of the invention are as follows:
As shown in fig. 1, the tumor identification method based on deterministic particle swarm optimization and a support vector machine comprises screening genes with a classification information index and a pairwise redundancy method, and performing tumor gene identification with a support vector machine optimized by deterministic particle swarm optimization (IGPSO), in the following steps:
(1) The data set is divided into a training set and a testing set, and the improved signal-to-noise ratio formula of the classification information index method is computed for each gene on the training set. The larger a gene's information index, the more sample classification information it carries and, correspondingly, the stronger its ability to discriminate the sample classes. Table 1 shows the classification information distribution of the colon cancer data set; the 173 genes with an information index greater than 0.5 were selected as the gene subset analyzed below.
(2) Redundant genes were excluded by calculating the Pearson correlation coefficient between the expression levels of each pair of genes. The colon cancer data were analyzed with the 173 genes selected above; pairwise redundancy calculation and comparison finally yielded 59 genes.
(3) The resulting gene subset is evaluated again with the improved signal-to-noise ratio formula of the classification information index, and the 11 genes of the colon cancer data set with the largest information indexes are selected as the final key gene subset. Table 2 shows the key gene subset screened for colon cancer.
(4) The two parameters of the support vector machine are initialized with the search range {0 < C < 16, 0 < σ < 6}; the maximum retry number is set to 10, the expansion step length of C to 0.3, and that of σ to 0.1. The IGPSO algorithm searches along the gradient direction of the parameters {C, σ} within a local area according to the classification rate (the performance function) of the particles; if the maximum retry number is reached without a better classification rate being found, the search range is expanded.
(5) On the test set, classification is performed with the SVM optimized by the IGPSO algorithm. Table 3 shows the classification of the colon cancer samples.
Table 1 shows the colon cancer classification information index distribution.
TABLE 1 Colon cancer Classification information index distribution in the present invention
Gene information index | Number of genes | Share of the 2000 genes (%)
0.0 - 0.3              | 1524            | 76.2
0.3 - 0.5              | 303             | 15.15
0.5 - 1.897            | 173             | 8.65
Table 2 shows the subset of colon cancer genes to be classified
TABLE 2 Colon cancer gene sets of the present invention
Table 3 shows the classification results on the colon cancer samples. When the penalty factor C is small, the classification error rate is high; as C increases, the error rate drops sharply, i.e. the classification performance rises quickly. As C continues to increase, the change in performance becomes insignificant, and once C exceeds a certain value the performance no longer varies with C; that is, the SVM is insensitive to C over a large range. In the experiments, the classification accuracy is high for C in the range (6, 15), i.e. the SVM is insensitive to C in this region. The experiments also show that, in the state of best classification effect, appropriately reducing (correcting) the value of σ noticeably improves the classification accuracy; the final results show that σ between 0.9 and 1.88 gives a good classification effect.
TABLE 3 Classification of the invention on Colon cancer samples
[Table 3 is provided as an image (BDA0001237452020000102) in the original publication.]
Table 4 shows a comparison of the proposed method of the present invention with the SVM related method.
TABLE 4 comparison of the method of the invention with methods related to SVM
[Table 4 is provided as images (BDA0001237452020000103, BDA0001237452020000111) in the original publication.]
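The C-insensitivity behaviour described above for Table 3 can be reproduced in spirit on synthetic data. The sketch below (scikit-learn; the dataset, the σ value, and the gamma = 1/(2σ²) mapping are illustrative assumptions) sweeps C and reports cross-validated accuracy, which typically rises sharply for small C and then flattens:

```python
# Hedged sketch of the C-sensitivity observation: accuracy improves as C
# grows, then plateaus. Synthetic data stands in for the colon-cancer set.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, flip_y=0.05,
                           random_state=1)
sigma = 1.2
accs = {}
for C in (0.01, 0.1, 1.0, 8.0, 15.0):
    clf = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
    accs[C] = cross_val_score(clf, X, y, cv=5).mean()
print(accs)  # accuracy per C value
```

The plateau for larger C is what makes a coarse search range such as (6, 15) workable before the finer σ correction.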
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (3)

1. A tumor identification method based on deterministic particle swarm optimization and a support vector machine is characterized by comprising the following steps:
step 1, preprocessing a tumor gene expression profile dataset: dividing it into a training set and a test set, then normalizing the data to obtain the final key gene subset; step 2, proposing a deterministic particle swarm optimization algorithm (IGPSO) and using it to optimize a support vector machine (SVM) on the training set; step 3, on the test set, identifying the tumor gene expression profile data with the SVM optimized in step 2;
the step 2 of optimizing the support vector machine SVM by using the deterministic particle swarm optimization algorithm comprises the following steps:
step 3.1: setting the search space of the two SVM parameters {C, σ}, m_(i,min) ≤ m_i ≤ m_(i,max), where C is the penalty factor, σ is the kernel parameter, m_i denotes the value of the ith parameter, and the number of parameters is set to 2; at the start of the algorithm a parameter value m is randomly selected from the search space; the classification rule equation of the SVM is as follows (7):
f(x) = sgn( sum_{i=1..r} α_i*y_i*K(x_i, x) + b )    (7)
for the training set T = {(x_i, y_i); x_i ∈ R^n; y_i = ±1; i = 1, 2, ..., r}, where x_i are the training samples, x is the sample to be judged, b is the threshold, α_i is the Lagrange multiplier, and K(x_i, x) is the kernel function;
the optimization problem solved by the support vector machine and the constructed classification decision function are as follows:
max_α sum_{i=1..r} α_i − (1/2)*sum_{i=1..r} sum_{j=1..r} α_i*α_j*y_i*y_j*K(x_i, x_j)
s.t. sum_{i=1..r} α_i*y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, ..., r
f(x) = sgn( sum_{i=1..r} α_i*y_i*K(x, x_i) + b )
where K(x, x_i) is the kernel function, x_i are the training samples, b is the threshold, and α_i is the Lagrange multiplier; the kernel function maps the feature space into a high-dimensional space; in practical application the number of characteristic genes is small, so an RBF-based SVM classifier is adopted to classify the tumor samples, and the RBF kernel is expressed as:
K(x, x_i) = exp(−‖x − x_i‖² / (2σ²))
step 3.2: setting the particle swarm size to N, the classification accuracy requirement to F, the expansion factor to Ex, the local region size to w = [w1, w2], and the maximum retry number to T_max; the retry number t and the expansion factor are initialized to 0;
step 3.3: the algorithm starts from the initialized search space p = [p1, p2], expands the search space according to the expansion factor Ex, and calculates the local position according to steps 3.4 to 3.7 so that it falls within the search space p + 0.6·Ex·w;
step 3.4: calculating the classification performance function f_p corresponding to x;
step 3.5: searching for the optimal value with the IGPSO algorithm and obtaining the classification performance function f_c corresponding to the optimal value;
step 3.6: if a better classification rate is found, i.e., f_p < f_c, setting t = 0 and Ex = 0; otherwise t = t + 1;
step 3.7: if t ≥ T_max, setting t = 0 and Ex = Ex + 1; the search may be trapped in a local optimum, so the search range is increased to jump out of the current local region;
step 3.8: if the classification accuracy requirement is met, i.e., f_p ≤ F, outputting the value of {C, σ} and the classification accuracy and ending the algorithm; otherwise going to step 3.3;
the two parameters of the support vector machine are initialized with the search range {0 < C < 16, 0 < σ < 6}; the maximum retry number is set to 10, the expansion step length of C is 0.3, and the expansion step length of σ is 0.1; the two parameters of the support vector machine are optimized by the IGPSO algorithm in combination with the final key gene subset; the IGPSO algorithm searches along the gradient direction of the parameters {C, σ} according to the classification rate of the performance function in a local region, and if the maximum retry number is reached without finding a better classification rate, the search range is expanded.
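The outer parameter-search loop of steps 3.2 to 3.8 can be sketched as follows. This is a minimal illustration under stated assumptions: the objective is a toy stand-in for "1 − cross-validated accuracy" (lower is better), and a random local search stands in for the inner IGPSO optimizer; all names here are illustrative, not the patent's code.

```python
# Hedged sketch of steps 3.2-3.8: keep a retry counter t and an expansion
# factor Ex; reset both when a better value is found, otherwise widen the
# local search region after T_max failed retries.
import random

def objective(C, sigma):
    # toy stand-in for "1 - classification accuracy", minimised near (8, 1.2)
    return (C - 8.0) ** 2 / 100.0 + (sigma - 1.2) ** 2

def outer_search(T_max=10, F=0.05, w=(0.3, 0.1), iters=2000, seed=0):
    rng = random.Random(seed)
    p = [rng.uniform(0.0, 16.0), rng.uniform(0.0, 6.0)]  # initial {C, sigma}
    f_p = objective(*p)
    t, Ex = 0, 0                       # retry counter, expansion factor
    for _ in range(iters):
        # candidate in the local region, widened by the expansion factor
        span = [wi * (1 + Ex) for wi in w]
        cand = [max(1e-6, p[i] + rng.uniform(-span[i], span[i]))
                for i in range(2)]
        f_c = objective(*cand)
        if f_c < f_p:                  # better value found: reset t and Ex
            p, f_p, t, Ex = cand, f_c, 0, 0
        else:
            t += 1
        if t >= T_max:                 # possibly trapped: enlarge the region
            t, Ex = 0, Ex + 1
        if f_p <= F:                   # accuracy requirement met
            break
    return p, f_p

best, f_best = outer_search()
print(best, f_best)
```

The reset-on-improvement / expand-on-stagnation pattern is the claim's mechanism for escaping local optima without restarting the whole search.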
2. The method for tumor identification based on deterministic particle swarm optimization and support vector machine according to claim 1, wherein the step 1 comprises the following steps:
step 1.1, dividing a tumor gene expression profile data set into a training set and a testing set;
step 1.2: according to formula (1), calculating the classification information index of each gene in the training set:
d(g) = |μ⁺(g) − μ⁻(g)| / (σ⁺(g) + σ⁻(g))    (1)
where d(g) is the classification information index of gene g, μ⁺(g) and μ⁻(g) are the mean expression levels of gene g in the positive and negative sample classes respectively, and σ⁺(g) and σ⁻(g) are the standard deviations of the expression level of gene g in the positive and negative sample classes respectively;
step 1.3: selecting all genes whose classification information index exceeds a given threshold as the preliminarily filtered gene set;
step 1.4: after the preliminary filtering with the classification information index method, calculating the Pearson correlation coefficient between the expression levels of each pair of genes and selecting the gene set according to a given value, reducing the size of the candidate gene library again;
step 1.5: to further narrow the key gene set, applying the classification information index method again to the candidate gene library and selecting all genes above a given threshold as the final key gene subset.
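The preselection pipeline of claim 2 can be sketched as follows. The expression matrix, the thresholds, and the interpretation of the Pearson step (dropping the lower-ranked gene of any highly correlated pair, which is one common reading of redundancy reduction) are illustrative assumptions, not the patent's exact procedure.

```python
# Hedged sketch of claim 2: rank genes by the classification information
# index d(g) = |mu+ - mu-| / (sd+ + sd-), then prune correlated genes.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(62, 200))   # toy expression matrix: samples x genes
y = rng.integers(0, 2, size=62)  # 1 = positive class, 0 = negative class

def info_index(X, y):
    # per-gene (per-column) classification information index, formula (1)
    mu_p, mu_n = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    sd_p, sd_n = X[y == 1].std(axis=0), X[y == 0].std(axis=0)
    return np.abs(mu_p - mu_n) / (sd_p + sd_n)

d = info_index(X, y)
keep = np.where(d > 0.1)[0]      # step 1.3: threshold filter (toy threshold)

# step 1.4 (one reading): drop the lower-ranked gene of any pair whose
# expression levels have |Pearson r| above a cutoff
corr = np.corrcoef(X[:, keep], rowvar=False)
redundant = set()
for a in range(len(keep)):
    for b in range(a + 1, len(keep)):
        if abs(corr[a, b]) > 0.9:
            redundant.add(keep[b] if d[keep[a]] >= d[keep[b]] else keep[a])
final = [g for g in keep if g not in redundant]
print(len(final))
```

Step 1.5 would simply re-apply `info_index` on the surviving columns with a higher threshold.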
3. The method for tumor identification based on deterministic particle swarm optimization and support vector machine according to claim 1, wherein the step 2 of proposing the deterministic particle swarm optimization algorithm IGPSO comprises the following steps:
step 2.1: randomly initializing the position X and velocity v of the particle swarm within the initial range, together with a population diversity threshold for each fitness function;
step 2.2: calculating the fitness value of each particle and the gradient of the fitness function at its position;
step 2.3: for each particle, comparing its fitness value with that of the best position it has experienced; if better, taking the current position as the particle's best position;
step 2.4: for each particle, comparing its fitness value with that of the best position experienced by the swarm; if better, taking the current position as the swarm's best position;
step 2.5: when the population diversity value is larger than the set threshold, updating the velocity of each particle according to formula (2), otherwise according to formula (5), and then updating the particle positions;
the deterministic particle swarm optimization algorithm is divided into two stages; the first stage is the mutual attraction of particles and consists of two steps: first, when the population diversity value is larger than a suitable threshold, the particles gather toward the globally optimal particle along the direction of the negative gradient of the fitness function at their positions; second, once the neighborhood of an optimal point is reached, a linear search is performed with a gradually descending strategy that continuously reduces the particle velocity; the two steps of this stage are described by formula (2) and formula (3) respectively:
v_ij(t+1) = w*gra(i, j) + c2*rand()*(p_g − x_ij(t))    (2)
v_ij(t+1) = k*v_ij(t)    (3)
where V_i = (v_i1, v_i2, ..., v_in) is the current flight velocity of particle i, X_i = (x_i1, x_i2, ..., x_in) is the current position of particle i, w is the inertia weight, p_g is the global best position, and k is a constant in (0, 1); for the fitness function f(x), the corresponding negative gradient gra(i, j) is as follows:
gra(i, j) = −∂f(X_i)/∂x_ij    (4)
the second stage is the mutual repulsion of particles; when the population diversity value is smaller than the preset threshold, the particles are adaptively repelled to improve population diversity while still searching along the gradient direction and approaching other local optima; the larger the population diversity, the smaller the dispersion velocity, and the smaller the population diversity, the larger the dispersion velocity; the particle velocity update formula is as follows:
[Formula (5), the repulsion-phase velocity update, is provided as an image (FDA0002479102700000041) in the original publication.]
where diversity is the population diversity calculated by formula (6):
diversity(S) = (1/(|S|·|L|)) · sum_{i=1..|S|} sqrt( sum_{j=1..N} (p_ij − p̄_j)² )    (6)
where S is the population, |S| is the number of particles in the population, |L| is the longest radius of the search space, N is the dimension of the problem, p_ij is the jth component of the ith particle, and p̄_j is the mean of the jth components of all particles;
step 2.6: if the termination condition is not met, going to step 2.2; otherwise outputting the fitness value.
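The two-phase velocity rule of claim 3 can be sketched as follows. When diversity exceeds the threshold, particles follow the negative gradient toward the global best per formula (2); because formula (5) appears only as an image in the original, the repulsion step below (pushing away from the centroid, faster when diversity is low) is an illustrative stand-in, and all other names are assumptions for the sketch.

```python
# Hedged sketch of the IGPSO two-phase velocity update on a toy problem.
import numpy as np

rng = np.random.default_rng(0)

def diversity(P, L):
    # formula (6): summed distance of particles from the swarm centroid,
    # normalised by swarm size |S| and the longest search-space radius |L|
    centroid = P.mean(axis=0)
    return np.sqrt(((P - centroid) ** 2).sum(axis=1)).sum() / (len(P) * L)

def velocity_update(P, grad, p_g, div, threshold, w=0.7, c2=1.5):
    if div > threshold:
        # attraction phase, formula (2): v = w*gra + c2*rand()*(p_g - x),
        # where gra is the negative gradient of the fitness function
        return w * (-grad) + c2 * rng.random(P.shape) * (p_g - P)
    # repulsion phase: illustrative stand-in for formula (5); disperse
    # away from the centroid, faster when diversity is low
    centroid = P.mean(axis=0)
    return 0.01 * (P - centroid) / (div + 1e-9)

# toy swarm minimising f(x) = x1^2 + x2^2 (gradient 2x) on [-5, 5]^2
P = rng.uniform(-5.0, 5.0, size=(20, 2))
p_g = P[np.argmin((P ** 2).sum(axis=1))]      # global best position
div = diversity(P, L=np.sqrt(2) * 10.0)       # |L|: diagonal of the box
V = velocity_update(P, grad=2.0 * P, p_g=p_g, div=div, threshold=0.05)
P = P + V
print(P.shape)
```

Iterating this update while recomputing `div` each step alternates the swarm between exploitation (attraction) and exploration (repulsion), which is the claimed mechanism for avoiding premature convergence.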
CN201710122492.5A 2017-03-03 2017-03-03 Tumor identification method based on deterministic particle swarm optimization and support vector machine Active CN106971091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710122492.5A CN106971091B (en) 2017-03-03 2017-03-03 Tumor identification method based on deterministic particle swarm optimization and support vector machine


Publications (2)

Publication Number Publication Date
CN106971091A CN106971091A (en) 2017-07-21
CN106971091B true CN106971091B (en) 2020-08-28

Family

ID=59328372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710122492.5A Active CN106971091B (en) 2017-03-03 2017-03-03 Tumor identification method based on deterministic particle swarm optimization and support vector machine

Country Status (1)

Country Link
CN (1) CN106971091B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107643948B (en) * 2017-09-30 2020-06-02 Oppo广东移动通信有限公司 Application program control method, device, medium and electronic equipment
CN108629158A (en) * 2018-05-14 2018-10-09 浙江大学 A kind of intelligent Lung Cancer cancer cell detector
CN108875305A (en) * 2018-05-14 2018-11-23 浙江大学 A kind of leukaemia cancer cell detector of colony intelligence optimizing
CN109947941A (en) * 2019-03-05 2019-06-28 永大电梯设备(中国)有限公司 A kind of method and system based on elevator customer service text classification
CN110060740A (en) * 2019-04-16 2019-07-26 中国科学院深圳先进技术研究院 A kind of nonredundancy gene set clustering method, system and electronic equipment
CN111383710A (en) * 2020-03-13 2020-07-07 闽江学院 Gene splice site recognition model construction method based on particle swarm optimization gemini support vector machine
CN111582370B (en) * 2020-05-08 2023-04-07 重庆工贸职业技术学院 Brain metastasis tumor prognostic index reduction and classification method based on rough set optimization
CN113707216A (en) * 2021-08-05 2021-11-26 北京科技大学 Infiltration immune cell proportion counting method
CN113808659B (en) * 2021-08-26 2023-06-13 四川大学 Feedback phase regulation and control method based on gene gradient particle swarm algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258244A (en) * 2013-04-28 2013-08-21 西北师范大学 Method for predicting inhibiting concentration of pyridazine HCV NS5B polymerase inhibitor based on particle swarm optimization support vector machine
CN105372202A (en) * 2015-10-27 2016-03-02 九江学院 Genetically modified cotton variety recognition method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
An improved particle swarm optimization algorithm based on gradient search; Han Fei et al.; Journal of Nanjing University (Natural Science); March 2013; Vol. 49, No. 2; pp. 196-201 *
An improved SVM training algorithm based on particle swarm optimization; Tong Yan et al.; Computer Engineering and Applications; December 2008; Vol. 44, No. 20; pp. 138-141 *
Research on tumor gene identification under support vector machine classification models; Hao Aili; China Master's Theses Full-text Database, Information Science and Technology; August 15, 2013; I138-528 *

Also Published As

Publication number Publication date
CN106971091A (en) 2017-07-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant