CN110782950A - Tumor key gene identification method based on preference grid and Levy flight multi-target particle swarm algorithm

Publication number
CN110782950A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910903327.2A
Other languages
Chinese (zh)
Other versions
CN110782950B (en
Inventor
韩飞
管天华
孙郁闻天
方升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G16B40/30 Unsupervised data analysis
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Abstract

The invention discloses a tumor key gene identification method based on a multi-objective particle swarm algorithm with a preference grid and Levy flight. The method comprises: filtering the original gene expression profile data set with a classification information index to obtain an initial gene pool; calculating the gene category sensitivity information (GCS) value of each gene in the initial gene pool and encoding the particles according to the GCS values; constructing a multi-objective optimization model whose objectives are the classification accuracy of a gene subset on an extreme learning machine (ELM) and the size of that gene subset; and searching for the final gene subset with the established multi-objective model, thereby identifying the key genes of the tumor. With this multi-objective model, the method can quickly and efficiently identify, within the initial gene pool, key gene subsets that are small and classify well.

Description

Tumor key gene identification method based on preference grid and Levy flight multi-target particle swarm algorithm
Technical Field
The invention belongs to the field of computer analysis of tumor gene expression profile data, and particularly relates to a tumor key gene identification method based on multi-objective particle swarm optimization with a preference grid and Levy flight.
Background
Microarray technology has been widely used for disease diagnosis since the 1980s. It lets medical and research staff measure the expression levels of thousands of genes simultaneously, producing microarray data. Gene expression profiles make it possible to classify samples and predict their diagnostic class, and such data have been successfully applied to cancer classification. However, complex gene expression profile data still pose many challenges for building effective classifiers. First, the dimensionality of gene expression profile data is high, and the relationships among the dimensions (genes) are complex and unknown. Second, the data set contains a large number of irrelevant samples. Third, the sample size is small, which leads to higher computational complexity and larger prediction error.
Key gene identification, i.e., gene selection (also known as feature selection), is an effective way to improve the predictive performance of a model. It is a key preprocessing step in data mining that identifies the optimal subset of genes in an expression data set by removing redundant, irrelevant, or noisy genes. Depending on how the correlation of each gene with the target class is evaluated, gene selection methods fall broadly into filter methods, wrapper methods, and hybrid methods. Filter methods do not use a classifier to evaluate gene subsets, and most of them ignore the correlation between genes. Wrapper methods couple a predetermined learning algorithm with a classifier and select the optimal gene subset according to prediction accuracy. Filter methods are more efficient than wrapper methods, but wrapper methods classify much better. Hybrid methods combine the two so that their advantages complement each other. However, all of these methods generally treat gene selection as a single-objective problem. Their main drawback is that they make it difficult to explore the possible trade-offs between classification accuracy and the size of the selected gene subset.
Particle swarm optimization (PSO) has strong global search capability and converges quickly. Compared with a genetic algorithm, PSO needs no complex genetic operators, has fewer tunable parameters, and is easy to implement, so it has been widely applied in recent years to key gene identification on tumor expression profile data. In general, tumor key gene identification is a multi-objective problem that involves minimizing the size of the gene subset and maximizing predictive performance. The speed-constrained multi-objective particle swarm optimization algorithm (SMPSO) adds a velocity constraint mechanism that limits a particle's velocity when it becomes too high, preventing swarm explosion. The competition-based multi-objective particle swarm algorithm (CMOPSO) updates particles through pairwise competition rather than through the conventional individual-best and global-best updates. These methods improve convergence and diversity to some extent, but their performance often degrades on complex multi-objective problems, such as non-convex or multi-modal problems. Furthermore, these multi-objective optimization algorithms aim to find all Pareto-optimal solutions, on the assumption that every non-dominated solution is desirable. In practice, the main purpose of key gene identification is to improve the classification performance of the classifier. Key gene identification may therefore prefer to search the regions of the Pareto front where solutions predict better, rather than the regions with fewer genes. From this perspective, these methods waste computation on solutions that are not needed.
Disclosure of Invention
Purpose of the invention: the method identifies gene subsets that are highly related to the tumor type, selects fewer genes, and is more interpretable than conventional methods.
Technical scheme: a tumor key gene identification method based on a multi-objective particle swarm algorithm with a preference grid and Levy flight. The original genes are first screened with a classification information index, the particles are then encoded with GCS information, and the key tumor genes are searched with the multi-objective particle swarm algorithm based on the preference grid and Levy flight. The method comprises the following steps:
Step 1, preprocess the gene expression profile data: divide the original data set into a training set and a test set, and filter the original gene expression profile data set with the classification information index to obtain an initial gene pool;
Step 2, calculate the gene category sensitivity information (GCS) value of each gene in the initial gene pool, and then encode the particles according to the GCS values;
Step 3, establish a multi-objective optimization model whose objectives are the classification accuracy of a gene subset on the extreme learning machine (ELM) and the size of the gene subset;
Step 4, provide a multi-objective particle swarm algorithm based on a preference grid and Levy flight (MOPSO-PAG-LF), and use it to repeatedly search, evaluate, and update the particles and maintain an external archive, so as to obtain gene subsets with higher classification accuracy and smaller size;
Step 5, output the finally identified tumor key genes if the termination condition is met; otherwise, return to step 4.
Further, step 1 comprises the following steps:
Step 1.1, load the original gene data set and divide it into a training set and a test set in a 2:1 ratio;
Step 1.2, calculate the classification information index of each gene according to formula (1), sort the genes in descending order of the index, and add the first 400 genes to the initial gene pool.
[Formula (1): equation image not reproduced]
where $\mu_g^{+}$ and $\mu_g^{-}$ are the means of the expression levels of gene g in the positive (+) and negative (-) classes, and $\sigma_g^{+}$ and $\sigma_g^{-}$ are the standard deviations of the expression levels of gene g in the positive (+) and negative (-) classes, respectively.
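The filtering in step 1.2 can be sketched as follows. Because formula (1) is not reproduced in this text, the score used here is an assumption: a signal-to-noise style ratio built only from the class means and standard deviations defined above. The function names `class_info_index` and `initial_gene_pool` are hypothetical.

```python
import numpy as np

def class_info_index(X, y):
    """Score each gene (column of X) by how well it separates two classes.

    ASSUMPTION: formula (1) is unavailable, so this uses
    |mean_pos - mean_neg| / (std_pos + std_neg) as a stand-in.
    """
    pos, neg = X[y == 1], X[y == 0]
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    sd_p, sd_n = pos.std(axis=0), neg.std(axis=0)
    eps = 1e-12  # avoids division by zero for constant genes
    return np.abs(mu_p - mu_n) / (sd_p + sd_n + eps)

def initial_gene_pool(X, y, k=400):
    """Return the column indices of the k highest-scoring genes."""
    scores = class_info_index(X, y)
    order = np.argsort(scores)[::-1]  # descending by score
    return order[:k]
```

Calling `initial_gene_pool(X_train, y_train, k=400)` on the training split would give the indices of the 400 genes kept for the later steps; only the training samples should be used for scoring so that the test set stays unseen.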
Further, the step 2 comprises the following steps:
Step 2.1, calculate the GCS value of each gene in the initial gene pool according to formulas (2) and (3); the larger the GCS value of a gene, the greater its contribution to classification;
[Formulas (2) and (3): equation images not reproduced]
where $X_{Training}$ is the training sample set; $\beta_{sq}$ is the weight between the s-th hidden layer node and the q-th output node of the ELM; $w_{js}$ is the weight between the j-th input node and the s-th hidden layer node; $hid(s)$ is the input to the s-th hidden layer node; $N_{gnl}$ is the number of genes in the initial gene pool; and $g$ is the activation function of the ELM, taken to be the sigmoid function.
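A rough sketch of how a sensitivity score of this kind might be computed from a trained ELM is given below. Formulas (2) and (3) are not reproduced here, so the exact definition is an assumption: the score of gene j is taken as the average, over training samples, of the summed absolute derivatives of each output with respect to input j, where the derivative is the sum over s of beta_sq * g'(hid(s)) * w_js. The ELM itself is a minimal version (random input weights, pseudoinverse output weights); `train_elm` and `gcs_scores` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, T, n_hidden=20, seed=0):
    """Minimal ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # w_js
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # hidden biases
    H = sigmoid(X @ W + b)                                   # hidden outputs
    beta = np.linalg.pinv(H) @ T                             # beta_sq
    return W, b, beta

def gcs_scores(X, W, b, beta):
    """ASSUMED GCS: mean over samples of sum_q |d output_q / d input_j|."""
    hid = X @ W + b                       # hid(s) for every sample
    g = sigmoid(hid)
    dg = g * (1.0 - g)                    # derivative of the sigmoid
    # jac[n, j, q] = sum_s dg[n, s] * W[j, s] * beta[s, q]
    jac = np.einsum("ns,js,sq->njq", dg, W, beta)
    return np.abs(jac).sum(axis=2).mean(axis=0)
```

Here `T` is a one-hot label matrix with one column per class. Genes with larger scores are those whose expression changes move the ELM outputs the most, which matches the statement in step 2.1 that a larger GCS value means a larger contribution to classification.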
Step 2.2, encode the particles: first sort all genes in descending order of GCS value; initialize the positions corresponding to the top 20% of genes to random numbers in [0, 1] and the positions of the remaining 80% to 0. When the value of a particle's position in a given dimension is greater than 0.5, the gene corresponding to that dimension is selected; otherwise it is not selected.
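The encoding in step 2.2 can be illustrated with a short sketch. The names `init_particle` and `decode` are hypothetical, and rounding the 20% cut-off to the nearest integer (with at least one gene) is an assumption, since the text does not say how fractional counts are handled.

```python
import numpy as np

def init_particle(gcs, top_frac=0.2, rng=None):
    """Random values in [0, 1) for the top-GCS genes, 0 for the rest."""
    rng = np.random.default_rng() if rng is None else rng
    gcs = np.asarray(gcs, dtype=float)
    n_top = max(1, int(round(top_frac * gcs.size)))  # assumed rounding rule
    top = np.argsort(gcs)[::-1][:n_top]              # indices of largest GCS
    position = np.zeros(gcs.size)
    position[top] = rng.random(n_top)
    return position

def decode(position, threshold=0.5):
    """Indices of the genes selected by a particle position."""
    return np.flatnonzero(np.asarray(position) > threshold)
```

With this initialization every particle starts from a small gene subset drawn only from the genes judged most sensitive, so the search begins in the region of the space that favors few, informative genes.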
Further, the step 3 comprises the following steps:
Step 3.1, set the evaluation indices of the multi-objective particle swarm algorithm. There are two indices: accuracy and gene scale. $f_1$ is the accuracy acc(i), the ELM classification accuracy of the i-th particle on the validation set; $f_2$ is the gene scale genenum(i), the number of genes selected by particle i. To unify the two indices into a maximization problem, genenum(i) is transformed by the following expression, where d is the dimension of the sample:
[Equation image not reproduced]
Step 3.2, take $f = (f_1, f_2)$ as the optimization objective of the multi-objective particle swarm algorithm.
Further, the step 4 comprises the following steps:
Step 4.1, randomly initialize the particles of the population and attach a parameter flag to each particle; the flag records for how long the particle has failed to evolve into a better particle;
Step 4.2, check whether the flag of each particle is smaller than a preset threshold T;
Step 4.3, if the flag is less than T, update the particle according to formulas (4) and (5), i.e. the conventional particle swarm update; if the flag is greater than T, update the particle with the improved Levy flight strategy according to formulas (6), (7), and (8), and reset the particle's flag to 0;
$v_i^{t+1} = w\,v_i^{t} + c_1 r_1 (x_{pb,i} - x_i^{t}) + c_2 r_2 (x_{gb,i} - x_i^{t})$ (4)
$x_i^{t+1} = x_i^{t} + v_i^{t+1}$ (5)
[Formulas (6), (7), and (8), the Levy flight update: equation images not reproduced]
Here u and v follow normal distributions, $u \sim N(0, \sigma_u^2)$ and $v \sim N(0, \sigma_v^2)$,
where $v_i^{t+1}$ is the velocity of particle i at the (t+1)-th iteration; $x_i^{t}$ is the position of particle i at the t-th iteration; $x_{pb,i}$ is the individual historical best position of particle i; $x_{gb,i}$ is the global best position for particle i; $w$ is the inertia weight, typically varied adaptively within [0.4, 0.9]; $c_1, c_2$ are acceleration constants; and $r_1, r_2$ are two random numbers in [0, 1]. The parameter $\alpha$ is usually set to 0.01 so that the step is not too aggressive and the particle does not easily jump out of the decision boundary; $\beta$ is set to 1.5. Note that when the step S is updated, the invention perturbs the conventional Levy flight formula: with a certain probability, S is multiplied by $(x_{gb,i} - x_i^{t})$, the global best position minus the current position of the particle. The purpose is that, when a particle's position is updated by Levy flight, the particle can move appropriately toward the global best particle $x_{gb,i}$, so that the random jump has a direction rather than following the Levy distribution exactly.
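The two update rules in step 4.3 might be sketched as below. The conventional update follows the standard PSO form described in the text. The Levy step uses Mantegna's algorithm, a common way to draw Levy-distributed steps from two normal variables u and v; whether formulas (6) to (8) match it exactly is an assumption. The probability `p_directed`, the clipping of positions to [0, 1], and all function names are hypothetical.

```python
import math
import numpy as np

def pso_step(x, v, x_pb, x_gb, w=0.7, c1=2.0, c2=2.0, rng=None):
    """Conventional PSO velocity and position update."""
    rng = np.random.default_rng() if rng is None else rng
    r1, r2 = rng.random(2)  # two random numbers in [0, 1)
    v_new = w * v + c1 * r1 * (x_pb - x) + c2 * r2 * (x_gb - x)
    return x + v_new, v_new

def mantegna_sigma(beta):
    """Standard deviation of u in Mantegna's algorithm (v has std 1)."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (num / den) ** (1 / beta)

def levy_move(x, x_gb, alpha=0.01, beta=1.5, p_directed=0.5, rng=None):
    """Levy flight move, sometimes directed toward the global best."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.normal(0.0, mantegna_sigma(beta), x.shape)
    v = rng.normal(0.0, 1.0, x.shape)
    s = u / np.abs(v) ** (1 / beta)      # heavy-tailed step S
    if rng.random() < p_directed:        # ASSUMED probability value
        s = s * (x_gb - x)               # bias the jump toward x_gb
    return np.clip(x + alpha * s, 0.0, 1.0)  # ASSUMED bounds handling
```

A particle whose flag is below the threshold T would call `pso_step`; a stagnant particle would call `levy_move` instead and have its flag reset to 0. The heavy tail of the Levy step gives occasional long jumps that can carry a stagnant particle out of a local optimum.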
Step 4.4, with $f = (f_1, f_2)$ as the objective function, evaluate whether the particle has evolved into a better solution by judging the dominance relationship between the newly generated solution and the particle's individual best. If the new particle dominates the individual best, update the particle's individual best and set its flag to 0. If the new particle is dominated by the individual best, increase the particle's flag by 1. If neither dominates the other, update the individual best with a certain probability (50%) and set the flag to 0; otherwise, increase the flag by 1.
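The flag logic of step 4.4 can be written compactly. This is a minimal sketch assuming both objectives are maximized, as stated in step 3.1; the function names are hypothetical, and the `rng` argument only needs a `random()` method.

```python
import random

def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def update_individual_best(new_pos, new_fit, best_pos, best_fit, flag,
                           rng=random):
    """Return the (position, fitness, flag) triple after one evaluation."""
    if dominates(new_fit, best_fit):
        return new_pos, new_fit, 0            # improved: reset the counter
    if dominates(best_fit, new_fit):
        return best_pos, best_fit, flag + 1   # worse: count one more stall
    if rng.random() < 0.5:                    # mutually non-dominated
        return new_pos, new_fit, 0
    return best_pos, best_fit, flag + 1
```

The flag therefore counts consecutive evaluations in which the particle's individual best was not replaced, which is the quantity compared against the threshold T in step 4.2.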
Step 4.5, compare the particles by dominance, add the non-dominated solutions to an external archive, and maintain the archive. Both archive maintenance and leader selection use the preference grid, as follows. A grid as shown in FIG. 1 is first built from the objective function values of the non-dominated solutions in the external archive; each solution is represented by a black point $Q_i$ in the grid, so that $Q = \{Q_1, Q_2, ..., Q_i, ..., Q_n\}$ denotes the set of non-dominated solutions, where n is the number of non-dominated solutions. A grid cell containing at least one particle is referred to here as an active grid.
For each $Q_i \in Q$, the weighted fitness value $\lambda_i$ of $Q_i$ is calculated according to formula (9), where $F_1, F_2$ are the fitness values of the two objectives; $\alpha$ is a preference weight in [0, 1] that the decision maker chooses according to the importance of $F_1$ and $F_2$ to the problem, and the invention sets $\alpha$ to 0.7; $\beta = 1 - \alpha$; num is the number of particles in the grid cell containing $Q_i$; and $\theta$ is the penalty coefficient, set to 0.05.
$\lambda_i = \alpha F_1 + \beta F_2 - \theta \cdot num$ (9)
When the leader particle is selected, the probability $P_i$ that $Q_i$ is selected is calculated according to formula (10); when a particle is to be deleted during maintenance of the external archive, the probability $P_i$ that $Q_i$ is selected is calculated according to formula (11). Here n is the total number of non-dominated solutions. A particle is then chosen by roulette-wheel selection, either as the leader particle or for deletion from the archive. Note that each $\lambda_i$ is used as an exponent of e, so that particles with larger $\lambda_i$ are more likely to be selected and the gap in selection probability between particles with large and small $\lambda_i$ is widened. Formula (9) shows that when the grid cell containing $Q_i$ holds many particles, the penalty term makes the fitness value $\lambda_i$ smaller. The selected solutions therefore have high classification accuracy and also lie in sparse regions of the grid, which improves the decision efficiency of the algorithm and saves computing resources.
[Formulas (10) and (11): equation images not reproduced]
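The preference-grid selection in step 4.5 might look like the following sketch. Formula (9) is implemented as stated. Formulas (10) and (11) are not reproduced in this text, so the probabilities here are assumptions: a softmax of lambda for leader selection, consistent with the remark that each lambda is used as an exponent of e, and a softmax of minus lambda for deletion. The grid resolution `n_div` and all function names are hypothetical.

```python
import numpy as np

def grid_cells(F, n_div=10):
    """Map each solution's objective vector to a grid cell index tuple."""
    F = np.asarray(F, dtype=float)
    lo, hi = F.min(axis=0), F.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)        # guard against zero width
    idx = np.minimum(((F - lo) / span * n_div).astype(int), n_div - 1)
    return [tuple(row) for row in idx]

def weighted_fitness(F, alpha=0.7, theta=0.05, n_div=10):
    """Formula (9): alpha*F1 + (1 - alpha)*F2 - theta*num."""
    F = np.asarray(F, dtype=float)
    cells = grid_cells(F, n_div)
    num = np.array([cells.count(c) for c in cells])  # crowding of each cell
    return alpha * F[:, 0] + (1.0 - alpha) * F[:, 1] - theta * num

def selection_probs(lam, for_deletion=False):
    """ASSUMED forms of (10)/(11): softmax of lam, or of -lam to delete."""
    z = -np.asarray(lam) if for_deletion else np.asarray(lam)
    e = np.exp(z - z.max())                          # numerically stable
    return e / e.sum()

def roulette(probs, rng):
    """Roulette-wheel selection: index drawn with the given probabilities."""
    return int(rng.choice(len(probs), p=probs))
```

With `alpha = 0.7` the accuracy objective dominates the weighted fitness, so leaders are drawn mostly from the high-accuracy end of the archive, while the `theta * num` penalty steers selection away from crowded cells.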
Step 4.6, determine whether the multi-objective particle swarm algorithm meets the termination condition; if so, output the result; if not, return to step 4.2.
Further, step 5 comprises the following steps:
Step 5.1, repeat the above operations until the fitness function reaches a certain threshold or the preset maximum number of iterations is reached; otherwise, return to step 4;
Step 5.2, each non-dominated particle in the archive at this point represents a finally selected subset of key genes for identifying the tumor.
Beneficial effects: tumor gene expression profile data are high-dimensional with small sample sizes; they contain variation and noise, and hide a large amount of useful information. The PSO algorithm used in conventional methods easily falls into local minima, so the selected gene subset is not optimal. The invention builds, by a weighting method, a grid that can describe decision preferences and uses it to maintain the archive and select leader particles, which improves the decision efficiency of the algorithm and saves computing resources. It also combines an improved Levy flight strategy with the multi-objective particle swarm algorithm, which improves the convergence of the algorithm on complex multi-objective optimization problems.
The invention provides a multi-objective particle swarm algorithm based on a preference grid and Levy flight (MOPSO-PAG-LF), which repeatedly searches, evaluates, and updates particles and maintains an external archive to obtain gene subsets with high classification accuracy and small size. Compared with conventional tumor key gene identification methods, the method can identify, through the improved multi-objective model, the key genes in the initial gene pool that distinguish two specific tumor subtypes.
Drawings
FIG. 1 is a schematic diagram of a preference grid of the present invention;
FIG. 2 is a block diagram of the architecture of the present invention;
Detailed Description
A tumor key gene identification method based on a multi-target particle swarm algorithm of preference grids and Levy flight comprises the steps of carrying out primary selection on original genes by utilizing classification information indexes, then utilizing GCS information to encode particles, and utilizing the multi-target particle swarm algorithm based on the preference grids and the Levy flight to search key tumor genes, wherein the method specifically comprises the following steps:
step 1, preprocessing gene expression profile data, including dividing an original data set into a training set and a testing set, and filtering the original gene expression profile data set by using a classification information index to obtain an initial gene pool;
step 2, calculating the gene category sensitivity information GCS value of each gene in the initial gene pool, and then coding the particles through the GCS value;
step 3, establishing a multi-objective optimization model by taking the classification accuracy of the gene subsets on the extreme learning machine ELM and the scale of the gene subsets as targets;
step 4, providing a multi-target particle swarm algorithm (MOPSO-PAG-LF) based on a preference grid and Levy flight, and continuously searching, evaluating and updating particles and maintaining an external archive by using the MOPSO-PAG-LF so as to obtain a gene subset with higher classification accuracy and smaller scale;
step 5, outputting the finally identified tumor key genes if the termination condition is met, otherwise, turning to step 4;
further, the step 1 comprises the following steps:
step 1.1, loading an original gene data set, and dividing a training set and a testing set according to the ratio of 2: 1;
step 1.2 according to the formula (1), calculating the classification information index of each gene, arranging the classification information index in a descending order, and selecting the first 400 genes to add into an initial gene pool.
Figure BDA0002210823720000071
wherein ,
Figure BDA0002210823720000079
and
Figure BDA00022108237200000710
indicates that the gene g is uprightMean of the expression levels on class (+) and negative class (-),
Figure BDA00022108237200000711
and
Figure BDA00022108237200000712
the standard deviations of the expression levels of gene g in positive (+) and negative (-) classes, respectively.
Further, the step 2 comprises the following steps:
step 2.1 calculating the GCS value of each gene in the primary gene pool according to the formula (2) and the formula (3), wherein the bigger the GCS value is, the bigger the contribution of the gene with the smaller GCS value to the classification is;
Figure BDA0002210823720000076
wherein XTrainingFor training the sample set, β sqIs the weight, w, of the s-th hidden layer node and the q-th output node of the ELM jsIs the weight of the jth input node and the s-th hidden layer node; hid(s) is the input to the s-th hidden layer node; n is a radical of gnlThe number of genes in an initial gene pool, g is an activation function of ELM, and the sigmoid function is taken.
And 2.2, encoding the particles, namely, firstly, carrying out descending order arrangement on all genes according to GCS values, randomly initializing the first 20 percent of genes to random numbers in [0, 1], initializing the rest 80 percent of genes to 0, indicating that the gene corresponding to a certain dimension is selected when the value of the position of the particle on the dimension is more than 0.5, and otherwise indicating that the gene corresponding to the dimension is not selected when the value is less than 0.5.
Further, the step 3 comprises the following steps:
step 3.1, setting evaluation indexes of the multi-target particle swarm algorithm, wherein the evaluation indexes comprise two indexes: accuracy and gene scale. f. of 1Is the accuracy acc (i), which is the ELM classification accuracy of the ith particle on the validation set, f 2For the gene scale genenum (i), the number of genes selected for particle i, to unify the two indices into a maximization problem, genenum (i) is changed to
Figure BDA0002210823720000078
d is the dimension of the sample.
Step 3.2 changing f to (f) 1,f 2) The method is used as an optimization target of the multi-target particle swarm algorithm.
Further, the step 4 comprises the following steps:
step 4.1, randomly initializing population particles, and adding a parameter flag to each particle, wherein the parameter is used for judging how long each particle has not evolved into a better particle;
step 4.2, whether the parameter flag of each particle is smaller than a preset threshold value T or not; (ii) a
4.3 if the value is less than T, evolving the particle according to the formulas (4) and (5), namely a conventional particle swarm algorithm formula, and if the value is more than T, evolving the particle according to the formulas (6) and (7) and (8) by using an improved Levy flight strategy on the particle, wherein the flag value of the particle is changed into 0;
Figure BDA0002210823720000081
Figure BDA0002210823720000083
Figure BDA0002210823720000084
here u and v follow a normal distribution:
Figure BDA0002210823720000085
and
Figure BDA0002210823720000086
wherein ,
Figure BDA0002210823720000087
is the velocity of particle i at the t +1 th iteration,
Figure BDA0002210823720000088
is the position of particle i at the t-th iteration, x pb,iFor the individual historical optimal position, x, of particle i gb,iFor the global optimal position of particle i, w is the inertial weight, typically [0.4, 0.9]]Adaptive change between c 1,c 2Is an acceleration constant, r 1,r 2Two are in [0, 1]]Generally speaking, parameter α is usually set to 0.01 to prevent it from stepping too aggressively and easily jumping out of the decision boundary, β is set to 1.5. Note that, when step S is updated, the present invention makes some perturbation to the conventional Lewy flight formula, where there is some probability of multiplying S by the global optimum particle x gb,iSubtracting the position of the current particle
Figure BDA0002210823720000089
The purpose of this is that the particle can be properly moved to the globally optimal particle x when the position of the particle is updated with the lave flight gb,iThe random jumps are directionally dependent, rather than perfectly aligned with the lave distribution.
Step 4.4 with f ═ f (f) 1,f 2) As a target function, evaluating whether the particle evolves into a better solution, judging the domination relationship between the newly generated solution and the individual optimal particle, if the new particle dominates the individual optimal particle, updating the individual optimal information of the particle and setting a parameter flag of the particle to 0; if the new particle is dominated by the individual optimal particle, the value of the attribute flag of the particle is increased by 1; if the new particle is independent of the individual optimal particle, updating the individual optimal information of the particle with a certain probability (50%) and setting the parameter flag of the particle to 0, otherwise setting the parameter flag of the particle to 0The attribute flag value is incremented by 1.
Step 4.5, dominant comparison is carried out on the particles, non-dominant solutions are added into an external archive, and maintenance is carried out on the external archive. When maintaining external archives and selecting leader particles, the invention is carried out in a mode of preferring grids, and specifically comprises the following steps: a grid as shown in FIG. 1 is first created from the values of the non-dominant solutions in the external archive on the objective function, each representing a black point Q in the grid iSo that Q is { Q ═ Q 1,Q 2,...,Q i,...,Q nDenotes the set of non-dominant solutions, n is the number of non-dominant solutions, and the grid of at least one particle in the grid is referred to herein as the active grid.
For Q iBelongs to Q, and calculates Q according to formula (9) iOf the weighted fitness value of, wherein F 1,F 2Is the fitness value of two targets, α is [0, 1]]Preference weights within, depending on F 1 and F2For the importance of the problem, the decision maker decides the parameter, and the invention sets α to 0.7, β is 1- α, and num is Q iThe number of particles in the grid, θ, is a penalty term, and is set to 0.05.
λ i=α*F 1+β*F 2-θ*num (9)
When the leader particle is selected, Q is calculated according to equation (10) iProbability of being selected P iWhen the particle to be deleted is maintained in the external archive, Q is calculated according to equation (11) iProbability of being selected P iWhere n is the total number of non-dominated solutions, and then a particle is selected as the leader particle or deleted from the archive using the roulette method. Note that here for each lambda iAre raised to the exponential power of e, the purpose of which is to let λ be iThe larger particles have larger probability to be selected, and the lambda is further enlarged iLarge particle and lambda iSmall probability of hits between particles. From λ iIt can be seen that when Q is iWhen the number of particles in the grid is large, the obtained fitness value lambda is iDue to the existence of the penalty term, the solution becomes smaller, and the solution selected in this way has higher scoreThe class accuracy rate can also make the solution sparse in the grid, greatly improving the decision efficiency of the algorithm and saving the expense of computing resources.
Figure BDA0002210823720000091
Figure BDA0002210823720000092
Step 4.6, judging whether the multi-target particle algorithm meets the termination condition, and if so, outputting a result; if not, turning to the step (4.2)
Further, the step 5 comprises the following steps:
step 5.1, repeating the above operations until the fitness function reaches a certain threshold or reaches a preset maximum iteration number, otherwise, returning to the step 4;
step 5.2, each non-dominated particle in the archive at this point represents a finally selected subset of key genes for identifying the tumor.
Existing methods use a fitness function with only a single-objective optimization scheme and lack good interpretability, and the genes they select do not identify tumors accurately enough. To address these problems, the invention identifies the key gene subset of a tumor with a multi-target particle swarm algorithm that combines Levy flight with a preference grid, so as to obtain a more effective key gene subset and thereby improve the accuracy of tumor identification.
The following is a brief description of the implementation of the present invention, taking tumor gene expression profile data as an example. This example uses the Brain cancer tumor expression profile data set, which contains 60 samples in two subtypes: 46 classic brain cancer samples (patients with classic brain cancer) and 14 desmoplastic brain cancer samples (patients with desmoplastic brain cancer). Each sample contains 7219 genes. The data set is available from http://linus.nci.nih.gov/~brb/DataArchive_New.html. Although the data set has only two classes, the expression levels of all genes in it are relatively close, so the key genes for identifying the tumor are difficult to obtain, and classifiers achieve low prediction accuracy on gene subsets selected by traditional gene identification methods. On this data set, the specific implementation steps of the invention are as follows:
As shown in fig. 2, the tumor key gene identification method based on the multi-target particle swarm algorithm with a preference grid and Levy flight first performs a preliminary selection of the original genes using the classification information index, then encodes the particles using GCS information, and finally searches for key tumor genes with the multi-target particle swarm algorithm based on the preference grid and Levy flight. The steps are as follows:
(1) The raw data is loaded and the data set is divided into a training set and a test set at a ratio of 2:1, giving 40 training samples and 20 test samples. On the training set, 400 genes are preliminarily screened out using an improved classification information index method (Han F, Sun W, Ling Q-H (2014) A Novel Strategy for Gene Selection of Microarray Data Based on Gene-to-Class Sensitivity Information. PLoS ONE 9(5): e97530. doi:10.1371/journal.pone.0097530) to form the initial candidate gene pool.
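Step (1) can be sketched as below. Formula (1) is not reproduced in this text, so the score used here, |mean+ - mean-| / (std+ + std-), is only an assumed stand-in built from the same class means and standard deviations; the random (non-stratified) split and all function names are likewise assumptions.

```python
import random
from statistics import mean, pstdev

def split_two_to_one(n_samples, seed=0):
    """Shuffle sample indices and split them 2:1 into train and test."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = (2 * n_samples) // 3
    return idx[:n_train], idx[n_train:]

def top_k_genes(X, y, k=400, eps=1e-12):
    """Rank genes by an assumed signal-to-noise score and keep the best k.

    X is a list of samples, each a list of expression values;
    y holds 1 for the positive class and 0 for the negative class.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for g in range(len(X[0])):
        p = [row[g] for row in pos]
        n = [row[g] for row in neg]
        # eps avoids division by zero for genes that are constant in both classes
        scores.append(abs(mean(p) - mean(n)) / (pstdev(p) + pstdev(n) + eps))
    # Descending order of score, truncated to the k best gene indices
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:k]
```

With 60 samples, `split_two_to_one(60)` yields 40 training and 20 test indices, matching the sizes stated in step (1).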
(2) The GCS value of each gene in the initial gene pool is calculated (Han F, Sun W, Ling Q-H (2014) A Novel Strategy for Gene Selection of Microarray Data Based on Gene-to-Class Sensitivity Information. PLoS ONE 9(5): e97530. doi:10.1371/journal.pone.0097530) and the genes are sorted in descending order of GCS value. For each particle, the dimensions corresponding to the first 20% of the genes are initialized to random numbers in [0, 1], and those of the remaining 80% are initialized to 0. A position value greater than 0.5 in a given dimension indicates that the gene corresponding to that dimension is selected; a value less than 0.5 indicates that it is not selected.
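The particle encoding of step (2) can be illustrated with a short sketch. It assumes the GCS values are already available as a list, and it keeps each position in the original gene order rather than the sorted order; both function names are hypothetical.

```python
import random

def init_particle(gcs, top_fraction=0.2, rng=random):
    """Initialize one particle from GCS values.

    Dimensions of the top 20% of genes by GCS get a random value in [0, 1);
    all other dimensions start at 0.
    """
    d = len(gcs)
    order = sorted(range(d), key=lambda g: -gcs[g])  # descending GCS
    n_top = round(top_fraction * d)
    position = [0.0] * d
    for g in order[:n_top]:
        position[g] = rng.random()
    return position

def selected_genes(position, threshold=0.5):
    """Decode a particle: a gene is selected when its position exceeds 0.5."""
    return [g for g, value in enumerate(position) if value > threshold]
```

With a pool of 400 genes, at most 80 dimensions are non-zero after initialization, so each initial particle selects roughly 40 high-GCS genes on average.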
(3) The evaluation indices of the multi-target particle swarm algorithm are set. There are two: accuracy and gene scale. f_1 is the accuracy acc(i), the ELM classification accuracy of the i-th particle on the validation set. f_2 is the gene scale genenum(i), the number of genes selected by particle i. To unify the two indices into a maximization problem, genenum(i) is changed to
Figure BDA0002210823720000111
d is the dimension of the sample.
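The two objectives and their comparison can be sketched as follows. The transformed gene-scale formula is not reproduced in this text, so the form f_2 = 1 - genenum(i)/d used here is only an assumption that turns "fewer genes" into a quantity to maximize; the accuracy is passed in rather than computed by an ELM, and the function names are hypothetical.

```python
def objectives(accuracy, n_selected, d):
    """Return (f_1, f_2), both to be maximized.

    f_2 = 1 - n_selected / d is an assumed transformation of genenum(i).
    """
    return (accuracy, 1.0 - n_selected / d)

def dominates(a, b):
    """Pareto dominance for maximization: a is no worse in every objective
    and strictly better in at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def non_dominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]
```

Under this assumed form, a particle that reaches the same accuracy with fewer genes dominates one that uses more, which is the trade-off the external archive is meant to capture.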
(4) Key tumor genes are selected from the initial gene pool using the multi-target particle swarm algorithm based on the preference grid and Levy flight, specifically as follows:
① The population is initialized according to step 2. The parameter flag of each particle is set to 0, the threshold T to 10, the population size to 50, the maximum number of iterations to 50, the external archive size to 50 (consistent with the population size), and the preference weight α to 0.7; the inertia weight w decreases linearly from 0.9 to 0.4, and the acceleration constants c_1 and c_2 are set to 1.5.
② If the parameter flag of a particle is less than T, the particle is evolved according to equations (4) and (5); if it is greater than T, the particle is evolved according to equations (6), (7) and (8) using the Levy flight strategy.
③ The fitness values of each particle are calculated according to the evaluation objectives of step 3, and the historical optimal position, the global optimal position and the parameter flag of each particle are updated.
④ The particles are compared by dominance, non-dominated solutions are added to the external archive, and the external archive is maintained with the preference grid strategy according to step 4.5.
⑤ If the preset maximum number of iterations (50 in this example) has not been reached, the process returns to step ②; otherwise the result is output, and all non-dominated particles in the archive represent the finally identified sets of key brain cancer tumor genes.
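The switch between the two update rules in step ② can be sketched as below. Equations (4) to (8) are not reproduced in this text, so the sketch assumes the standard particle swarm update and Mantegna's method for the Levy step; the values of `beta`, `scale` and `p_directed`, the clamping of positions to [0, 1], and the use of the ordinary update when flag equals T are all assumptions.

```python
import math
import random

def inertia_weight(t, max_iter, w_max=0.9, w_min=0.4):
    """Linearly decrease w from w_max at t = 0 to w_min at t = max_iter - 1."""
    return w_max - (w_max - w_min) * t / (max_iter - 1)

def clamp(value, lo=0.0, hi=1.0):
    return max(lo, min(hi, value))

def pso_update(x, v, pbest, gbest, w, c1=1.5, c2=1.5, rng=random):
    """Standard velocity and position update, applied per dimension."""
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        vel = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)
        new_v.append(vel)
        new_x.append(clamp(xi + vel))
    return new_x, new_v

def levy_update(x, gbest, beta=1.5, scale=0.01, p_directed=0.5, rng=random):
    """Levy step via Mantegna's method: S = u / |v|^(1/beta), u and v normal."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta
                  * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    # With some probability the step is scaled by (gbest - x), which biases
    # the jump toward the global best instead of leaving it purely random.
    directed = rng.random() < p_directed
    new_x = []
    for xi, gi in zip(x, gbest):
        u = rng.gauss(0.0, sigma_u)
        v = rng.gauss(0.0, 1.0)
        step = scale * u / (abs(v) ** (1 / beta) + 1e-12)
        if directed:
            step *= gi - xi
        new_x.append(clamp(xi + step))
    return new_x

def evolve(x, v, flag, T, pbest, gbest, w, rng=random):
    """Stagnant particles (flag > T) take a Levy step and reset flag to 0."""
    if flag > T:
        return levy_update(x, gbest, rng=rng), v, 0
    new_x, new_v = pso_update(x, v, pbest, gbest, w, rng=rng)
    return new_x, new_v, flag
```

The heavy tail of the Levy step occasionally produces a long jump, which is what lets a particle that has stagnated for more than T iterations leave its current region.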
Table 1 shows the classification accuracy of ELM on the gene set identified in this embodiment: on 3 key genes, the 5-fold cross-validation accuracy and the test accuracy of ELM reach 86.97% and 81.22%, respectively. By comparison, on the 6-gene optimal subset selected by the Kmeans-GCSI-MBPSO-ELM method (Han F, Sun W, Ling Q-H (2014) A Novel Strategy for Gene Selection of Microarray Data Based on Gene-to-Class Sensitivity Information. PLoS ONE 9(5): e97530. doi:10.1371/journal.pone.0097530), the 5-fold cross-validation accuracy and the test accuracy of ELM are 88.63% and 80.40%, respectively. The invention thus reaches a higher test accuracy with half as many genes, at a slightly lower cross-validation accuracy, which indicates that it can identify key genes associated with the tumor while using fewer genes.
TABLE 1 Classification accuracy of ELM on different gene subsets selected on the brain cancer data set according to the invention
Figure BDA0002210823720000112
Table 2 lists the 10 key genes for identifying brain cancer that were selected most frequently when the method of the invention was run 1000 times on the brain cancer tumor expression profile data. Tables 1 and 2 show that the gene subsets selected by the method on the brain cancer data set (Brain) are small, and that the genes numbered 5931, 4413 and 18 are selected with high frequency and recur in the selected key gene subsets.
TABLE 2 identification of the 30 genes with the highest frequency on the brain cancer tumor expression profile dataset according to the invention
Figure BDA0002210823720000121
In the multi-objective optimization model, a grid that describes decision preference is constructed by a weighting method to maintain the archive and select leader particles, which improves the decision efficiency of the algorithm and saves computing resources. In addition, an improved Levy flight strategy is combined with the multi-target particle swarm algorithm, which improves the convergence of the algorithm on complex multi-objective optimization problems. Compared with traditional tumor key gene identification methods, the multi-target model can quickly and efficiently identify, within the initial gene pool, key gene subsets that are smaller and have better classification performance.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (7)

1. A tumor key gene identification method based on a multi-target particle swarm algorithm of preference grids and Levy flight is characterized by comprising the following steps:
step 1, preprocessing gene expression profile data, including dividing an original data set into a training set and a testing set, and filtering the original gene expression profile data set by using a classification information index to obtain an initial gene pool;
step 2, calculating the gene category sensitivity information GCS value of each gene in the initial gene pool, and then coding the particles through the GCS value;
step 3, establishing a multi-objective optimization model by taking the classification accuracy of the gene subsets on the extreme learning machine ELM and the scale of the gene subsets as targets;
step 4, providing a multi-target particle swarm algorithm (MOPSO-PAG-LF) based on a preference grid and Levy flight, and continuously searching, evaluating and updating particles and maintaining an external archive by using the MOPSO-PAG-LF to obtain a gene subset;
step 5, outputting the finally identified tumor key genes if the termination condition is met, and otherwise returning to step 4.
2. The method for identifying key genes of tumors based on the multi-target particle swarm algorithm of the preference grid and the Levy flight according to claim 1, wherein the step 1 comprises the following steps:
step 1.1, loading an original gene data set, and dividing a training set and a testing set according to the ratio of 2: 1;
step 1.2 according to the formula (1), calculating the classification information index of each gene, arranging the classification information index in a descending order, and selecting the first 400 genes to add into an initial gene pool.
Figure FDA0002210823710000011
wherein ,
Figure FDA0002210823710000012
and represents the mean of the expression levels of gene g in positive (+) and negative (-) classes, and
Figure FDA0002210823710000015
the standard deviations of the expression levels of gene g in positive (+) and negative (-) classes, respectively.
3. The method for identifying key genes of tumors based on the multi-target particle swarm algorithm of the preference grid and the Levy flight according to claim 1, wherein the step 2 comprises the following steps:
step 2.1, calculating the GCS value of each gene in the initial gene pool according to formula (2) and formula (3), wherein the larger the GCS value of a gene, the greater its contribution to the classification;
Figure FDA0002210823710000016
Figure FDA0002210823710000021
wherein X_Training is the training sample set, β_sq is the weight between the s-th hidden layer node and the q-th output node of the ELM, w_js is the weight between the j-th input node and the s-th hidden layer node, hid(s) is the input to the s-th hidden layer node, N_gnl is the number of genes in the initial gene pool, and g is the activation function of the ELM, taken to be the sigmoid function;
step 2.2, encoding the particles: first sorting all genes in descending order of GCS value, initializing the dimensions of the first 20% of the genes to random numbers in [0, 1] and those of the remaining 80% of the genes to 0, wherein a position value greater than 0.5 in a given dimension indicates that the gene corresponding to that dimension is selected, and a value less than 0.5 indicates that the gene corresponding to that dimension is not selected.
4. The method for identifying key genes of tumors based on the multi-target particle swarm algorithm of the preference grid and the Levy flight according to claim 1, wherein the step 3 comprises the following steps:
step 3.1, setting the evaluation indices of the multi-target particle swarm algorithm, of which there are two: accuracy and gene scale. f_1 is the accuracy acc(i), the ELM classification accuracy of the i-th particle on the validation set; f_2 is the gene scale genenum(i), the number of genes selected by particle i. To unify the two indices into a maximization problem, genenum(i) is changed to
Figure FDA0002210823710000022
d is the dimension of the sample;
step 3.2, taking f = (f_1, f_2) as the optimization objective of the multi-target particle swarm algorithm.
5. The method for identifying key genes of tumors based on the multi-target particle swarm algorithm of the preference grid and the Levy flight according to claim 1, wherein the step 4 comprises the following steps:
step 4.1, randomly initializing the population particles and adding a parameter flag to each particle, wherein the flag records how many iterations the particle has gone without evolving into a better particle;
step 4.2, judging whether the parameter flag of each particle is smaller than a preset threshold value T;
step 4.3, if flag is less than T, evolving the particle according to formulas (4) and (5), namely the conventional particle swarm update formulas; if flag is greater than T, evolving the particle according to formulas (6), (7) and (8) using an improved Levy flight strategy, after which the flag value of the particle is reset to 0;
Figure FDA0002210823710000023
Figure FDA0002210823710000024
Figure FDA0002210823710000025
Figure FDA0002210823710000031
here u and v are random variables that follow normal distributions:
Figure FDA0002210823710000032
and
wherein ,
Figure FDA0002210823710000034
is the velocity of particle i at the t +1 th iteration,
Figure FDA0002210823710000035
is the position of particle i at the t-th iteration, x_pb,i is the individual historical optimal position of particle i, x_gb,i is the global optimal position for particle i, w is the inertia weight, c_1 and c_2 are acceleration constants, r_1 and r_2 are two random numbers varying in [0, 1], S is the update step length of the Levy flight, and α and β are parameters. When the step length S is updated, the method perturbs the conventional Levy flight formula: with a certain probability, S is multiplied by the difference between the global optimal particle x_gb,i and the position of the current particle
Figure FDA0002210823710000036
The purpose is that, when the position of a particle is updated with the Levy flight, the particle can make a random jump that is suitably directed toward the global optimal particle x_gb,i, rather than one that follows the Levy distribution exactly;
step 4.4, taking f = (f_1, f_2) as the objective function, evaluating whether the particle has evolved into a better solution by judging the dominance relationship between the newly generated solution and the individual optimal particle: if the new particle dominates the individual optimal particle, updating the individual optimal information of the particle and setting its parameter flag to 0; if the new particle is dominated by the individual optimal particle, increasing the flag value of the particle by 1; if the new particle and the individual optimal particle do not dominate each other, updating the individual optimal information of the particle with a certain probability and setting its flag to 0, and otherwise increasing the flag value of the particle by 1;
step 4.5, comparing the particles by dominance, adding the non-dominated solutions to an external archive, and maintaining the external archive and selecting a leader particle by means of a preference grid, specifically: first, a grid is created from the objective function values of the non-dominated solutions in the external archive, each non-dominated solution being represented by a black point Q_i in the grid, and Q = {Q_1, Q_2, ..., Q_i, ..., Q_n} denotes the set of all non-dominated solutions, where n is the number of non-dominated solutions; a grid cell containing at least one particle is referred to herein as an active grid;
for each Q_i ∈ Q, calculating the weighted fitness value λ_i of Q_i according to formula (9), wherein F_1 and F_2 are the fitness values of the two objectives, α is a preference weight in [0, 1] that the decision maker sets according to the relative importance of F_1 and F_2 to the problem, β = 1 - α, num is the number of particles in the grid cell where Q_i is located, and θ is the penalty coefficient, set to 0.05;
λ_i = α*F_1 + β*F_2 - θ*num (9)
when the leader particle is selected, calculating the probability P_i that Q_i is selected according to equation (10), and when a particle to be deleted is chosen during maintenance of the external archive, calculating P_i according to equation (11), where n is the total number of non-dominated solutions, and then selecting a particle as the leader particle, or deleting it from the archive, by the roulette method; note that each λ_i is used as the exponent of e, so that particles with larger λ_i have a larger probability of being selected and the gap in selection probability between particles with large λ_i and particles with small λ_i is widened; formula (9) shows that when the grid cell containing Q_i holds many particles, the fitness value λ_i becomes smaller because of the penalty term, so that the selected solutions not only have high classification accuracy but also remain sparse in the grid, which improves the decision efficiency of the algorithm and saves computing resources;
Figure FDA0002210823710000041
Figure FDA0002210823710000042
step 4.6, judging whether the multi-target particle algorithm meets the termination condition, and if so, outputting a result; if not, the process goes to step (4.2).
6. The method for identifying key genes of tumors based on the multi-target particle swarm algorithm of the preference grid and the Levy flight according to claim 1, wherein w is an inertia weight and is adaptively changed within [0.4, 0.9].
7. The method for identifying key genes of tumors based on the multi-target particle swarm algorithm of the preference grid and the Levy flight according to claim 1, wherein the step 5 comprises the following steps:
step 5.1, repeating the above operations until the fitness function reaches a certain threshold or reaches a preset maximum iteration number, otherwise, returning to the step 4;
step 5.2, each non-dominated particle in the archive at this point represents a finally selected subset of key genes for identifying the tumor.
CN201910903327.2A 2019-09-23 2019-09-23 Tumor key gene identification method based on preference grid and Lewy flight multi-target particle swarm algorithm Active CN110782950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910903327.2A CN110782950B (en) 2019-09-23 2019-09-23 Tumor key gene identification method based on preference grid and Lewy flight multi-target particle swarm algorithm

Publications (2)

Publication Number Publication Date
CN110782950A true CN110782950A (en) 2020-02-11
CN110782950B CN110782950B (en) 2023-09-26

Family

ID=69383779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910903327.2A Active CN110782950B (en) 2019-09-23 2019-09-23 Tumor key gene identification method based on preference grid and Lewy flight multi-target particle swarm algorithm

Country Status (1)

Country Link
CN (1) CN110782950B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111398837A (en) * 2020-04-01 2020-07-10 重庆大学 Vehicle battery health state estimation method based on data driving

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106548041A (en) * 2016-12-08 2017-03-29 江苏大学 Tumor key gene identification method based on prior information and parallel binary particle swarm optimization
CN109344956A (en) * 2018-12-05 2019-02-15 重庆邮电大学 SVM parameter optimization based on an improved Levy flight particle swarm algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
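Ding Xiaolin et al.: "Multi-objective particle swarm algorithm based on adaptive network and dynamic crowding distance and its application" *
Ling Qinghua et al.: "An improved gene selection method based on prior information and particle swarm optimization" *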

Also Published As

Publication number Publication date
CN110782950B (en) 2023-09-26

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant