CN117238379A

CN117238379A - Storage medium storing gene selection method program

Info

Publication number: CN117238379A
Application number: CN202311331114.XA
Authority: CN
Inventors: 陈慧灵
Original assignee: Wenzhou University
Current assignee: Wenzhou University
Priority date: 2020-09-17
Filing date: 2020-09-17
Publication date: 2023-12-15
Also published as: CN112215259A; CN112215259B

Abstract

The present invention provides a storage medium storing a program for a gene selection method used in gene feature selection. The method comprises: acquiring a training set and a test set from a gene microarray data set, and determining an initial population; binary-coding each individual of the current population using a transfer function; calculating the fitness values of the current population, and updating the relevant parameters of the sea squirt (salp swarm) strategy and the moth fire suppression (moth-flame) strategy; setting the relevant parameters of a sine and cosine optimization algorithm, and updating the population using the iteration formula of the sine and cosine optimization algorithm; updating the population obtained by the sine and cosine optimization algorithm through the sea squirt strategy, the moth fire suppression strategy and the reverse (opposition-based) learning strategy in turn to obtain three populations; selecting the next-generation population by greedy selection; and, if the maximum number of iterations is reached, ending the loop and outputting the optimal solution, otherwise continuing to iterate. The invention can screen out, more accurately and more efficiently, the gene features that contribute most to the class, and reduces the detection cost.

Description

Storage medium storing gene selection method program

This patent application is a divisional application of Chinese patent 202010982171.4, entitled "Gene selection method and apparatus", which is incorporated herein by reference in its entirety.

Technical Field

The invention relates to a gene selection technology in the field of data preprocessing, in particular to a gene selection method.

Background

Today, people are affected by a wide range of diseases, in particular cancers and leukemias, and medical testing helps to determine the symptoms and causes of these diseases. With the rapid development of biomedical and health technologies, the volume of bioinformatics and clinical data, especially molecular biology experimental data and genetic data, has grown at an unprecedented rate. The onset and progression of many diseases have now been studied at the molecular level and a large number of pathogenic genes have been found, but the mechanisms by which these genes arise and are regulated remain poorly understood. Analysis of microarray gene expression data and protein expression data can be used to obtain information on physiological activity at the molecular level.

However, microarray data contain thousands of genes, some of which may be irrelevant to the mining task or redundant with one another. In fact, only a few genes are actually related to sample classification. The redundant gene features can cause the detection model to overfit and to take too long to train, leading to erroneous detection results and even to delays that cost patients their lives.

In recent years, researchers have applied various machine learning algorithms and statistical methods, such as artificial neural networks and evolutionary algorithms, to the analysis of gene expression data. Because gene data are high-dimensional and noisy, intelligent algorithms are becoming increasingly important in microarray data analysis. Mining the smallest gene subset greatly reduces the computational burden and the "noise" caused by irrelevant genes, and may even allow simple detection rules to be extracted so that accurate detection is possible without any classifier. It also simplifies gene expression testing to a few genes rather than thousands, which can significantly reduce the cost of detection, and it calls for further investigation of the biological relationships that may exist between these few genes and disease progression and treatment. Gene selection methods based on the particle swarm algorithm and the bat algorithm have achieved good classification results.

The sine and cosine optimization algorithm (Sine Cosine Algorithm, SCA) is an emerging swarm-intelligence heuristic algorithm that uses two mathematical formulas based on the sine and cosine functions to carry out continuous exploration and exploitation over the entire search space. However, when applied to gene screening, SCA still has considerable room for improvement in convergence speed and in the accuracy of the optimal solution, and it has difficulty maintaining an effective balance between exploration and exploitation.

Therefore, it is necessary to provide a gene selection method, which can remove noise from gene expression data more accurately and more efficiently, and reduce detection cost.

Disclosure of Invention

Based on an in-depth study of the characteristics of gene microarray data, the invention provides a gene selection method that addresses the above problems, so as to denoise gene expression data more accurately and more efficiently.

Specifically, according to an aspect of the present invention, an embodiment of the present invention provides a gene selection method comprising the steps of:

Step S1, acquiring a training set and a test set from a gene microarray data set, and determining an initial population;

Step S2, binary-coding each feature value of each individual of the current population using a transfer function;

Step S3, calculating the fitness values of the current population, and updating the relevant parameters of the goblet sea squirt strategy and the moth fire suppression strategy;

Step S4, setting the relevant parameters of a sine and cosine optimization algorithm, and updating the population using the iteration formula of the sine and cosine optimization algorithm;

Step S5, updating the population obtained by the sine and cosine optimization algorithm through the goblet sea squirt strategy, the moth fire suppression strategy and the reverse learning strategy in turn to obtain three populations;

Step S6, selecting the next-generation population by greedy selection;

Step S7, if the maximum number of iterations is reached, ending the loop and outputting the optimal solution; otherwise, continuing to iterate until the iterative computation ends.

According to yet another aspect of the present invention, in step S1, an initial training sample population $X^t$ is set based on the training sample set obtained by feature extraction, indexed by i = 1, 2, ..., N and j = 1, 2, ..., D, with t = 0, where N is the number of training sample individuals, D is the number of feature values of each training sample, $X^t$ denotes the population at the t-th iteration, each entry of $X^t$ is the j-th feature value of the i-th individual at iteration t, and t is the current iteration number, taking values in [0, 1000].

According to yet another aspect of the invention, in step S2, each feature value of each individual in the population $X^t$ is converted into a binary-coded value by formula (1) and formula (2);

where the formulas involve the j-th feature value of the i-th individual generated in the t-th iteration, a random number r in [0, 1], the resulting j-th binary-coded value of the i-th individual generated in the t-th iteration, and the sigmoid function sig.
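The images of formulas (1) and (2) are not reproduced in this text. As a minimal sketch of sigmoid-based binary coding, the following assumes the standard logistic function for sig and assumes that a bit is set to 1 when the random number r is smaller than sig of the feature value; the threshold direction and the names `sig` and `binarize` are illustrative assumptions, not taken from the patent.

```python
import math
import random


def sig(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x| (assumed form)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def binarize(individual, rng=random):
    """Map each continuous feature value to a bit.

    Assumption: the bit is 1 when a random number r in [0, 1) is below sig(x).
    """
    return [1 if rng.random() < sig(x) else 0 for x in individual]
```

A bit equal to 1 would then mean that the corresponding gene is kept in the candidate feature subset.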

According to a further aspect of the invention, in step S3, the fitness values of the population $X^t$ are calculated using formula (3) and formula (4), and the optimal solution used in the goblet sea squirt strategy and the flames $F^t$ and moths $M^t$ involved in the moth fire suppression strategy are updated, where the flames $F^t$ are the population formed by re-ordering $X^t$ by fitness value from smallest to largest, and the moths $M^t$ are $X^t$ itself;

where $Fitness_i$ denotes the fitness value of the i-th individual, $acc_i$ denotes the classification accuracy, $w_A$ denotes the weight of the classification accuracy, $w_F$ denotes the weight of the number of selected features, and R is the number of entries equal to 1 in the binary individual, i.e., the length of the selected feature subset of the gene data; D is the dimension of the individual, i.e., the total number of attributes in the gene data set; cc is the number of correctly classified samples and uc is the number of incorrectly classified samples.
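The images of formulas (3) and (4) are not reproduced in this text. One form that agrees with the definitions above, and with the fact that smaller fitness values are preferred later in the method, is a weighted sum of the classification error and the selected-feature ratio, with accuracy computed as cc / (cc + uc). The sketch below is written under that assumption; the function names are hypothetical and the weights are left as parameters because the text gives no values for them.

```python
def accuracy(cc: int, uc: int) -> float:
    """Classification accuracy from correctly (cc) and incorrectly (uc) classified counts."""
    return cc / (cc + uc)


def fitness(bits, cc: int, uc: int, w_a: float, w_f: float) -> float:
    """Assumed fitness: w_A * (1 - acc) + w_F * (R / D); lower is better.

    bits is the binary-coded individual, R its number of ones, D its length.
    """
    r = sum(bits)
    d = len(bits)
    return w_a * (1.0 - accuracy(cc, uc)) + w_f * (r / d)
```

Under this form, an individual improves its fitness either by classifying more samples correctly or by selecting fewer genes.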

According to a further aspect of the present invention, in step S4, the parameters $r_1$, $r_2$, $r_3$ and $r_4$ of the sine and cosine optimization algorithm are set, and a new population is obtained by updating with formula (5):

where $r_1$ is a function that decreases linearly over [0, 2], $r_2$ is a random number in [0, 2π], and $r_3$ and $r_4$ are random numbers in [0, 1]; the quantity updated by formula (5) is the j-th feature value of the i-th individual generated at iteration t+1; the update starts from the j-th binary-coded value of the i-th individual at iteration t, obtained from formulas (1) and (2); and $P_j^t$ is the j-th binary-coded value of the individual with the minimum fitness value, as computed by formulas (3) and (4), in the binary-coded population at iteration t.
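The image of formula (5) is not reproduced in this text. The sketch below assumes the usual sine cosine algorithm update, in which each value moves by r1·sin(r2)·|r3·P − x| when r4 < 0.5 and by r1·cos(r2)·|r3·P − x| otherwise, with r1 decreasing linearly from 2 to 0. The switching threshold 0.5, the constant a = 2 and the function name are assumptions; the parameter ranges follow the definitions above.

```python
import math
import random


def sca_update(population, best, t, t_max, a=2.0, rng=random):
    """One sine cosine update of every value of every individual (assumed form).

    population: list of individuals (binary-coded values at iteration t)
    best: individual with the minimum fitness value (P^t)
    """
    r1 = a * (1.0 - t / t_max)  # decreases linearly from a to 0
    new_population = []
    for individual in population:
        row = []
        for x, p in zip(individual, best):
            r2 = rng.uniform(0.0, 2.0 * math.pi)
            r3 = rng.random()
            r4 = rng.random()
            step = r1 * abs(r3 * p - x)
            wave = math.sin(r2) if r4 < 0.5 else math.cos(r2)
            row.append(x + step * wave)
        new_population.append(row)
    return new_population
```

Because r1 shrinks with t, early iterations take large steps around the best individual and late iterations take small ones.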

According to another aspect of the present invention, in step S5, updating the population obtained by the sine and cosine optimization algorithm through the goblet sea squirt, moth fire suppression and reverse learning strategies, respectively, to obtain three populations specifically includes:

First, in the goblet sea squirt update strategy, the population $X^{t+1}$ obtained by formula (5) is transposed, denoted $(X^{t+1})^T$. Specifically: when i < N/2, formula (6) is used to obtain the first half of the transposed population; when i > N/2 and i < N+1, formula (7) is used to obtain the second half of the transposed population; the two halves are combined into the transposed population, which is transposed back to obtain the new population $S^{t+1}$, where N is the number of training sample individuals;

where t and $t_{max}$ are the current iteration number and the maximum number of iterations, respectively; $c_2$ and $c_3$ are random numbers in [0, 1]; $P_j^t$ is the j-th binary-coded value of the individual with the minimum fitness value, as computed by formulas (3) and (4), in the binary-coded population at iteration t; $ub_j$ is the upper bound of the j-th dimension and $lb_j$ is the lower bound of the j-th dimension; and the remaining terms are, in order, the transposed value in the j-th dimension of the current population $X^{t+1}$ at iteration t+1, the transposed value in the j-th dimension of the (i-1)-th individual of the current population $X^{t+1}$ at iteration t+1, and the transposed value in the j-th dimension of the i-th individual obtained by the goblet sea squirt update strategy at iteration t+1;
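The images of formulas (6) and (7) are not reproduced in this text. The sketch below assumes the usual salp swarm update: individuals in the first half move around the best individual by a step c1·((ub − lb)·c2 + lb), with c1 = 2·exp(−(4t/t_max)²) and the sign chosen by c3, and individuals in the second half move to the midpoint between themselves and the preceding individual. The form of c1, the 0.5 threshold on c3, the handling of i = N/2 and the omission of the transposition (a storage detail) are all assumptions.

```python
import math
import random


def salp_update(population, best, t, t_max, lb, ub, rng=random):
    """Leader/follower update in the style of the salp swarm algorithm (assumed form)."""
    n = len(population)
    c1 = 2.0 * math.exp(-((4.0 * t / t_max) ** 2))
    new_population = []
    for i, individual in enumerate(population):
        if i < n / 2:
            # First half: move around the best individual.
            row = []
            for j, p in enumerate(best):
                c2, c3 = rng.random(), rng.random()
                step = c1 * ((ub[j] - lb[j]) * c2 + lb[j])
                row.append(p + step if c3 < 0.5 else p - step)
        else:
            # Second half: midpoint with the preceding, already updated individual.
            previous = new_population[i - 1]
            row = [(x + y) / 2.0 for x, y in zip(individual, previous)]
        new_population.append(row)
    return new_population
```

The sketch does not clip the result to [lb, ub]; whether the patent does so is not stated in the text.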

Secondly, the moth fire suppression update strategy follows the navigation behavior of moths: the flames serve as guides for the moths' search in the search space, each moth updates its current position along a spiral, and the population $M^{t+1}$ is updated using formulas (8)-(10);

where the formulas involve the j-th dimension value of the i-th moth individual at the (t+1)-th iteration, the j-th dimension value of the i-th flame individual at the (t+1)-th iteration, and the distance between the flame and the moth at the (t+1)-th iteration; b is a constant coefficient, k is a random number from -1 to 1, n denotes the maximum number of flames, t denotes the current iteration number, $t_{max}$ denotes the maximum number of iterations, i denotes the current number of flames, and round denotes rounding;
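The images of formulas (8)-(10) are not reproduced in this text. The sketch below assumes the usual moth-flame forms: the distance is the absolute difference between flame and moth, the moth moves to distance·exp(b·k)·cos(2πk) + flame, and the number of flames shrinks as round(n − t·(n − 1)/t_max). The default b = 1, the pairing of surplus moths with the last remaining flame and the function names are assumptions.

```python
import math
import random


def flame_count(n: int, t: int, t_max: int) -> int:
    """Number of flames kept at iteration t (rounds half up)."""
    return math.floor(n - t * (n - 1) / t_max + 0.5)


def moth_flame_update(moths, flames, t, t_max, b=1.0, rng=random):
    """Spiral update of each moth around its flame (assumed form).

    flames is assumed to be sorted by fitness value from smallest to largest.
    """
    n_flames = flame_count(len(flames), t, t_max)
    new_moths = []
    for i, moth in enumerate(moths):
        # Moths beyond the current flame count share the last remaining flame.
        flame = flames[min(i, n_flames - 1)]
        row = []
        for m, f in zip(moth, flame):
            k = rng.uniform(-1.0, 1.0)
            distance = abs(f - m)
            row.append(distance * math.exp(b * k) * math.cos(2.0 * math.pi * k) + f)
        new_moths.append(row)
    return new_moths
```

As the flame count drops toward 1, all moths spiral around the best flame, which shifts the search from exploration to exploitation.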

Finally, the reverse learning strategy generates, for each original solution, a reverse solution that is symmetric to it; the reverse population $O^{t+1}$ of the current population is obtained using formula (11);

where $ub_j$ is the upper bound of the j-th dimension, $lb_j$ is the lower bound of the j-th dimension, and the remaining term is the j-th dimension value of the i-th individual at the (t+1)-th iteration.
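The image of formula (11) is not reproduced in this text. A reverse solution that is symmetric to the original within the bounds is commonly written as ub_j + lb_j − x; the sketch below assumes that form, and the function name is hypothetical.

```python
def opposite_population(population, lb, ub):
    """Reflect every value about the midpoint of its [lb_j, ub_j] interval (assumed form)."""
    return [
        [u + l - x for x, l, u in zip(individual, lb, ub)]
        for individual in population
    ]
```

For example, with bounds [0, 1] a value of 0.25 maps to 0.75, and applying the function twice returns the original value.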

According to a further aspect of the invention, in step S6, the three populations $S^{t+1}$, $M^{t+1}$ and $O^{t+1}$ obtained in step S5 are merged; their fitness values are computed according to formula (3) and formula (4) and sorted from smallest to largest, and the N individuals with the smallest fitness values are selected as the next-generation population $X^{t+1}$, where N is the number of training sample individuals;
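The greedy selection of step S6 can be sketched as a merge, a sort and a truncation. The function name and the `fitness_fn` callable are hypothetical; `fitness_fn` is assumed to return the fitness value of one individual, with smaller values being better.

```python
def greedy_select(s_pop, m_pop, o_pop, fitness_fn, n):
    """Merge the three candidate populations and keep the n individuals with the smallest fitness."""
    merged = s_pop + m_pop + o_pop
    return sorted(merged, key=fitness_fn)[:n]
```

Because the three populations are pooled before sorting, the next generation may draw all of its individuals from a single strategy if that strategy produced the best candidates.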

According to yet another aspect of the present invention, in step S7, if the maximum number of iterations is reached, the loop ends and the optimal solution is output; otherwise the iteration count is incremented by 1 and the method returns to step S2.

The embodiment of the invention also provides a gene selection device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the gene selection method when executing the computer program.

The invention also provides a denoising method for the gene expression data, which is characterized in that the gene expression data irrelevant to sample classification is removed by adopting the gene selection method.

The embodiment of the invention has the following beneficial effects:

In view of the characteristics of gene microarray data, the goblet sea squirt strategy, the moth fire suppression strategy and the reverse learning strategy are combined into the SCA algorithm. This greatly reduces the computational burden and noise caused by irrelevant genes and can even allow simple detection rules to be extracted; it also simplifies gene expression testing and can significantly reduce the detection cost.

Drawings

To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort, and such drawings remain within the scope of the invention.

Fig. 1 is a flowchart of a gene selection method according to an embodiment of the invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present invention more apparent.

According to a preferred embodiment of the present invention, as shown in FIG. 1, there is provided a gene selection (screening) method comprising the steps of:

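Step S1, acquiring a training set and a test set from a gene microarray data set, and determining an initial population;

Step S2, binary-coding each feature value of each individual of the current population using a transfer function;

Step S3, calculating the fitness values of the current population, and updating the relevant parameters of the goblet sea squirt strategy and the moth fire suppression strategy;

Step S4, setting the relevant parameters of a sine and cosine optimization algorithm, and updating the population using the iteration formula of the sine and cosine optimization algorithm;

Step S5, updating the population obtained by the sine and cosine optimization algorithm through the goblet sea squirt strategy, the moth fire suppression strategy and the reverse learning strategy in turn to obtain three populations;

Step S6, selecting the next-generation population by greedy selection;

Step S7, if the maximum number of iterations is reached, ending the loop and outputting the optimal solution; otherwise, continuing to iterate until the iterative computation ends.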

Advantageously, in view of the characteristics of gene data, combining the sine and cosine optimization algorithm with the sea squirt strategy, the moth fire suppression strategy and the reverse learning strategy greatly reduces the computational burden and noise caused by irrelevant genes, thereby simplifying gene expression testing and significantly reducing the detection cost.

According to still another preferred embodiment of the present invention, as shown in FIG. 1, there is provided a gene selection method according to the first embodiment of the present invention, the method comprising the following steps.

At present, microarray data are generally obtained by DNA microarray technology. Analysis of microarray gene expression data and protein expression data can be used to obtain information on physiological activity at the molecular level and is widely used in the biomedical field. A microarray data set contains relatively few samples but thousands of genes, so the error estimate depends strongly on the samples. When the error is not estimated properly, the classification method may be applied inappropriately. To overcome this problem, the classification error is estimated by a validation method called K-fold cross-validation. When computing accuracy during classification, the invention uses 10-fold cross-validation: the data set is divided equally into 10 parts, one part is used as the test set and the other nine as the training set, and the results of the 10 rounds are averaged. The advantage of 10-fold cross-validation is that the training set and test set of each round are fixed, and errors can be reduced.
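The 10-fold procedure described above can be sketched as follows. The interleaved assignment of samples to folds (without shuffling) and the `evaluate` callable, which is assumed to train on the given training indices and return the accuracy on the test indices, are illustrative assumptions.

```python
def k_fold_indices(n_samples: int, k: int = 10):
    """Yield (train_indices, test_indices) for each of k folds of near-equal size."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(n_samples: int, evaluate, k: int = 10) -> float:
    """Average the accuracy returned by evaluate(train, test) over the k rounds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Each sample appears in exactly one test fold, so every sample contributes once to the averaged accuracy.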

Step S1, initializing, from the training sample set extracted from the gene microarray data set, the training sample population $X^t$, indexed by i = 1, 2, ..., N and j = 1, 2, ..., D, with t = 0, where N is the number of training sample individuals, D is the number of dimensions of each training sample, $X^t$ denotes the population at the t-th iteration, each entry of $X^t$ is the j-th feature value of the i-th individual at iteration t, and t is the current iteration number, taking values in [0, 1000].

Step S2, designing a K-Nearest Neighbor (KNN) classifier according to the training sample set, and classifying;

Specifically, for example, a KNN classifier is designed based on the sample set and used for classification, and each feature value of each individual in the population $X^t$ is converted into a binary-coded value by formula (1) and formula (2);

Step S3, obtaining the fitness value of each individual in the current population by formulas (4) and (5), sorting the fitness values from smallest to largest, and updating the optimal solution used in the goblet sea squirt strategy and the flames $F^t$ and moths $M^t$ involved in the moth fire suppression strategy, where the flames $F^t$ are the population formed by re-ordering the population $X^t$ obtained above by fitness value from smallest to largest, and the moths $M^t$ are $X^t$ itself.

The KNN classification method determines the class of a sample to be tested from the distances between that sample and the training samples, generally by selecting the K training samples closest to it. When K = 1, the sample to be tested takes the class of its single nearest neighbor; more generally, for K ≥ 1, it takes the class that is most common among its K nearest samples. The fitness function is defined on the basis of the classification accuracy of the KNN classifier. The KNN algorithm comprises the following steps:

First, the distance is computed. Given a test sample, the distance between it and each object in the training data is calculated. The distance function determines which samples in the training set are the K nearest neighbors of the sample to be tested. The invention uses the Euclidean distance, which for D-dimensional vectors is calculated as follows:

$$d(test_i, train_j) = \sqrt{\sum_{k=1}^{D} \left(test_{i,k} - train_{j,k}\right)^2}$$

where $test_i$ denotes the i-th test vector, $train_j$ denotes the j-th training vector, $test_{i,k}$ denotes the k-th dimension value of the i-th test vector, and $train_{j,k}$ denotes the k-th dimension value of the j-th training vector.

Next, the neighbors are found. The K training samples closest to the test sample are taken as its neighbors.

Finally, the class is determined. Among the classes to which the K neighbors belong, the class with the largest share is assigned to the test sample.
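The three KNN steps above (distance, neighbors, majority class) can be sketched in a few lines. The function names are hypothetical, and ties between classes are resolved by whichever class `Counter.most_common` lists first, which is an assumption since the text does not say how ties are broken.

```python
import math
from collections import Counter


def euclidean(a, b) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, sample, k: int = 1):
    """Return the most common class among the k training samples nearest to sample."""
    order = sorted(range(len(train_x)), key=lambda j: euclidean(sample, train_x[j]))
    votes = Counter(train_y[j] for j in order[:k])
    return votes.most_common(1)[0][0]
```

In the gene selection setting, the vectors passed to the classifier would contain only the genes whose bit is 1 in the binary-coded individual.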

Gene selection can be regarded as a multi-objective optimization problem with two conflicting objectives: selecting the smallest number of genes and maximizing the classification accuracy. An objective function is therefore needed that combines the two objectives into a single function. The specific fitness function is as follows:

where $Fitness_i$ denotes the fitness value of the i-th individual, $acc_i$ denotes the classification accuracy, $w_A$ denotes the weight of the classification accuracy, $w_F$ denotes the weight of the number of selected features, and R is the number of entries equal to 1 in the binary individual, i.e., the length of the selected feature subset of the gene data. D is the dimension of the individual, i.e., the total number of attributes in the gene data set; cc is the number of correctly classified samples and uc is the number of incorrectly classified samples.

Step S4, setting the relevant parameters of the sine and cosine optimization algorithm, and obtaining the population updated by the sine and cosine optimization algorithm;

Specifically, for example, the parameters $r_1$, $r_2$, $r_3$ and $r_4$ of the sine and cosine optimization algorithm are set, and a new population is obtained by updating with formula (6):

where $r_1$ is a function that decreases linearly over [0, 2], $r_2$ is a random number in [0, 2π], and $r_3$ and $r_4$ are random numbers in [0, 1]; the quantity updated by formula (6) is the j-th feature value of the i-th individual generated at iteration t+1; the update starts from the j-th binary-coded value of the i-th individual at iteration t, obtained from formulas (1) and (2); and $P_j^t$ is the j-th binary-coded value of the individual with the minimum fitness value, as computed by formulas (4) and (5), in the binary-coded population at iteration t.

Step S5, updating the population obtained by the sine and cosine optimization algorithm through the goblet sea squirt, moth fire suppression and reverse learning strategies, respectively, to obtain three populations;

Specifically, for example, first, in the goblet sea squirt update strategy, the population $X^{t+1}$ obtained by formula (6) is transposed, denoted $(X^{t+1})^T$. Specifically: when i < N/2, formula (7) is used to obtain the first half of the transposed population; when i > N/2 and i < N+1, formula (8) is used to obtain the second half of the transposed population; the two halves are combined into the transposed population, which is transposed back to obtain the new population $S^{t+1}$, where N is the number of training sample individuals;

where t and $t_{max}$ are the current iteration number and the maximum number of iterations, respectively; $c_2$ and $c_3$ are random numbers in [0, 1]; $P_j^t$ is the j-th binary-coded value of the individual with the minimum fitness value, as computed by formulas (4) and (5), in the binary-coded population at iteration t; $ub_j$ is the upper bound of the j-th dimension and $lb_j$ is the lower bound of the j-th dimension; and the remaining terms are, in order, the transposed value in the j-th dimension of the current population $X^{t+1}$ at iteration t+1, the transposed value in the j-th dimension of the (i-1)-th individual of the current population $X^{t+1}$ at iteration t+1, and the transposed value in the j-th dimension of the i-th individual obtained by the goblet sea squirt update strategy at iteration t+1;

Secondly, the moth fire suppression update strategy follows the navigation behavior of moths: the flames serve as guides for the moths' search in the search space, each moth updates its current position along a spiral, and the population $M^{t+1}$ is updated using formulas (9)-(11);

where the formulas involve the j-th dimension value of the i-th moth individual at the (t+1)-th iteration, the j-th dimension value of the i-th flame individual at the (t+1)-th iteration, and the distance between the flame and the moth at the (t+1)-th iteration; b is a constant coefficient, k is a random number from -1 to 1, n denotes the maximum number of flames, t denotes the current iteration number, $t_{max}$ denotes the maximum number of iterations, i denotes the current number of flames, and round denotes rounding;

Step S6, selecting the optimal population by greedy selection;

Specifically, for example, the fitness values of the three populations $S^{t+1}$, $M^{t+1}$ and $O^{t+1}$ obtained in step S5 are computed according to formula (4) and formula (5) and sorted from smallest to largest, and the N individuals with the smallest fitness values are selected as the next-generation population $X^{t+1}$, where N is the number of training sample individuals.

Step S7, if the maximum number of iterations is reached, ending the loop and outputting the optimal solution; otherwise, incrementing the iteration count by 1 and returning to step S2.
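The control flow of steps S4 to S7 can be sketched as a loop over injected callables. Everything here is a hypothetical scaffold: `fitness_fn` is assumed to binary-code an individual and score it (steps S2 and S3), `sca_step` stands for the sine cosine update, and each entry of `strategies` stands for one of the three update strategies. The sketch applies each strategy to the sine cosine output independently, which is one reading of step S5.

```python
def run_selection(population, t_max, fitness_fn, sca_step, strategies):
    """Iterate S4 to S6 for t_max rounds and return the best individual found (S7)."""
    n = len(population)
    best = min(population, key=fitness_fn)
    for t in range(t_max):
        # S4: sine cosine update of the current population.
        base = sca_step(population, best, t)
        # S5: each strategy produces one candidate population.
        merged = []
        for strategy in strategies:
            merged.extend(strategy(base, best, t))
        # S6: greedy selection of the n individuals with the smallest fitness.
        population = sorted(merged, key=fitness_fn)[:n]
        best = min([best] + population, key=fitness_fn)
    # S7: the maximum number of iterations has been reached.
    return best
```

Keeping the previous best in the comparison means the returned solution never gets worse from one iteration to the next.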

According to a preferred embodiment of the present invention, corresponding to the gene selection method provided in the first embodiment, a second embodiment of the present invention further provides a gene selection apparatus comprising a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the steps of the gene selection method of the first embodiment. The process by which the processor executes the computer program in the second embodiment is the same as the execution of the steps of the gene selection method provided in the first embodiment; for details, refer to the foregoing description.

According to a preferred embodiment of the present invention, there is also provided a method for denoising gene expression data, characterized in that the aforementioned gene selection method is employed to remove gene expression data unrelated to classification of samples.

The embodiment of the invention has the following beneficial effects:

Those of ordinary skill in the art will appreciate that all or a portion of the steps in implementing the methods of the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc.

The foregoing disclosure is illustrative of the present invention and is not to be construed as limiting the scope of the invention, which is defined by the appended claims.

Claims

1. A computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, the computer program being adapted to be executed by a processor to implement the steps of a gene selection method, the gene selection method comprising the steps of:

S6, selecting the next-generation population by greedy selection;

S7, if the maximum number of iterations is reached, ending the loop and outputting the optimal solution; otherwise, continuing to iterate until the iterative computation ends;

wherein in step S2, each feature value of each individual in the population $X^t$ is converted into a binary-coded value by formula (1) and formula (2);

wherein the formulas involve the j-th feature value of the i-th individual generated in the t-th iteration, a random number r in [0, 1], the resulting j-th binary-coded value of the i-th individual generated in the t-th iteration, and the sigmoid function sig;

in step S3, the fitness values of the population $X^t$ are calculated using formula (3) and formula (4), and the optimal solution used in the goblet sea squirt strategy and the flames $F^t$ and moths $M^t$ involved in the moth fire suppression strategy are updated, wherein the flames $F^t$ are the population formed by re-ordering $X^t$ by fitness value from smallest to largest, and the moths $M^t$ are $X^t$ itself;

wherein $Fitness_i$ denotes the fitness value of the i-th individual, $acc_i$ denotes the classification accuracy, $w_A$ denotes the weight of the classification accuracy, $w_F$ denotes the weight of the number of selected features, and R is the number of entries equal to 1 in the binary individual, i.e., the length of the selected feature subset of the gene data; D is the dimension of the individual, i.e., the total number of attributes in the gene data set, cc is the number of correctly classified samples, and uc is the number of incorrectly classified samples;

in step S4, the parameters $r_1$, $r_2$, $r_3$ and $r_4$ of the sine and cosine optimization algorithm are set, and a new population is obtained by updating with formula (5):

wherein $r_1$ is a function that decreases linearly over [0, 2], $r_2$ is a random number in [0, 2π], and $r_3$ and $r_4$ are random numbers in [0, 1]; the quantity updated by formula (5) is the j-th feature value of the i-th individual generated at iteration t+1; the update starts from the j-th binary-coded value of the i-th individual at iteration t, obtained from formulas (1) and (2); and $P_j^t$ denotes the j-th binary-coded value of the individual with the minimum fitness value, as computed by formulas (3) and (4), in the binary-coded population at iteration t.

2. The computer-readable storage medium of claim 1, wherein in step S1 the training sample population $X^t$ is initialized based on the training sample set obtained by feature extraction, wherein N is the number of training sample individuals, D is the number of dimensions of each training sample, $X^t$ denotes the population at the t-th iteration, each entry of $X^t$ is the j-th feature value of the i-th individual at iteration t, and t is the current iteration number, taking values in [0, 1000].

3. The computer-readable storage medium according to claim 1, wherein in the step S5, the step of updating the population updated by the sine-cosine optimization algorithm by the goblet sea squirt, the moth fire suppression and the reverse learning strategy, respectively, to obtain three populations specifically includes:

wherein t and $t_{max}$ are the current iteration number and the maximum number of iterations, respectively; $c_2$ and $c_3$ are random numbers in [0, 1]; $P_j^t$ is the j-th binary-coded value of the individual with the minimum fitness value, as computed by formulas (3) and (4), in the binary-coded population at iteration t; $ub_j$ is the upper bound of the j-th dimension and $lb_j$ is the lower bound of the j-th dimension; and the remaining terms are, in order, the transposed value in the j-th dimension of the current population $X^{t+1}$ at iteration t+1, the transposed value in the j-th dimension of the (i-1)-th individual of the current population $X^{t+1}$ at iteration t+1, and the transposed value in the j-th dimension of the i-th individual obtained by the goblet sea squirt update strategy at iteration t+1;

4. The computer-readable storage medium of claim 3, wherein the three populations $S^{t+1}$, $M^{t+1}$ and $O^{t+1}$ obtained in step S5 are merged, their fitness values are computed according to formula (3) and formula (4) and sorted from smallest to largest, and the N individuals with the smallest fitness values are selected as the next-generation population $X^{t+1}$, where N is the number of training sample individuals.

5. The computer-readable storage medium of claim 4, wherein if the maximum number of iterations is reached, the loop is ended and an optimal solution is output, otherwise the number of iterations is increased by 1, and step S2 is returned.

6. A noise removal device for gene expression data, comprising a memory and a processor, said memory storing a computer program, characterized in that said processor uses the gene selection method according to any one of claims 1 to 5 to remove gene expression data unrelated to sample classification when executing said computer program.