CN117238379A - Storage medium storing gene selection method program - Google Patents

Publication number
CN117238379A
Authority
CN
China
Prior art keywords
population
value
iteration
formula
individual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311331114.XA
Other languages
Chinese (zh)
Inventor
陈慧灵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou University
Original Assignee
Wenzhou University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The present invention provides a storage medium storing a program of a gene selection method for gene feature selection. The method comprises: acquiring a training set and a test set from a gene microarray data set, and determining an initial population; binary-coding each individual of the current population by means of a conversion function; calculating the fitness values of the current population, and updating the relevant parameters of the salp swarm (goblet sea squirt) strategy and the moth-flame (moth fire suppression) strategy; setting the relevant parameters of the sine and cosine optimization algorithm, and updating the population with the iteration formula of the sine and cosine optimization algorithm; sequentially updating the population obtained by the sine and cosine optimization algorithm through the salp swarm, moth-flame, and opposition-based (reverse) learning strategies to obtain three populations; selecting the next-generation population by greedy selection; and, if the maximum number of iterations is reached, ending the loop and outputting the optimal solution, otherwise continuing to iterate until the iterative computation ends. The invention can screen out, more accurately and more efficiently, the gene features that contribute most to the class labels, and reduces the detection cost.

Description

Storage medium storing gene selection method program
This patent application is a divisional application of Chinese patent 202010982171.4, entitled "Gene selection method and apparatus", which is incorporated herein by reference in its entirety.
Technical Field
The invention relates to a gene selection technology in the field of data preprocessing, in particular to a gene selection method.
Background
In today's highly competitive world, people are affected by various diseases, in particular cancers and leukemias, and medical testing helps to determine the symptoms and causes of these diseases. With the rapid development of technologies in the biomedical and health fields, the volume of bioinformatics and clinical medical data, especially molecular biology experimental data and genetic data, has grown at an unprecedented rate. The onset and progression of many diseases have now been studied at the molecular level and a large number of pathogenic genes have been found, but the mechanisms by which these genes arise and are regulated remain poorly understood. Analysis of microarray gene expression data and protein expression data can be used to obtain information on physiological activity at the molecular level.
However, microarray data contain thousands of genes, some of which may be irrelevant to the mining task or mutually redundant. In fact, only a few genes are actually related to sample classification. The redundant gene features may cause the detection model to overfit and to take too long to train, leading to erroneous detection results and even to delays that cost patients their lives.
In recent years, researchers have analyzed microarray data with various machine learning algorithms and statistical methods, such as artificial neural networks and evolutionary algorithms. Because gene data are high-dimensional and noisy, intelligent algorithms are becoming increasingly important in microarray data analysis. Mining the smallest gene subset greatly reduces the computational burden and the "noise" caused by irrelevant genes, and may even allow simple detection rules to be extracted so that accurate detection is possible without any classifier. It also simplifies gene expression testing to a few genes rather than thousands, which can significantly reduce the cost of detection, and it calls for further investigation of the biological relationships that may exist between these few genes and disease progression and treatment. Gene selection methods based on the particle swarm algorithm and the bat algorithm have achieved good classification results.
The sine and cosine optimization algorithm (Sine Cosine Algorithm, SCA) is an emerging heuristic swarm intelligence algorithm that uses two mathematical formulas based on the sine and cosine functions to carry out continuous exploration and exploitation over the entire search space. However, in gene screening SCA still has considerable room for improvement in convergence speed and in the accuracy of the converged optimal solution, and it has difficulty maintaining an effective balance between exploration and exploitation.
Therefore, it is necessary to provide a gene selection method that can remove noise from gene expression data more accurately and more efficiently and reduce the detection cost.
Disclosure of Invention
Based on an in-depth study of the characteristics of gene microarray data, the invention provides a gene selection method that addresses the above problems, so as to denoise gene expression data more accurately and more efficiently.
Specifically, according to an aspect of the present invention, an embodiment of the present invention provides a gene selection method comprising the steps of:
Step S1, acquiring a training set and a test set from a gene microarray data set, and determining an initial population;
Step S2, binary-coding each feature value of each individual of the current population by means of a conversion function;
Step S3, calculating the fitness values of the current population, and updating the relevant parameters of the salp swarm (goblet sea squirt) strategy and the moth-flame (moth fire suppression) strategy;
Step S4, setting the relevant parameters of the sine and cosine optimization algorithm, and updating the population with the iteration formula of the sine and cosine optimization algorithm;
Step S5, sequentially updating the population obtained by the sine and cosine optimization algorithm through the salp swarm, moth-flame, and opposition-based (reverse) learning strategies to obtain three populations;
Step S6, selecting the next-generation population by greedy selection;
Step S7, if the maximum number of iterations is reached, ending the loop and outputting the optimal solution; otherwise, continuing to iterate until the iterative computation ends.
According to yet another aspect of the present invention, in step S1, an initial training sample population X^t is set based on the training sample set obtained by feature extraction, with individuals indexed by i = 1, 2, ..., N, feature values indexed by j = 1, 2, ..., D, and t = 0, where N is the number of training sample individuals, D is the number of feature values (dimensions) of each training sample, X^t denotes the population obtained at the t-th iteration, each entry of X^t is the j-th feature value of the i-th individual at the t-th iteration, and t denotes the current iteration number, with a value range of [0, 1000].
According to yet another aspect of the invention, in step S2, each feature value of each individual in population X^t is converted into a binary-coded value by formula (1) and formula (2);
wherein the quantity being converted is the j-th feature value of the i-th individual generated in the t-th iteration, r is a random number in [0, 1], the result is the j-th binary-coded value of the i-th individual generated in the t-th iteration, and sig denotes the sigmoid function.
According to a further aspect of the invention, in step S3, the fitness values of population X^t are calculated using formula (3) and formula (4), and the optimal solution used in the salp swarm strategy and the flames F^t and moths M^t involved in the moth-flame strategy are updated, where the flames F^t are the population obtained by re-ordering X^t in ascending order of fitness value, and the moths M^t are X^t itself;
wherein Fitness_i denotes the fitness value of the i-th individual, acc_i denotes its classification accuracy, w_A denotes the weight of the classification accuracy, w_F denotes the weight of the number of selected features, and R is the number of entries equal to '1' in the binary individual, i.e. the length of the selected feature subset of the gene data; D is the dimension of the individual, i.e. the total number of attributes in the gene data set; cc denotes the number of correctly classified samples; and uc denotes the number of incorrectly classified samples.
According to a further aspect of the present invention, in step S4, the relevant parameters r_1, r_2, r_3 and r_4 of the sine and cosine optimization algorithm are set, and a new population is obtained by updating with formula (5):
wherein r_1 is a function decreasing linearly over [0, 2], r_2 is a random number in [0, 2π], and r_3 and r_4 are random numbers in [0, 1]; formula (5) yields the j-th feature value of the i-th individual at iteration t+1 from the j-th binary-coded value of the i-th individual at iteration t, obtained from formulas (1) and (2), and from the j-th binary-coded value of the individual having the minimum fitness value in the binary-coded population at iteration t, as evaluated with formulas (3) and (4).
According to another aspect of the present invention, in step S5, updating the population produced by the sine and cosine optimization algorithm through the salp swarm, moth-flame, and opposition-based learning strategies, respectively, to obtain three populations specifically comprises:
first, the salp swarm update strategy transposes the population X^{t+1} obtained by formula (5), denoted (X^{t+1})^T; specifically, when i < N/2, formula (6) is used to update the first half of the transposed population; when i > N/2 and i < N+1, formula (7) is used to update the second half of the transposed population; the two halves are combined into the transposed population, which is transposed back to give the new population S^{t+1}, where N is the number of training sample individuals;
wherein t and t_max are the current iteration number and the maximum iteration number, respectively; c_2 and c_3 are random numbers in [0, 1]; the best-solution term is the j-th binary-coded value of the individual having the minimum fitness value in the binary-coded population at iteration t, as evaluated with formulas (3) and (4); ub_j is the upper bound of the j-th dimension and lb_j is the lower bound of the j-th dimension; the inputs are the values in the j-th dimension of the i-th individual and of the (i-1)-th individual of the transposed current population X^{t+1} at iteration t+1; and the result is the transposed value of the i-th individual in the j-th dimension obtained with the salp swarm update strategy at iteration t+1;
secondly, the moth-flame update strategy follows the navigation behavior of moths: each flame serves as a beacon that guides a moth through the search space, the moth updates its current position along a spiral, and the population M^{t+1} is updated using formulas (8)-(10);
wherein formulas (8)-(10) involve the j-th dimension value of the i-th moth individual at iteration t+1, the j-th dimension value of the i-th flame individual at iteration t+1, and the distance between the flame and the moth at iteration t+1; b is a constant coefficient; k is a random number in [-1, 1]; n denotes the maximum number of flames; t denotes the current iteration number; t_max denotes the maximum number of iterations; i denotes the number of the current flame; and round denotes rounding;
finally, the opposition-based (reverse) learning strategy generates, for each original solution, an opposite solution by symmetry; the opposite population O^{t+1} of the current population is obtained using formula (11);
wherein ub_j is the upper bound of the j-th dimension, lb_j is the lower bound of the j-th dimension, and the quantity being reflected is the j-th dimension value of the i-th individual at iteration t+1.
According to a further aspect of the invention, in step S6, the three populations S^{t+1}, M^{t+1} and O^{t+1} obtained in step S5 are merged, their fitness values are computed according to formula (3) and formula (4) and sorted in ascending order, and the N individuals with the smallest fitness are selected as the next-generation population X^{t+1}, where N is the number of training sample individuals;
according to yet another aspect of the present invention, in step S7, if the maximum number of iterations is reached, the loop is ended and the optimal solution is output; otherwise the iteration number is increased by 1 and the method returns to step S2.
The embodiment of the invention also provides a gene selection device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the gene selection method when executing the computer program.
The invention also provides a denoising method for the gene expression data, which is characterized in that the gene expression data irrelevant to sample classification is removed by adopting the gene selection method.
The embodiment of the invention has the following beneficial effects:
in view of the characteristics of gene microarray data, the salp swarm (goblet sea squirt) strategy, the moth-flame strategy, and the opposition-based learning strategy are combined into the SCA algorithm, so that the computational burden and the noise caused by irrelevant genes are greatly reduced and even simple detection rules can be extracted; at the same time, gene expression testing is simplified and the detection cost can be significantly reduced.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are required in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that it is within the scope of the invention to one skilled in the art to obtain other drawings from these drawings without inventive faculty.
Fig. 1 is a flowchart of a gene selection method according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present invention more apparent.
According to a preferred embodiment of the present invention, as shown in FIG. 1, there is provided a gene selection (screening) method comprising the steps of:
Step S1, acquiring a training set and a test set from a gene microarray data set, and determining an initial population;
Step S2, binary-coding each feature value of each individual of the current population by means of a conversion function;
Step S3, calculating the fitness values of the current population, and updating the relevant parameters of the salp swarm (goblet sea squirt) strategy and the moth-flame (moth fire suppression) strategy;
Step S4, setting the relevant parameters of the sine and cosine optimization algorithm, and updating the population with the iteration formula of the sine and cosine optimization algorithm;
Step S5, sequentially updating the population obtained by the sine and cosine optimization algorithm through the salp swarm, moth-flame, and opposition-based (reverse) learning strategies to obtain three populations;
Step S6, selecting the next-generation population by greedy selection;
Step S7, if the maximum number of iterations is reached, ending the loop and outputting the optimal solution; otherwise, continuing to iterate until the iterative computation ends.
Advantageously, in view of the characteristics of gene data, combining the salp swarm strategy, the moth-flame strategy, and the opposition-based learning strategy with the sine and cosine optimization algorithm greatly reduces the computational burden and the noise caused by irrelevant genes, so that gene expression testing is simplified and the detection cost can be significantly reduced.
According to still another preferred embodiment of the present invention, as shown in FIG. 1, there is provided a gene selection method according to the first embodiment of the present invention, the method comprising the following steps.
At present, microarray data are generally obtained by DNA microarray technology. Analysis of microarray gene expression data and protein expression data can be used to obtain information on physiological activity at the molecular level and is widely used in the biomedical field. A microarray data set contains relatively few samples but thousands of genes, so error estimation depends heavily on the samples; when the error is not estimated properly, the classification method may be applied inappropriately. To overcome this problem, the classification error is estimated by a validation method called K-fold cross-validation. When computing accuracy during classification, the invention uses 10-fold cross-validation: the data set is divided into 10 equal parts, one part serves as the test set and the other nine serve as the training set, and the results of the 10 rounds are averaged. The advantage of 10-fold cross-validation is that the training set and test set of each round are well defined, which reduces the estimation error.
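By way of illustration only, the 10-fold procedure described above may be sketched in Python as follows. The function names, the shuffling seed, and the `evaluate` callback (which stands in for training and testing a classifier on one split) are hypothetical and are not part of the disclosure.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Split the indices 0..n_samples-1 into k folds of nearly equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)           # assumption: shuffle before splitting
    return [idx[f::k] for f in range(k)]       # fold f takes every k-th index

def cross_val_accuracy(evaluate, n_samples, k=10, seed=0):
    """Average evaluate(train_idx, test_idx) over the k rounds.

    `evaluate` is a hypothetical callback that returns an accuracy in [0, 1].
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for f, test_idx in enumerate(folds):
        # The other k-1 folds together form the training set of this round.
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k
```

Each sample appears in exactly one test fold, so every sample is tested once over the 10 rounds.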
Step S1, a training sample set is extracted from the gene microarray data set and an initial training sample population X^t is set, with individuals indexed by i = 1, 2, ..., N, feature values indexed by j = 1, 2, ..., D, and t = 0, where N is the number of training sample individuals, D is the number of dimensions of each training sample, X^t denotes the population obtained at the t-th iteration, each entry of X^t is the j-th feature value of the i-th individual at the t-th iteration, and t denotes the current iteration number, with a value range of [0, 1000].
Step S2, designing a K-Nearest Neighbor (KNN) classifier according to the training sample set, and classifying;
specifically, for example, a KNN classifier is designed and used for classification based on the sample set, and each feature value of each individual in population X^t is converted into a binary-coded value by formula (1) and formula (2);
wherein the quantity being converted is the j-th feature value of the i-th individual generated in the t-th iteration, r is a random number in [0, 1], the result is the j-th binary-coded value of the i-th individual generated in the t-th iteration, and sig denotes the sigmoid function.
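By way of illustration only, the conversion of step S2 may be sketched in Python as follows. The sketch assumes the commonly used sigmoid transfer rule, in which a bit is set to 1 when the random number r is smaller than sig(x); the exact form of formulas (1) and (2) is not reproduced here, and the function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    """Numerically stable sigmoid, sig(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def binarize(individual, rng=random):
    """Map each real feature value to 0 or 1 (assumed rule: 1 when r < sig(x))."""
    return [1 if rng.random() < sigmoid(x) else 0 for x in individual]
```

In the resulting binary vector, a 1 in position j means that gene j is kept in the candidate feature subset.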
Step S3, obtaining the fitness values of the individuals in the current population by formulas (4) and (5), sorting them in ascending order, and updating the optimal solution used in the salp swarm strategy and the flames F^t and moths M^t involved in the moth-flame strategy, where the flames F^t are the population obtained by re-ordering X^t in ascending order of fitness value, and the moths M^t are X^t itself.
The KNN classification method assigns a test sample to a class according to its distances to the training samples, generally using the K training samples closest to the test sample. When K = 1, the test sample takes the class of its single nearest training sample; when K > 1, the test sample takes the class that is most common among its K nearest training samples. The fitness function used here is defined on the classification accuracy of the KNN classifier. The KNN algorithm comprises the following steps:
First, the distances are computed. Given a test sample, the distance between it and each object in the training data is calculated. The distance function determines which samples in the training set are the K neighbors of the test sample. The invention uses the Euclidean distance, calculated as follows:
d(test_i, train_j) = \sqrt{\sum_{k} (test_{i,k} - train_{j,k})^2}    (3)
where test_i denotes the i-th test vector, train_j denotes the j-th training vector, test_{i,k} denotes the k-th dimension value of the i-th test vector, and train_{j,k} denotes the k-th dimension value of the j-th training vector.
Next, the neighbors are found. The K training samples closest to the test sample are taken as its neighbors.
Finally, the class is determined. Among the classes to which the K neighbors belong, the class with the largest share is taken as the class of the test sample.
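The three KNN steps above can be sketched in Python as follows. This is a minimal illustration with hypothetical function names; ties between equally distant neighbors are broken by training-set order, which is an assumption of the sketch.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, test_vec, k=1):
    """Return the majority class among the k training samples nearest to test_vec."""
    # Step 1: compute the distance to every training sample and rank by it.
    order = sorted(range(len(train_x)),
                   key=lambda j: euclidean(test_vec, train_x[j]))
    # Step 2: keep the k nearest neighbors.
    neighbors = order[:k]
    # Step 3: take the class with the largest share among the neighbors.
    votes = Counter(train_y[j] for j in neighbors)
    return votes.most_common(1)[0][0]
```

For gene selection, the vectors passed in would contain only the genes whose bit is 1 in the binary-coded individual.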
Gene selection can be regarded as a multi-objective optimization problem with two conflicting objectives: selecting the smallest number of genes and maximizing the classification accuracy. An objective function is therefore needed to combine the two objectives into a single function. The fitness function is as follows:
wherein Fitness_i denotes the fitness value of the i-th individual, acc_i denotes its classification accuracy, w_A denotes the weight of the classification accuracy, w_F denotes the weight of the number of selected features, and R is the number of entries equal to '1' in the binary individual, i.e. the length of the selected feature subset of the gene data. D is the dimension of the individual, i.e. the total number of attributes in the gene data set; cc denotes the number of correctly classified samples; and uc denotes the number of incorrectly classified samples.
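By way of illustration only, a fitness function consistent with the definitions above may be sketched in Python as follows. The sketch assumes a weighted sum of the classification error and the fraction of selected genes, which matches the minimization used in steps S3 and S6 (smaller fitness is better); the exact formulas and the weight values 0.95 and 0.05 are assumptions, not values taken from the disclosure.

```python
def accuracy(cc, uc):
    """Classification accuracy from the counts of correct (cc) and incorrect (uc) samples."""
    return cc / (cc + uc)

def fitness(acc, mask, w_a=0.95, w_f=0.05):
    """Assumed form: w_A * (1 - acc) + w_F * R / D, to be minimized.

    mask is the binary-coded individual; R counts its 1-bits and D is its length.
    """
    r = sum(mask)
    d = len(mask)
    return w_a * (1.0 - acc) + w_f * r / d
```

Under this form, two individuals with equal accuracy are ranked by how few genes they select.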
Step S4, setting the relevant parameters of the sine and cosine optimization algorithm, and obtaining the population updated by the sine and cosine optimization algorithm;
specifically, for example, the relevant parameters r_1, r_2, r_3 and r_4 of the sine and cosine optimization algorithm are set, and a new population is obtained by updating with formula (6):
wherein r_1 is a function decreasing linearly over [0, 2], r_2 is a random number in [0, 2π], and r_3 and r_4 are random numbers in [0, 1]; formula (6) yields the j-th feature value of the i-th individual at iteration t+1 from the j-th binary-coded value of the i-th individual at iteration t, obtained from formulas (1) and (2), and from the j-th binary-coded value of the individual having the minimum fitness value in the binary-coded population at iteration t, as evaluated with formulas (4) and (5).
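By way of illustration only, the sine and cosine update of step S4 may be sketched in Python as follows. The sketch assumes the standard SCA position update (a sine branch when r_4 < 0.5 and a cosine branch otherwise) with the parameter ranges stated above; the function name, the argument `a` (the initial value 2 of r_1), and the 0.5 switching threshold are assumptions.

```python
import math
import random

def sca_update(pop, best, t, t_max, a=2.0, rng=random):
    """One SCA iteration over pop (a list of individuals) toward best."""
    r1 = a - t * (a / t_max)                    # decreases linearly from a to 0
    new_pop = []
    for ind in pop:
        new_ind = []
        for x, p in zip(ind, best):
            r2 = rng.uniform(0.0, 2.0 * math.pi)
            r3 = rng.random()                   # [0, 1], as stated in the text
            r4 = rng.random()
            step = abs(r3 * p - x)
            if r4 < 0.5:
                new_ind.append(x + r1 * math.sin(r2) * step)
            else:
                new_ind.append(x + r1 * math.cos(r2) * step)
        new_pop.append(new_ind)
    return new_pop
```

Because r_1 shrinks to 0 as t approaches t_max, the steps become smaller over time and the search moves from exploration to exploitation.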
Step S5, updating the population produced by the sine and cosine optimization algorithm through the salp swarm, moth-flame, and opposition-based learning strategies, respectively, to obtain three populations;
specifically, for example, first, the salp swarm update strategy transposes the population X^{t+1} obtained by formula (6), denoted (X^{t+1})^T; specifically, when i < N/2, formula (7) is used to update the first half of the transposed population; when i > N/2 and i < N+1, formula (8) is used to update the second half of the transposed population; the two halves are combined into the transposed population, which is transposed back to give the new population S^{t+1}, where N is the number of training sample individuals;
wherein t and t_max are the current iteration number and the maximum iteration number, respectively; c_2 and c_3 are random numbers in [0, 1]; the best-solution term is the j-th binary-coded value of the individual having the minimum fitness value in the binary-coded population at iteration t, as evaluated with formulas (4) and (5); ub_j is the upper bound of the j-th dimension and lb_j is the lower bound of the j-th dimension; the inputs are the values in the j-th dimension of the i-th individual and of the (i-1)-th individual of the transposed current population X^{t+1} at iteration t+1; and the result is the transposed value of the i-th individual in the j-th dimension obtained with the salp swarm update strategy at iteration t+1;
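By way of illustration only, the salp swarm update may be sketched in Python as follows. The sketch assumes the standard salp swarm rules: the first half of the population moves around the best solution, and each individual in the second half moves to the midpoint between itself and its predecessor. The coefficient c_1 = 2·exp(-(4t/t_max)^2), the 0.5 threshold on c_3, the treatment of the boundary index, and the function name are assumptions; the transposition used in the text is a storage detail that the sketch omits.

```python
import math
import random

def salp_update(pop, best, lb, ub, t, t_max, rng=random):
    """One salp swarm iteration over pop; lb and ub are per-dimension bounds."""
    n = len(pop)
    c1 = 2.0 * math.exp(-((4.0 * t / t_max) ** 2))    # assumed standard form
    new_pop = []
    for i, ind in enumerate(pop):
        if i < n / 2:
            # First half: move around the best solution found so far.
            row = []
            for j, f in enumerate(best):
                c2, c3 = rng.random(), rng.random()
                step = c1 * ((ub[j] - lb[j]) * c2 + lb[j])
                row.append(f + step if c3 >= 0.5 else f - step)
        else:
            # Second half: midpoint of this individual and the updated previous one.
            prev = new_pop[i - 1]
            row = [(x + y) / 2.0 for x, y in zip(ind, prev)]
        new_pop.append(row)
    return new_pop
```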
secondly, the moth-flame update strategy follows the navigation behavior of moths: each flame serves as a beacon that guides a moth through the search space, the moth updates its current position along a spiral, and the population M^{t+1} is updated using formulas (9)-(11);
wherein formulas (9)-(11) involve the j-th dimension value of the i-th moth individual at iteration t+1, the j-th dimension value of the i-th flame individual at iteration t+1, and the distance between the flame and the moth at iteration t+1; b is a constant coefficient; k is a random number in [-1, 1]; n denotes the maximum number of flames; t denotes the current iteration number; t_max denotes the maximum number of iterations; i denotes the number of the current flame; and round denotes rounding;
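By way of illustration only, the moth-flame update may be sketched in Python as follows. The sketch assumes the standard moth-flame rules: a logarithmic spiral D·e^{bk}·cos(2πk) + F around the flame, with D the distance between flame and moth, and a flame count that shrinks from n to 1 over the iterations. The function names, the default b = 1, and the rule that moths beyond the current flame count share the last flame are assumptions.

```python
import math
import random

def flame_count(n, t, t_max):
    """Number of flames kept at iteration t (Python's round uses banker's rounding)."""
    return round(n - t * (n - 1) / t_max)

def moth_flame_update(moths, flames, t, t_max, b=1.0, rng=random):
    """Move each moth along a spiral around its assigned flame."""
    n_flames = flame_count(len(moths), t, t_max)
    new_moths = []
    for i, moth in enumerate(moths):
        flame = flames[min(i, n_flames - 1)]    # surplus moths share the last flame
        row = []
        for m, f in zip(moth, flame):
            k = rng.uniform(-1.0, 1.0)
            dist = abs(f - m)                   # distance between flame and moth
            row.append(dist * math.exp(b * k) * math.cos(2.0 * math.pi * k) + f)
        new_moths.append(row)
    return new_moths
```

Here `flames` is expected to be the population sorted in ascending order of fitness, as described in step S3.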
finally, the opposition-based (reverse) learning strategy generates, for each original solution, an opposite solution by symmetry; the opposite population O^{t+1} of the current population is obtained using formula (12);
wherein ub_j is the upper bound of the j-th dimension, lb_j is the lower bound of the j-th dimension, and the quantity being reflected is the j-th dimension value of the i-th individual at iteration t+1.
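By way of illustration only, the opposite population may be sketched in Python as follows. The sketch assumes the usual opposition-based learning rule, in which each value is reflected about the midpoint of its bounds (opposite = lb_j + ub_j - x); the function name is hypothetical.

```python
def opposite_population(pop, lb, ub):
    """Reflect every value about the centre of its [lb_j, ub_j] interval."""
    return [[lb[j] + ub[j] - x for j, x in enumerate(ind)] for ind in pop]
```

Applying the reflection twice returns the original population, which is the symmetry referred to above.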
Step S6, screening out the optimal population by greedy selection;
specifically, for example, the fitness values of the three populations S^{t+1}, M^{t+1} and O^{t+1} obtained in step S5 are computed according to formula (4) and formula (5) and sorted in ascending order, and the N individuals with the smallest fitness are selected as the next-generation population X^{t+1}, where N is the number of training sample individuals.
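The greedy selection of step S6 can be sketched in Python as follows. This is a minimal illustration; `fitness_fn` is a hypothetical callback that maps an individual to its fitness value, where smaller is better.

```python
def greedy_select(populations, fitness_fn, n):
    """Merge the candidate populations and keep the n individuals with the smallest fitness."""
    merged = [ind for pop in populations for ind in pop]
    merged.sort(key=fitness_fn)        # ascending order: best individuals first
    return merged[:n]
```

With three candidate populations of N individuals each, the call keeps the best N out of 3N candidates.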
And S7, if the maximum iteration number is reached, ending the loop and outputting an optimal solution, otherwise, adding 1 to the iteration number, and returning to the step S2.
According to a preferred embodiment of the present invention, with respect to a gene selection method provided in the first embodiment of the present invention, the second embodiment of the present invention further provides a gene selection apparatus, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the gene selection method in the first embodiment of the present invention when executing the computer program. It should be noted that, the process of executing the computer program by the processor in the second embodiment of the present invention is consistent with the process of executing each step in the gene selection method provided in the first embodiment of the present invention, and the description will be specifically made with reference to the foregoing related content.
According to a preferred embodiment of the present invention, there is also provided a method for denoising gene expression data, characterized in that the aforementioned gene selection method is employed to remove gene expression data unrelated to classification of samples.
The embodiment of the invention has the following beneficial effects:
in view of the characteristics of gene microarray data, the salp swarm (goblet sea squirt) strategy, the moth-flame strategy, and the opposition-based learning strategy are combined into the SCA algorithm, so that the computational burden and the noise caused by irrelevant genes are greatly reduced and even simple detection rules can be extracted; at the same time, gene expression testing is simplified and the detection cost can be significantly reduced.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in implementing the methods of the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc.
The foregoing disclosure is illustrative of the present invention and is not to be construed as limiting the scope of the invention, which is defined by the appended claims.

Claims (6)

1. A computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, the computer program being adapted to be executed by a processor to implement the steps of a gene selection method, the gene selection method comprising the steps of:
Step S1, acquiring a training set and a test set from a gene microarray data set, and determining an initial population;
Step S2, binary-coding each feature value of each individual of the current population by means of a conversion function;
Step S3, calculating the fitness values of the current population, and updating the relevant parameters of the salp swarm (goblet sea squirt) strategy and the moth-flame (moth fire suppression) strategy;
Step S4, setting the relevant parameters of the sine and cosine optimization algorithm, and updating the population with the iteration formula of the sine and cosine optimization algorithm;
Step S5, sequentially updating the population obtained by the sine and cosine optimization algorithm through the salp swarm, moth-flame, and opposition-based (reverse) learning strategies to obtain three populations;
Step S6, selecting the next-generation population by greedy selection;
Step S7, if the maximum number of iterations is reached, ending the loop and outputting the optimal solution; otherwise, continuing to iterate until the iterative computation ends;
wherein in step S2, each feature value of each individual in the population $X^t$ is converted into a binary coded value through formula (1) and formula (2);
wherein the quantity being converted is the j-th feature value of the i-th individual generated in the t-th iteration, r is a random number in [0,1], the result is the j-th binary coded value of the i-th individual generated in the t-th iteration, and sig denotes the sigmoid function;
in step S3, the fitness values of the population $X^t$ are calculated using formula (3) and formula (4), and the optimal solution used in the goblet sea squirt strategy and the flames $F^t$ and moths $M^t$ involved in the moth fire suppression strategy are updated, wherein the flames $F^t$ are the population obtained by re-ordering the individuals of $X^t$ by fitness value from small to large, and the moths $M^t$ are $X^t$;
wherein $Fitness_i$ denotes the fitness value of the i-th individual, $Acc_i$ denotes its classification accuracy, $w_A$ denotes the weight of the classification accuracy, $w_F$ denotes the weight of the number of selected features, R is the number of entries equal to '1' in the binary individual, i.e. the length of the selected feature subset of the gene data; D is the dimension of the individual, i.e. the total number of attributes in the gene data set; cc denotes the number of correctly classified samples and uc denotes the number of incorrectly classified samples;
in step S4, the relevant parameters $r_1$, $r_2$, $r_3$ and $r_4$ of the sine-cosine optimization algorithm are set, and a new population is obtained by updating with formula (5):
wherein $r_1$ is a function that decreases linearly over [0,2], $r_2$ is a random number in [0,2π], $r_3$ and $r_4$ are random numbers in [0,1], the updated quantity is the j-th feature value of the i-th individual generated at iteration t+1, the quantity being updated is the j-th binary coded value of the i-th individual at iteration t obtained from formula (1) and formula (2), and $P_j^t$ denotes the j-th binary coded value of the individual with the minimum fitness value in the binary-coded population at iteration t, the fitness values being calculated with formula (3) and formula (4).
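Formulas (1) to (5) are not reproduced in this text. The sketch below shows one way steps S2 to S4 of claim 1 could be realised, assuming the forms commonly used in the literature: a sigmoid transfer function with a random threshold, a weighted sum of error rate and selected-feature ratio as fitness, and the standard sine cosine position update. The function names, the direction of the threshold test, the default weights `w_a` and `w_f`, and the 0.5 switch on $r_4$ are all assumptions, not taken from the claims.

```python
import math
import random


def sig(x):
    """Sigmoid function, the 'sig' named in claim 1."""
    return 1.0 / (1.0 + math.exp(-x))


def binarize(individual, rng):
    """Assumed form of formulas (1)-(2): bit j is 1 when r < sig(x_j)."""
    return [1 if rng.random() < sig(x) else 0 for x in individual]


def fitness(bits, cc, uc, w_a=0.95, w_f=0.05):
    """Assumed form of formulas (3)-(4); a smaller value is better.

    Acc = cc / (cc + uc); Fitness = w_A * (1 - Acc) + w_F * R / D.
    The default weights are illustrative only.
    """
    acc = cc / (cc + uc)
    r_selected = sum(bits)  # R: number of entries equal to 1
    d = len(bits)           # D: total number of attributes
    return w_a * (1.0 - acc) + w_f * r_selected / d


def sca_update(x, best, t, t_max, rng, a=2.0):
    """Assumed form of formula (5): standard sine cosine update.

    x: binary coded individual at iteration t; best: P^t, the individual
    with the minimum fitness value.
    """
    r1 = a - t * a / t_max  # decreases linearly from 2 to 0
    new_x = []
    for xj, pj in zip(x, best):
        r2 = rng.uniform(0.0, 2.0 * math.pi)
        r3 = rng.random()
        r4 = rng.random()
        wave = math.sin(r2) if r4 < 0.5 else math.cos(r2)
        new_x.append(xj + r1 * wave * abs(r3 * pj - xj))
    return new_x
```

Under these assumptions an individual that selects 2 of 4 genes and classifies 9 of 10 samples correctly has fitness 0.95 × 0.1 + 0.05 × 0.5 = 0.12.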
2. The computer-readable storage medium of claim 1, wherein in step S1 a training sample population is initialized based on the training sample set obtained by feature extraction, wherein N is the number of training sample individuals, D is the dimension of each training sample, $X^t$ denotes the population at the t-th iteration, each entry of $X^t$ is the j-th feature value of the i-th individual at iteration t, and t denotes the current iteration number, with a value range of [0,1000].
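The population matrix of claim 2 is not reproduced in this text. Assuming the usual layout of one row per individual and one column per feature, and writing $x_{i,j}^{t}$ (an assumed symbol) for the j-th feature value of the i-th individual at iteration t, it could be written as:

```latex
X^{t} =
\begin{bmatrix}
x_{1,1}^{t} & x_{1,2}^{t} & \cdots & x_{1,D}^{t} \\
\vdots      & \vdots      & \ddots & \vdots      \\
x_{N,1}^{t} & x_{N,2}^{t} & \cdots & x_{N,D}^{t}
\end{bmatrix}
```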
3. The computer-readable storage medium according to claim 1, wherein in step S5, updating the population obtained by the sine-cosine optimization algorithm through the goblet sea squirt, moth fire suppression and reverse learning strategies respectively, to obtain three populations, specifically comprises:
first, in the goblet sea squirt update strategy, the population $X^{t+1}$ obtained by formula (5) is transposed, denoted $(X^{t+1})^T$; specifically: when i < N/2, the first half of the transposed population is updated using formula (6); when i > N/2 and i < N+1, the second half of the transposed population is updated using formula (7); the two halves are combined into the transposed population, which is transposed back to obtain the new population $S^{t+1}$, wherein N is the number of training sample individuals;
wherein t and $t_{max}$ are the current iteration number and the maximum iteration number respectively, $c_2$ and $c_3$ are random numbers in [0,1], $P_j^t$ is the j-th binary coded value of the individual with the minimum fitness value in the binary-coded population at iteration t, the fitness values being calculated with formula (3) and formula (4), $ub_j$ is the upper bound of the j-th dimension, $lb_j$ is the lower bound of the j-th dimension, and the remaining three terms are, in order, the transposed value in the j-th dimension of the i-th individual of the current population $X^{t+1}$ at iteration t+1, the transposed value in the j-th dimension of the (i-1)-th individual of the current population $X^{t+1}$ at iteration t+1, and the transposed value in the j-th dimension of the i-th individual obtained with the goblet sea squirt update strategy at iteration t+1;
secondly, the moth fire suppression update strategy adopts the navigation mode of moths, takes the flames as the guides by which the moths explore the search space, updates the current position in a spiral manner, and updates the population using formulas (8)-(10) to obtain $M^{t+1}$;
wherein the moth term is the j-th dimension value of the i-th moth individual at the (t+1)-th iteration, the flame term is the j-th dimension value of the i-th flame individual at the (t+1)-th iteration, the distance term is the distance between the flame and the moth at the (t+1)-th iteration, b is a constant coefficient, k is a random number from -1 to 1, n denotes the maximum number of flames, t denotes the current iteration number, $t_{max}$ denotes the maximum number of iterations, i denotes the number of current flames, and round denotes rounding;
finally, the reverse learning strategy generates a reverse solution that is symmetric to the original solution; the reverse population $O^{t+1}$ of the current population is obtained using formula (11);
wherein $ub_j$ is the upper bound of the j-th dimension, $lb_j$ is the lower bound of the j-th dimension, and the remaining term is the j-th dimension value of the i-th individual at the (t+1)-th iteration.
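Formulas (6) to (11) are likewise not reproduced in this text. The sketch below illustrates the three updates of claim 3 under the assumption that they follow the standard salp swarm, moth-flame and opposition-based learning equations. All function names, the coefficient `c1 = 2*exp(-(4t/t_max)^2)`, the 0.5 switch on $c_3$, the rule for sharing the last flame, and the 0-based handling of the N/2 boundary are assumptions.

```python
import math
import random


def salp_update(pop, best, lb, ub, t, t_max, rng):
    """Assumed formulas (6)-(7): the first half moves around the best
    solution, the second half averages with the individual ahead of it."""
    n = len(pop)
    c1 = 2.0 * math.exp(-((4.0 * t / t_max) ** 2))
    new_pop = [row[:] for row in pop]  # leave the input untouched
    for i in range(n):
        for j in range(len(best)):
            if i < n / 2:  # leader-style update (0-based index)
                c2, c3 = rng.random(), rng.random()
                step = c1 * ((ub[j] - lb[j]) * c2 + lb[j])
                new_pop[i][j] = best[j] + step if c3 >= 0.5 else best[j] - step
            else:          # follower update
                new_pop[i][j] = 0.5 * (new_pop[i][j] + new_pop[i - 1][j])
    return new_pop


def flame_count(n, t, t_max):
    """Assumed formula (10): the number of flames shrinks from n to 1."""
    return round(n - t * (n - 1) / t_max)


def moth_flame_update(moths, flames, t, t_max, rng, b=1.0):
    """Assumed formulas (8)-(9): logarithmic spiral around a flame."""
    n_flames = flame_count(len(moths), t, t_max)
    new_moths = []
    for i, moth in enumerate(moths):
        flame = flames[min(i, n_flames - 1)]  # surplus moths share the last flame
        row = []
        for mj, fj in zip(moth, flame):
            k = rng.uniform(-1.0, 1.0)
            dist = abs(fj - mj)
            row.append(dist * math.exp(b * k) * math.cos(2.0 * math.pi * k) + fj)
        new_moths.append(row)
    return new_moths


def opposite_population(pop, lb, ub):
    """Assumed formula (11): o_j = lb_j + ub_j - x_j."""
    return [[lb[j] + ub[j] - xj for j, xj in enumerate(row)] for row in pop]
```

In this sketch `flames` is expected to be the population sorted by ascending fitness, as described for $F^t$ in claim 1, and `best` is the individual with the minimum fitness value.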
4. The computer-readable storage medium of claim 3, wherein the three populations $S^{t+1}$, $M^{t+1}$ and $O^{t+1}$ obtained in step S5 are combined, their fitness values are obtained according to formula (3) and formula (4) and sorted from small to large, and the first N individuals with the smallest fitness values are selected as the next-generation population $X^{t+1}$, where N is the number of training sample individuals.
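The greedy selection of claim 4 can be sketched as a merge, sort and truncate. The name `greedy_select` and the `fitness_fn` callback are illustrative assumptions; any function that returns the fitness value of an individual (smaller is better) can be passed in.

```python
def greedy_select(populations, fitness_fn, n):
    """Merge the candidate populations and keep the n fittest individuals.

    populations: e.g. [S, M, O], each a list of individuals.
    fitness_fn: maps an individual to its fitness value (smaller is better).
    """
    merged = [individual for pop in populations for individual in pop]
    ranked = sorted(merged, key=fitness_fn)  # ascending fitness, stable sort
    return ranked[:n]
```

Because the sort is stable, individuals with equal fitness keep the order in which the three populations were merged.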
5. The computer-readable storage medium of claim 4, wherein if the maximum number of iterations is reached, the loop is ended and the optimal solution is output; otherwise the iteration number is increased by 1 and the process returns to step S2.
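The termination rule of claim 5 can be sketched as a plain loop. Here `step_fn` is a hypothetical callback standing in for steps S2 to S6, and `fitness_fn` is the fitness function; neither name comes from the claims.

```python
def run(population, step_fn, fitness_fn, max_iter):
    """Iterate until max_iter is reached, then return the fittest individual."""
    t = 0
    while t < max_iter:
        population = step_fn(population, t)  # stands in for steps S2-S6
        t += 1                               # iteration number increased by 1
    return min(population, key=fitness_fn)   # smallest fitness is optimal
```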
6. A noise removal device for gene expression data, comprising a memory and a processor, said memory storing a computer program, characterized in that said processor uses the gene selection method according to any one of claims 1 to 5 to remove gene expression data unrelated to sample classification when executing said computer program.
CN202311331114.XA 2020-09-17 2020-09-17 Storage medium storing gene selection method program Pending CN117238379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311331114.XA CN117238379A (en) 2020-09-17 2020-09-17 Storage medium storing gene selection method program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202311331114.XA CN117238379A (en) 2020-09-17 2020-09-17 Storage medium storing gene selection method program
CN202010982171.4A CN112215259B (en) 2020-09-17 2020-09-17 Gene selection method and apparatus

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202010982171.4A Division CN112215259B (en) 2020-09-17 2020-09-17 Gene selection method and apparatus

Publications (1)

Publication Number Publication Date
CN117238379A true CN117238379A (en) 2023-12-15

Family

ID=74050452

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202311331114.XA Pending CN117238379A (en) 2020-09-17 2020-09-17 Storage medium storing gene selection method program
CN202010982171.4A Active CN112215259B (en) 2020-09-17 2020-09-17 Gene selection method and apparatus

Country Status (1)

Country Link
CN (2) CN117238379A (en)

Also Published As

Publication number Publication date
CN112215259A (en) 2021-01-12
CN112215259B (en) 2023-12-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination