CN113571134A - Method and device for selecting gene data characteristics based on backbone particle swarm optimization - Google Patents


Info

Publication number
CN113571134A
CN113571134A
Authority
CN
China
Prior art keywords
gene
algorithm
gbest
instance
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110858994.0A
Other languages
Chinese (zh)
Other versions
CN113571134B (en)
Inventor
许镇义
潘凯
程凡
康宇
曹洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Original Assignee
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Artificial Intelligence of Hefei Comprehensive National Science Center filed Critical Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority to CN202110858994.0A priority Critical patent/CN113571134B/en
Publication of CN113571134A publication Critical patent/CN113571134A/en
Application granted granted Critical
Publication of CN113571134B publication Critical patent/CN113571134B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00: ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20: Screening of libraries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • G06F18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147: Distances to closest patterns, e.g. nearest neighbour classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/12: Computing arrangements based on biological models using genetic models
    • G06N3/126: Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The invention relates to a method and a device for selecting gene data features based on the backbone particle swarm algorithm. Starting from a gene disease data set, the population is initialized and the gene features are randomly divided into four groups by a random grouping algorithm; a proxy instance algorithm then deletes part of the instances in the training set to generate a proxy instance set. Each group of gene features undergoes Tmax iterations of the backbone particle swarm algorithm, divided into two stages in which the fitness values of the particles are evaluated on the proxy instance set and on the original instance set respectively. The first stage uses the grouping algorithm to reduce the dimensionality of the data, accelerating optimization and finding the optimization direction in a global search; the second stage uses a local search algorithm so that, once the optimization direction is found, the population converges to the optimal solution, i.e. the gene features with the best effect, which are then output. The invention improves the classification accuracy of gene data and achieves good results on multiple disease gene data sets.

Description

Method and device for selecting gene data characteristics based on backbone particle swarm optimization
Technical Field
The invention relates to the technical field of large-scale feature selection, in particular to a method and a device for selecting gene data features based on a backbone particle swarm algorithm.
Background
Data mining and machine learning typically involve a large number of features, but not all of them are essential: many are redundant or even irrelevant, which can degrade the performance of the algorithm. Feature selection aims to solve this problem by selecting a subset from the original feature set. Feature selection is nevertheless a challenging task, mainly because the search space grows exponentially with the number of dimensions; evolutionary algorithms are well known for their global search capability. For large-scale feature selection, the main challenges are that the large number of instances makes the evaluation stage computationally expensive, and that the high dimensionality lowers classification accuracy.
However, gene expression profile data is characterized by small sample sizes, high dimensionality, high noise and high redundancy, which makes it very difficult to deeply and accurately mine the biomedical knowledge contained in gene expression profiles and to select tumor-informative genes. Gene expression profile data contains the expression levels of all measurable genes in tissue cells, but only a small number of genes are actually associated with the sample class.
Disclosure of Invention
The invention provides a gene data feature selection method based on the backbone particle swarm algorithm, which can solve the above technical problems. It takes the backbone particle swarm algorithm as a framework and combines it with a proxy instance algorithm, a random grouping algorithm and a local search algorithm.
In order to achieve the purpose, the invention adopts the following technical scheme:
a gene data feature selection method based on a backbone particle swarm algorithm comprises the following steps:
based on the gene disease data set, the following steps are carried out by a computer device,
s1, initializing gene disease data set populations, and randomly initializing gene characteristics through a random grouping algorithm to divide the gene characteristics into four groups;
s2, deleting part of examples in the training set by using an agent example algorithm to generate an agent example set;
S3, performing Tmax iterations of the backbone particle swarm algorithm for each group of gene features; the Tmax iterations are divided into two stages, in which the fitness values of the particles are evaluated on the proxy instance set and on the original instance set respectively; the first stage uses the grouping algorithm to reduce the dimensionality of the data, accelerating optimization and finding the optimization direction in a global search; the second stage uses a local search algorithm so that the population, having found the optimization direction, finds the optimal solution, i.e. the gene features with the best effect;
s4, outputting the gene characteristics with the best effect.
Further, the step of the proxy instance algorithm in S2 includes:
firstly, the algorithm deletes noise instances from the training set: if an instance is misclassified by its k nearest neighbors, it is treated as a noise instance and deleted;
then, for each remaining instance, the "enemy" distance is calculated, i.e. the distance from the instance to its closest instance of a different class; the larger the "enemy" distance, the farther the instance is from the class boundary. The instances are sorted by "enemy" distance, and those with larger "enemy" distances are deleted first;
finally, a nearest-neighbor list and an association list are built for each instance; if deleting an instance does not affect the classification of the remaining instances in S, the instance is deleted. When an instance P is deleted, P is removed from the neighbor lists of the instances in its association list, and each of those instances then finds a new neighbor so that it still has k neighbors in its list; when an instance finds a new neighbor N, it is also added to the association list of N;
the reduced S obtained after these deletions is the proxy instance data set.
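The "enemy" distance used to order the deletions can be sketched in Python as follows; this is an illustrative sketch, and the function names are ours rather than the patent's:

```python
import numpy as np

def enemy_distances(X, y):
    """For each instance, the distance to its nearest "enemy",
    i.e. the closest instance belonging to a different class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dists = np.full(len(X), np.inf)
    for i in range(len(X)):
        mask = y != y[i]                      # instances of a different class
        if mask.any():
            diffs = X[mask] - X[i]
            dists[i] = np.sqrt((diffs ** 2).sum(axis=1)).min()
    return dists

def deletion_order(X, y):
    """Instances with a larger enemy distance lie farther from the class
    boundary and are therefore deleted first."""
    return np.argsort(-enemy_distances(X, y))
```
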
Further, the pseudo code of the proxy instance algorithm is as follows:
parameters are as follows: t training set
10)S=T
11) Each instance P in FOR S
(1) If the k neighbor class labels of P are not consistent with the original labels of P, it is considered as a noise example
(2) Deleting P from S
12)ENDFOR
13) Each instance P in FOR S
(1) Find the k+1 nearest neighbors N1, ..., Nk+1 of P
(2) Add P to the association list of each of N1, ..., Nk+1
14)ENDFOR
15) Calculate the "enemy" distance for each instance
16) Each instance P in FOR S
(1) with = the number of instances in the association list of P that are classified correctly if P is kept
(2) without = the number of instances in the association list of P that are classified correctly if P is deleted
(3)IF without>=with
Removing P from S
② A in the FOR P association list
1) Delete P from neighbor list in A
2) Re-finding A's neighbors
3) Add A to the association list of the new neighbor
③ENDFOR
N neighbors in FOR P
1) Deleting P from the association list of N
⑤ENDFOR
(4)ENDIF
17)ENDFOR
18) And returning to the S.
Further, the random grouping algorithm in S1 includes the following steps,
initializing, randomly numbering the features, and randomly splitting the features into four groups;
during the iterative process, if the population falls into a local optimum, the groups are split, and the new number of groups is twice the original number.
Further, in the first stage of S3, initially only a rough direction needs to be found in the search space, so the proxy instance set replaces the original evaluation function; if the population falls into a local optimum, the groups are split and exploration continues; once the groups have been split to the minimum, i.e. the number of groups equals the data dimensionality, the second stage begins;
in the second stage, the original data set in the training set is used for evaluation together with a local search algorithm; all gbest information from this stage is stored and used to improve the current gbest.
Further, the step of the local search algorithm in S3 includes,
suppose S_best is the set of all features selected by previous gbests, where every feature has a score; the local search constructs p/10 local candidate solutions, where p is the current population size, and each local candidate is no larger than the subset selected by the current gbest;
features are drawn from S_best by tournament selection, so features with higher scores are more likely to be chosen; features that appeared more often in past gbests are therefore selected more frequently, and duplicate features may be drawn, so a local candidate may be smaller than the current gbest;
once the p/10 local candidates have been selected, they are evaluated on the proxy instance set so that the best local candidate feature subset can be found quickly; finally, the best candidate feature subset is compared with the current gbest, both are evaluated on the complete training set, and the better feature subset is set as gbest.
Further, the pseudo code of the local search algorithm is as follows,
parameters are as follows: sbestFeature set of previous gbest | features selected by current gbest
i.FOR i=1:p/10
(1) Selecting | gbest | feature composition subset P by tournamenti
(2) Evaluating the subset P with a proxy instancei
ii.ENDFOR
From P1To Pp/10Selecting a subset P of the top showk
iv. mixing PkCompared with the evaluation of the current gbest based on the original training set
v. put the features selected by gbest into Sbest
Set the superior party to gbest
Return gbest.
Further, the local search algorithm also includes calculating the score of each feature appearing in S_best; the score is defined as follows:
score_f = freq_f + gc_f
wherein:
freq_f = the number of times feature f has appeared in past gbests
gc_f = 1/|gbest| if feature f appears in the current gbest, and 0 otherwise
the first part of the score is frequency, measured as the number of occurrences of the feature in the past gbest, if more occurrences, representing the better quality of the feature;
the second part is to set the feature to 1/| gbest |, if it appears in the current gbest, as compared to the current gbest.
In another aspect, the invention also discloses a gene data feature selection system based on the backbone particle swarm algorithm, comprising the following units:
the initialization unit is used for initializing gene disease data set population and randomly initializing gene characteristics through a random grouping algorithm to be divided into four groups;
the agent instance set setting unit is used for deleting part of instances in the training set by using an agent instance algorithm to generate an agent instance set;
the optimal gene feature determining unit is used for performing Tmax iterations of the backbone particle swarm algorithm on each group of gene features; the Tmax iterations are divided into two stages, in which the fitness values of the particles are evaluated on the proxy instance set and on the original instance set respectively; the first stage uses the grouping algorithm to reduce the dimensionality of the data, accelerating optimization and finding the optimization direction in a global search; the second stage uses a local search algorithm so that the population, having found the optimization direction, finds the optimal solution, i.e. the gene features with the best effect;
and the data output unit is used for outputting the gene characteristics with the best effect.
In a third aspect, the present invention also discloses a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method as described above.
According to the above technical scheme, and in view of the drawbacks of current characteristic gene selection algorithms such as long training time and low classification accuracy, the invention designs an efficient characteristic gene selection and classification algorithm that selects fewer characteristic genes while maintaining or even improving the classification accuracy of the associated model. The method is based on backbone particle swarm optimization (BBPSO) and is optimized in two respects: the evaluation function and the search mechanism. For the evaluation function, the first stage uses the proxy instance algorithm to delete part of the instances, reducing the time of the KNN evaluation. For the search mechanism, the first stage randomly splits the features into four groups for iterative evolution; whenever the population falls into a local optimum, the groups are split further and iteration continues, until they can be split no more, at which point the second stage begins. This reduces computational cost early in the search and quickly finds the optimization direction of the particles, so that the population can gradually approach the target in the second stage. The algorithm of the invention achieves good results on multiple disease gene data sets.
With the backbone particle swarm algorithm as a framework, the invention randomly groups the dimensions to reduce the dimensionality, increasing classification accuracy and reducing the time spent by the search mechanism, while deleting part of the instances to reduce the time of the evaluation stage. Experimental results on gene disease data sets of 500 to 10000 dimensions show that the proposed algorithm improves the classification accuracy of gene data, indicating that pathogenic genes are effectively selected.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention.
As shown in fig. 1, the method for selecting gene data features based on the backbone particle swarm algorithm described in this embodiment first removes part of the instances via the proxy instance algorithm to form a proxy instance set, reducing evaluation time. In the first stage, the whole population is still performing a large-scale global search and needs to find a rough optimization direction, so evolutionary computation is carried out on the dimension-reduced data and the proxy instance set is used for evaluation. If the population stays in a local optimum for some time, the groups are split until they can be split no further, and the second stage begins. In the second stage, once the population has found the optimization direction, a more detailed search is carried out: during the evolutionary computation, the information of every particle that has ever been the global best (gbest) is stored and used to guide the current gbest. The original data set is used for evaluation at this stage.
Specifically, the embodiment of the invention comprises the following steps:
s1, initializing gene disease data set populations, and randomly initializing gene characteristics through a random grouping algorithm to divide the gene characteristics into four groups;
s2, deleting part of examples in the training set by using an agent example algorithm to generate an agent example set;
S3, performing Tmax iterations of the backbone particle swarm algorithm for each group of gene features; the Tmax iterations are divided into two stages, in which the fitness values of the particles are evaluated on the proxy instance set and on the original instance set respectively; the first stage uses the grouping algorithm to reduce the dimensionality of the data, accelerating optimization and finding the optimization direction in a global search; the second stage uses a local search algorithm so that the population, having found the optimization direction, finds the optimal solution, i.e. the gene features with the best effect;
s4, outputting the gene characteristics with the best effect.
The following are specifically described:
first, the following proxy instance algorithm is introduced
In most feature selection classification algorithms, the K-NN algorithm is used as the classifier. However, because the distance between every pair of instances must be calculated and all training instances must be stored, the memory requirement is large, and all instances must be searched to classify an input vector, so classification is slow. In addition, noise instances reduce generalization accuracy.
In the embodiment of the invention, an instance selection algorithm selects a few representative instances to form a new instance set, called the proxy training set. The proxy training set preserves the information in the original training set, and when computing fitness it can be used to estimate the fitness value, reducing computation time.
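As a sketch of how the proxy training set plugs into fitness evaluation, the following hedged example (function name and signature are ours, not the patent's) scores a candidate feature mask with k-NN; in the first stage the proxy instance set would be passed as the training data, and in the second stage the full training set:

```python
import numpy as np

def knn_fitness(train_X, train_y, test_X, test_y, mask, k=3):
    """k-NN accuracy on (test_X, test_y) using only the feature columns
    selected by the 0/1 mask; the caller chooses whether train_X/train_y
    is the proxy instance set (stage 1) or the full training set (stage 2)."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0                               # empty subset: worthless
    A = np.asarray(train_X, dtype=float)[:, cols]
    B = np.asarray(test_X, dtype=float)[:, cols]
    ty = np.asarray(train_y)
    correct = 0
    for i, b in enumerate(B):
        d = np.sqrt(((A - b) ** 2).sum(axis=1))  # distances to training set
        nn = np.argsort(d)[:k]                   # k nearest neighbours
        correct += np.bincount(ty[nn]).argmax() == test_y[i]
    return correct / len(B)
```
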
The whole proxy instance algorithm works as follows. First, noise instances are removed, i.e. instances misclassified by their k nearest neighbors; this eliminates noise and also removes some boundary instances, which smooths the decision boundary and helps avoid overfitting the data. Next, central instances are deleted, mainly because removing a central instance affects classification performance less than removing a boundary instance; for each instance, if deleting it does not degrade the correct classification of its neighbor instances, the instance is deleted. The specific pseudo code is as follows:
Proxy instance algorithm (correct_Distance):
parameters are as follows: t training set
19)S=T
20) Each instance P in FOR S
(1) If the k neighbor class labels of P are not consistent with the original labels of P, it is considered as a noise example
(2) Deleting P from S
21)ENDFOR
22) Each instance P in FOR S
(1) Find the k+1 nearest neighbors N1, ..., Nk+1 of P
(2) Add P to the association list of each of N1, ..., Nk+1
23)ENDFOR
24) Calculate the "enemy" distance for each instance
25) Each instance P in FOR S
(1) with = the number of instances in the association list of P that are classified correctly if P is kept
(2) without = the number of instances in the association list of P that are classified correctly if P is deleted
(3)IF without>=with
Removing P from S
② A in the FOR P association list
1) Delete P from neighbor list in A
2) Re-finding A's neighbors
3) Add A to the association list of the new neighbor
③ENDFOR
N neighbors in FOR P
1) Deleting P from the association list of N
⑤ENDFOR
(4)ENDIF
26)ENDFOR
27) And returning to the S.
First, the algorithm deletes the noise instances: if an instance is misclassified by its k nearest neighbors, it is treated as a noise instance and deleted. Then, for each remaining instance, the "enemy" distance is calculated, i.e. the distance to the closest instance of a different class; the larger the "enemy" distance, the farther the instance is from the class boundary. The instances are sorted by "enemy" distance, and those with larger "enemy" distances are deleted first. Finally, a nearest-neighbor list and an association list are built for each instance. An instance is deleted if its deletion does not affect the classification of the remaining instances in S. When an instance P is deleted, P is removed from the neighbor lists of the instances in its association list, and each of those instances then finds a new neighbor so that it still has k neighbors in its list; when such an instance finds a new neighbor N, it is also added to the association list of N. The reduced S obtained after these deletions is the proxy instance data set.
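The noise-removal first pass can be sketched in Python as follows; this is illustrative only, and the helper name and majority-vote tie-breaking are our assumptions:

```python
import numpy as np

def remove_noise(X, y, k=3):
    """Drop every instance whose k nearest neighbours (excluding itself)
    vote for a class different from its own label: the first pass of
    the proxy instance algorithm."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = []
    for i in range(len(X)):
        d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        d[i] = np.inf                       # exclude the instance itself
        nn = np.argsort(d)[:k]              # its k nearest neighbours
        votes = np.bincount(y[nn], minlength=int(y.max()) + 1)
        if votes.argmax() == y[i]:          # neighbours agree: not noise
            keep.append(i)
    return X[keep], y[keep], keep
```
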
In PSO, the population explores the entire search space to locate promising regions. Early in the iteration, the population can estimate where the global optimal solution is likely to be, so it can be evaluated on the surrogate (proxy) instance data set. In the later stages of the iteration, the entire training set is used.
2.1 Random grouping
This subsection describes the optimization of the algorithm at the search stage. The search space for feature selection grows exponentially with the number of features; when feature selection is applied to high-dimensional data, very large memory and computation time are usually required, and the complex search space of high-dimensional data also poses a great challenge to conventional PSO.
The random grouping algorithm rules are as follows:
initializing, randomly numbering the features, and randomly splitting the features into four groups;
in the iterative process, if the population falls into a local optimal situation, the group splitting is carried out, and the new group number is 2 times the original group number.
For example, the following table:
[Table: a 16-dimensional 0/1 feature subset; after random shuffling the feature indexes are split into 4 groups, e.g. features 1, 5, 9 and 13 form one group whose shared bit is 1, and features 3, 7, 11 and 15 form another group whose shared bit is 0]
In the first row, a 0 means the feature subset does not select that feature, and a 1 means it does. After random grouping, the indexes of the dimensions are shuffled; assuming the features are initially divided into 4 groups, features 1, 5, 9 and 13 are all set to 1, and similarly features 3, 7, 11 and 15 are all set to 0, so the original 16-dimensional feature selection problem is reduced to a 4-dimensional one.
Through random grouping, the originally expensive search space is reduced to a lower-dimensional feature selection problem; the direction of the population's evolution can be found early, saving a large amount of search time in the early stage. Once the population falls into a local optimum, the groups are split: as described above, once gbest has not been updated for p generations, the whole population is considered to have possibly fallen into a local optimum, and the 4 groups are split into 8:
[Table: the same 16 features after splitting into 8 groups of two features each]
Splitting into 8 groups turns the problem into an 8-dimensional feature selection problem. If complementary features end up in the same group, good results can be achieved. In addition, random grouping greatly alleviates the premature convergence problem of PSO.
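The grouping rules above admit a compact sketch; the helper names are hypothetical, since the patent specifies only the rules, not code:

```python
import random

def init_groups(n_features, n_groups=4, seed=0):
    """Randomly shuffle the feature indexes and split them into
    n_groups equal-sized groups (rule 1 above)."""
    rng = random.Random(seed)
    idx = list(range(n_features))
    rng.shuffle(idx)
    size = n_features // n_groups
    return [idx[g * size:(g + 1) * size] for g in range(n_groups)]

def expand(group_bits, groups, n_features):
    """Blow a low-dimensional group bit-vector up into the full feature
    mask: every feature in group g receives the bit chosen for g."""
    mask = [0] * n_features
    for bit, members in zip(group_bits, groups):
        for f in members:
            mask[f] = bit
    return mask

def split_groups(groups, seed=0):
    """Rule 2: on stagnation each group splits in two, doubling the
    number of groups (4 to 8, 8 to 16, down to one feature per group)."""
    rng = random.Random(seed)
    new = []
    for g in groups:
        g = list(g)
        rng.shuffle(g)
        new += [g[:len(g) // 2], g[len(g) // 2:]]
    return new
```

So a 16-feature problem searched with 4 groups is a 4-bit problem whose bit-vector `expand`s back to a full 16-feature mask, exactly as in the table above.
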
PSO can quickly detect promising regions, but its local search capability is not strong, so a local search algorithm is proposed here.
2.2 Local search
The main idea of the PSO algorithm is to guide the particles with the current pbest and gbest. However, previous gbests may also contain useful information: some very good features may not appear in the current gbest at the same time, and such features may be complementary. The local search idea proposed here is therefore mainly to keep the information of previous gbests and use it to improve the current gbest.
Suppose S_best is the set of all features previously selected by gbests, where every feature has a score. The local search constructs p/10 local candidate solutions, where p is the current population size, and each local candidate is no larger than the subset selected by the current gbest. Features are drawn from S_best by tournament selection, so features with higher scores are more likely to be chosen; features that appeared more often in past gbests are therefore selected more frequently, and duplicate features may be drawn, so a local candidate may be smaller than the current gbest. Once the p/10 local candidates have been selected, they are evaluated on the proxy instance set so that the best local candidate feature subset can be found quickly. Finally, the best candidate feature subset is compared with the current gbest, both are evaluated on the complete training set, and the better feature subset is set as gbest. The pseudo code is as follows:
Algorithm 4: local search algorithm (local_search)
Parameters: S_best: feature set of previous gbests; |gbest|: the number of features selected by the current gbest
i. FOR i = 1 : p/10
(1) Select |gbest| features by tournament to compose subset P_i
(2) Evaluate subset P_i with the proxy instance set
ii. ENDFOR
iii. From P_1 to P_{p/10}, select the best-performing subset P_k
iv. Compare P_k with the current gbest, evaluating both on the original training set
v. Put the features selected by gbest into S_best
vi. Set the better of the two as gbest
vii. Return gbest.
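A runnable sketch of the local-search pseudocode above; the fitness callables, the size-2 tournament and the tie-breaking are caller-supplied assumptions, not fixed by the patent:

```python
import random

def local_search(gbest, scores, fitness_proxy, fitness_full, p, rng=random):
    """Build p//10 candidate subsets by tournament over the scored
    features of past gbests, rank them with the cheap proxy fitness,
    then confirm the winner against the current gbest on the full set."""
    feats = list(scores)
    best_cand, best_fit = set(), -1.0
    for _ in range(max(1, p // 10)):
        cand = set()
        for _ in range(len(gbest)):              # draw at most |gbest| features
            a, b = rng.choice(feats), rng.choice(feats)
            cand.add(a if scores[a] >= scores[b] else b)  # size-2 tournament
        fit = fitness_proxy(cand)                # cheap proxy evaluation
        if fit > best_fit:
            best_cand, best_fit = cand, fit
    # final comparison on the original (full) training set
    if fitness_full(best_cand) > fitness_full(set(gbest)):
        return best_cand
    return set(gbest)
```

Because `cand` is a set, duplicate tournament winners collapse, which is exactly why a candidate can end up smaller than |gbest|.
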
The next question is how to calculate the score of each feature appearing in S_best. The main idea is that the importance of a feature is determined by the frequency of its occurrence in past gbests and by whether it is present in the current gbest. The score is defined as follows:
score_f = freq_f + gc_f
wherein:
freq_f = the number of times feature f has appeared in past gbests
gc_f = 1/|gbest| if feature f appears in the current gbest, and 0 otherwise
the first part of the score is the frequency, measured as the number of occurrences of the feature in the past gbest, which if more occurrences, indicates better quality of the feature. The second part is to set the feature to 1/| gbest |, if it appears in the current gbest, as compared to the current gbest. This allows the feature that has appeared at the current gbest to be preferentially selected if the feature appears the same number of times, so that the last subset of features that is formed does not differ much from the current gbest.
The local search thus generates a new subset from the features of the best subsets of previous iterations and lets it compete with the current gbest. Moreover, the feature subsets produced by local search are all smaller than |gbest|, which puts the emphasis on removing features and makes it more likely that redundant features are discarded. This benefits the direction of the search and also serves another goal of feature selection: a small feature subset.
Algorithm framework
The flow of the whole algorithm is shown in fig. 1. First, the population is initialized and the features are randomly grouped; then the proxy instance algorithm deletes part of the instances in the training set to generate the proxy instance set. Tmax iterations are then performed, divided into two stages. In the first stage, only a rough search direction has to be found, so the proxy instance set replaces the original evaluation function. If the population appears trapped in a local optimum, for example when gbest has not improved for 3 iterations, the subgroups are split and exploration continues; once the subgroups have been split to the minimum, i.e. the number of subgroups equals the data dimension, the second stage begins. In the second stage, evaluation uses the original training set and local search is applied; all gbest information from this stage is stored and used to improve the current gbest.
Specifically, the first stage comprises the following specific steps:
updating the historical optimal position of each particle in the population and the global optimal position of the population;
updating the position of each particle;
evaluating a function fitness value of the particle using the set of proxy instances;
judging whether the population is likely to fall into the local optimum or not, and if the population falls into the local optimum, performing group splitting;
judging whether the group splitting is finished, if so, entering a second stage, and otherwise, repeating the first stage;
the second stage comprises the following specific steps:
updating the historical optimal position of each particle in the population and the global optimal position of the population;
performing a more refined search using a local search algorithm;
updating the position of each particle;
evaluating a function fitness value of the particle using the original instance set;
and judging whether the iteration times reach Tmax times, and if so, ending the iteration.
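The two-stage flow above can be condensed into a skeleton. This is an illustrative sketch only: `step`, `stuck`, `local_search`, and the two evaluators are placeholders for the BBPSO position update, the stagnation test, Algorithm 4, and the proxy/full fitness functions; lower fitness is better.

```python
def run_two_stage(tmax, dim, groups, evaluate_proxy, evaluate_full,
                  step, stuck, local_search):
    """Skeleton of the two-stage loop: stage 1 evaluates candidates on the
    proxy instance set and splits groups on stagnation; stage 2 switches to
    the full training set and applies local search to gbest."""
    stage, gbest, gfit = 1, None, float("inf")
    for t in range(tmax):
        # reserve at least Tmax * 0.3 iterations for stage 2 (see below)
        if stage == 1 and t >= int(tmax * 0.7):
            stage = 2
        candidate = step(groups)                  # BBPSO position update (stand-in)
        fit = evaluate_proxy(candidate) if stage == 1 else evaluate_full(candidate)
        if fit < gfit:
            gbest, gfit = candidate, fit
        if stage == 1 and stuck(t):
            groups *= 2                           # group splitting
            if groups >= dim:                     # split down to the dimension -> stage 2
                stage = 2
        elif stage == 2:
            gbest = local_search(gbest)           # refine gbest (Algorithm 4 stand-in)
    return gbest, gfit
```

The skeleton collapses the per-particle updates into a single `step` call; in the full algorithm each group of particles is updated and evaluated separately.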
The algorithm is optimized in two respects, evaluation and search. On the evaluation side, proxy instances effectively shorten evaluation time; on the search side, random grouping and local search improve the original backbone particle swarm algorithm for high-dimensional data, so the method can handle most high-dimensional feature selection problems.
Note that the transition condition between the first and second stages is that group splitting is complete. However, several experiments showed that on some high-dimensional data the algorithm may still be in the first stage after Tmax iterations. To solve this problem, we stipulate that the second stage must run for at least t = Tmax × 0.3 iterations, so that the search converges more reliably to an excellent region and achieves better results on the test set.
The following are experimental results:
data set name Number of features Number of examples Categories
Madelon 500 2600 2
colon 2000 62 2
Lung 3312 203 5
prostate 10509 102 2
Four gene disease data sets were selected: Lung (lung cancer), Colon (colon cancer), Madelon, and Prostate (prostate cancer). Their dimensionality lies roughly between 500 and 10000 and their instance counts between 62 and 2600, so the ratio of features to instances varies greatly. Colon, Lung, and Prostate all have far fewer instances than features, which poses a great challenge to classification accuracy, while Madelon has far more instances than features, so computing the evaluation function usually takes a long time. These data sets were chosen for exactly these reasons.
[Experimental result tables (presented as images in the original): classification accuracy and running time of SurBBPSO and the comparison algorithms on the four data sets.]
Comparing the classification accuracy on the test set and the training set gives 8 comparisons over the 4 data sets; our SurBBPSO wins 7 of them and is second in the only remaining one. Compared with BBPSO it improves on all 4 data sets, and it essentially matches the two classical feature selection algorithms PSO and NSGA-II. The results show that in the search phase SurBBPSO searches more effectively than the reference algorithms: through random grouping it finds the approximate direction at an early stage without falling into local optima, and after entering the second stage it tries to escape local optima using the original evaluation together with the newly proposed local search. The local search is dedicated to finding a smaller gbest and leads the particles toward smaller feature subsets, which is itself a goal of feature selection.
In terms of running time, the proxy instance algorithm greatly reduces the time spent on the evaluation function, and random grouping saves considerable time in the early iterations. The reduction is most dramatic on Madelon, about 4-fold, which matches the characteristics of that data set: with 2600 instances, many instances are deleted and the effect is most obvious. For the other three data sets the reduction is smaller, only about ten to twenty percent for Lung and Prostate, while on Colon the running time is actually longer, mainly because the number of instances is small, so deleting instances has little effect.
In summary, analyzing and mining the classification characteristics of samples from gene expression profiles has important biological significance for revealing disease onset and pathological processes. Gene expression profile data contain the expression levels of all measurable genes in tissue cells, but only a small number of genes are actually associated with the sample class. For high-dimensional gene expression profile data, the invention therefore offers stronger exploratory generalization ability, better models feature selection for such data, and selects from thousands of genes the characteristic genes that are effective for sample classification, which is of great exploratory significance and practical value for disease classification and clinical medicine. The embodiment of the invention randomly groups the dimensions to reduce dimensionality, increase classification accuracy, and shorten the time of the search mechanism, while deleting part of the instances to shorten the evaluation stage. Experimental results on gene disease data sets of 500 to 10000 dimensions show that the proposed algorithm improves the classification accuracy of gene data, indicating that pathogenic genes are effectively selected.
On the other hand, the invention also discloses a gene data characteristic selection system based on the backbone particle swarm optimization, which comprises the following units,
the initialization unit is used for initializing gene disease data set population and randomly initializing gene characteristics through a random grouping algorithm to be divided into four groups;
the agent instance set setting unit is used for deleting part of instances in the training set by using an agent instance algorithm to generate an agent instance set;
the optimal gene characteristic determining unit is used for carrying out Tmax iteration on each group of gene characteristics through a backbone particle swarm algorithm, wherein the Tmax iteration is divided into two stages, the function adaptive values of the particles are evaluated through a proxy example set and an original example set respectively, the first stage adopts a grouping algorithm to enable data to be subjected to dimensionality reduction, the optimization speed is accelerated, the optimization direction is found in global search, the second stage utilizes a local search algorithm to enable the population to find the optimal solution after finding the optimization direction, and the optimal solution is the gene characteristic with the best effect;
and the data output unit is used for outputting the gene characteristics with the best effect.
In a third aspect, the present invention also discloses a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method as described above.
It is understood that the system provided by the embodiment of the present invention corresponds to the method provided by the embodiment of the present invention, and the explanation, the example and the beneficial effects of the related contents can refer to the corresponding parts in the method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A gene data feature selection method based on backbone particle swarm optimization is characterized in that: based on the gene disease data set, the following steps are carried out by a computer device,
s1, initializing gene disease data set populations, and randomly initializing gene characteristics through a random grouping algorithm to divide the gene characteristics into four groups;
s2, deleting part of examples in the training set by using an agent example algorithm to generate an agent example set;
s3, performing Tmax iteration by backbone particle swarm optimization aiming at each group of gene features, wherein in the Tmax iteration, the Tmax iteration is divided into two stages, and the function adaptive values of the particles are evaluated by a proxy example set and an original example set respectively, the first stage adopts a grouping algorithm to enable data to be subjected to dimensionality reduction, the optimization speed is accelerated, the optimization direction is found in global search, the second stage adopts a local search algorithm to enable the population to find the optimal solution after finding the optimization direction, and the optimal solution is the gene feature with the best effect;
s4, outputting the gene characteristics with the best effect.
2. The method for selecting gene data features based on backbone particle swarm optimization according to claim 1, wherein: the proxy instance algorithm step in the S2 comprises:
firstly, the algorithm deletes noise instances from the training set: if an instance is misclassified by its k nearest neighbors, it is deleted as a noise instance;
then, for each remaining instance, an "enemy" distance is calculated, i.e. the distance from the instance to its nearest instance of a different class; the larger the "enemy" distance, the farther the instance is from the class boundary, and the instances are sorted by "enemy" distance so that those with larger distances are deleted first;
finally, a nearest-neighbor list and an association list are built for each instance; if deleting an instance does not affect the classification of the remaining instances in S, the instance is deleted; when an instance P is deleted, each instance A in P's association list deletes P from its neighbor list and finds a new neighbor so that it still has k neighbors; when A finds a new neighbor N, A is also added to N's association list;
the S remaining after deletion is the proxy instance data set.
3. The method for selecting gene data features based on backbone particle swarm optimization according to claim 2, wherein: the pseudo code of the proxy instance algorithm is as follows:
parameters: T (the training set)
1)S=T
2) FOR each instance P in S
(1) If the class label given by P's k nearest neighbors disagrees with P's original label, P is considered a noise instance
(2) Delete P from S
3) ENDFOR
4) FOR each instance P in S
(1) Find the k+1 nearest neighbors N1...k+1 of P
(2) Add P to the association list of each of N1...k+1
5) ENDFOR
6) Calculate the "enemy" distance of each instance
7) FOR each instance P in S
(1) with = the number of instances in P's association list classified correctly if P is kept
(2) without = the number of instances in P's association list classified correctly if P is deleted
(3) IF without >= with
① Remove P from S
② FOR each instance A in P's association list
1) Delete P from A's neighbor list
2) Re-find A's neighbors
3) Add A to the association list of the new neighbor
③ ENDFOR
④ FOR each neighbor N of P
1) Delete P from N's association list
⑤ ENDFOR
(4) ENDIF
8) ENDFOR
9) Return S.
4. The method for selecting gene data features based on backbone particle swarm optimization according to claim 1, wherein: the random grouping algorithm in S1 includes the following steps,
initializing, randomly numbering the features, and randomly splitting the features into four groups;
in the iterative process, if the population falls into a local optimal situation, the group splitting is carried out, and the new group number is 2 times the original group number.
5. The method for selecting gene data features based on backbone particle swarm optimization according to claim 1, wherein: the first stage in S3 includes the following steps:
updating the historical optimal position of each particle in the population and the global optimal position of the population;
updating the position of each particle;
evaluating a function fitness value of the particle using the set of proxy instances;
judging whether the population is likely to fall into the local optimum or not, and if the population falls into the local optimum, performing group splitting;
judging whether the group splitting is finished, if so, entering a second stage, and otherwise, repeating the first stage;
the second stage comprises the following specific steps:
updating the historical optimal position of each particle in the population and the global optimal position of the population;
performing a more refined search using a local search algorithm;
updating the position of each particle;
evaluating a function fitness value of the particle using the original instance set;
and judging whether the iteration times reach Tmax times, and if so, ending the iteration.
6. The method for selecting gene data features based on backbone particle swarm optimization according to claim 5, wherein: the step of the local search algorithm in S3 includes,
suppose Sbest is the set of all features selected by previous gbests, and every feature has a score; the local search constructs p/10 local candidate solutions, where p is the current population size, and each local candidate is smaller than the subset selected by the current gbest;
features are selected from Sbest by tournament selection, and features with higher scores have a higher probability of being selected, so features that appeared more often in past gbests appear more frequently; repeated features may be drawn, so a local candidate may be smaller than the current gbest;
once the p/10 local candidates are selected, they are evaluated with the proxy instance set, so the best local candidate feature subset can be found quickly; finally, the best candidate subset is compared with the current gbest, both evaluated on the complete training set, and the better feature subset is set as gbest.
7. The method for selecting gene data features based on backbone particle swarm optimization according to claim 6, wherein: the pseudo code of the local search algorithm is as follows,
parameters: Sbest (feature set of previous gbests), |gbest| (number of features selected by the current gbest)
i. FOR i = 1 : p/10
(1) Select |gbest| features by tournament to form subset Pi
(2) Evaluate subset Pi with the proxy instance set
ii. ENDFOR
iii. From P1 to Pp/10, select the best-performing subset Pk
iv. Compare Pk with the current gbest, both evaluated on the original training set
v. Put the features selected by gbest into Sbest
vi. Set the superior one as gbest
vii. Return gbest.
8. The method for selecting gene data features based on backbone particle swarm optimization according to claim 7, wherein: the local search algorithm further comprises calculating the score of each feature appearing in Sbest, specifically as follows,
the scores are defined as follows:
score_f = freq_f + gc_f

wherein:

freq_f = the number of times feature f has appeared in past gbests (its occurrence count in Sbest)

gc_f = 1/|gbest| if feature f appears in the current gbest, and 0 otherwise
the first part of the score is the frequency, measured as the number of occurrences of the feature in past gbests; more occurrences indicate a higher-quality feature;
the second part compares against the current gbest: a feature receives a bonus of 1/|gbest| if it appears in the current gbest.
9. A gene data feature selection device based on backbone particle swarm optimization, characterized in that: it comprises the following units,
the initialization unit is used for initializing gene disease data set population and randomly initializing gene characteristics through a random grouping algorithm to be divided into four groups;
the agent instance set setting unit is used for deleting part of instances in the training set by using an agent instance algorithm to generate an agent instance set;
the optimal gene characteristic determining unit is used for carrying out Tmax iteration on each group of gene characteristics through a backbone particle swarm algorithm, wherein the Tmax iteration is divided into two stages, the function adaptive values of the particles are evaluated through a proxy example set and an original example set respectively, the first stage adopts a grouping algorithm to enable data to be subjected to dimensionality reduction, the optimization speed is accelerated, the optimization direction is found in global search, the second stage utilizes a local search algorithm to enable the population to find the optimal solution after finding the optimization direction, and the optimal solution is the gene characteristic with the best effect;
and the data output unit is used for outputting the gene characteristics with the best effect.
10. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 8.
CN202110858994.0A 2021-07-28 2021-07-28 Gene data characteristic selection method and device based on backbone particle swarm algorithm Active CN113571134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110858994.0A CN113571134B (en) 2021-07-28 2021-07-28 Gene data characteristic selection method and device based on backbone particle swarm algorithm


Publications (2)

Publication Number Publication Date
CN113571134A true CN113571134A (en) 2021-10-29
CN113571134B CN113571134B (en) 2024-07-02

Family

ID=78168576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110858994.0A Active CN113571134B (en) 2021-07-28 2021-07-28 Gene data characteristic selection method and device based on backbone particle swarm algorithm

Country Status (1)

Country Link
CN (1) CN113571134B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070208677A1 (en) * 2006-01-31 2007-09-06 The Board Of Trustees Of The University Of Illinois Adaptive optimization methods
US20080307399A1 (en) * 2007-06-05 2008-12-11 Motorola, Inc. Gene expression programming based on hidden markov models
US20100040281A1 (en) * 2008-08-12 2010-02-18 Halliburton Energy Services, Inc. Systems and Methods Employing Cooperative Optimization-Based Dimensionality Reduction
US8499001B1 (en) * 2009-11-25 2013-07-30 Quest Software, Inc. Systems and methods for index selection in collections of data
CN103942571A (en) * 2014-03-04 2014-07-23 西安电子科技大学 Graphic image sorting method based on genetic programming algorithm
US8793200B1 (en) * 2009-09-22 2014-07-29 Hrl Laboratories, Llc Method for particle swarm optimization with random walk
US20140257767A1 (en) * 2013-03-09 2014-09-11 Bigwood Technology, Inc. PSO-Guided Trust-Tech Methods for Global Unconstrained Optimization
CN108595499A (en) * 2018-03-18 2018-09-28 西安财经学院 A kind of population cluster High dimensional data analysis method of clone's optimization
CN110188785A (en) * 2019-03-28 2019-08-30 山东浪潮云信息技术有限公司 A kind of data clusters analysis method based on genetic algorithm
CN110334869A (en) * 2019-08-15 2019-10-15 重庆大学 A kind of mangrove forest ecological health forecast training method based on dynamic colony optimization algorithm
CN110674860A (en) * 2019-09-19 2020-01-10 南京邮电大学 Feature selection method based on neighborhood search strategy, storage medium and terminal
CN111723897A (en) * 2020-05-13 2020-09-29 广东工业大学 Multi-modal feature selection method based on particle swarm optimization
CN112070125A (en) * 2020-08-19 2020-12-11 西安理工大学 Prediction method of unbalanced data set based on isolated forest learning
CN112116952A (en) * 2020-08-06 2020-12-22 温州大学 Gene selection method of wolf optimization algorithm based on diffusion and chaotic local search
WO2021022637A1 (en) * 2019-08-06 2021-02-11 南京赛沃夫海洋科技有限公司 Unmanned surface vehicle path planning method and system based on improved genetic algorithm
CN112926837A (en) * 2021-02-04 2021-06-08 郑州轻工业大学 Method for solving job shop scheduling problem based on data-driven improved genetic algorithm
CN113011076A (en) * 2021-03-29 2021-06-22 西安理工大学 Efficient particle swarm optimization method based on RBF proxy model


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
邓秀勤; 李文洲; 武继刚; 刘太亨: "A hybrid feature selection algorithm combining the Shapley value and particle swarm optimization", Journal of Computer Applications, no. 05 *
雷阳; 李树荣; 张强; 张晓东: "A hybrid genetic algorithm based on bare-bones particle swarm and its application", Computer Engineering and Applications, no. 36 *

Also Published As

Publication number Publication date
CN113571134B (en) 2024-07-02

Similar Documents

Publication Publication Date Title
Banerjee et al. Evolutionary rough feature selection in gene expression data
Alomari et al. A hybrid filter-wrapper gene selection method for cancer classification
CN111368891B (en) K-Means text classification method based on immune clone gray wolf optimization algorithm
CN108681660A (en) A kind of non-coding RNA based on association rule mining and disease relationship prediction technique
CN107169983B (en) Multi-threshold image segmentation method based on cross variation artificial fish swarm algorithm
Nagpal et al. A feature selection algorithm based on qualitative mutual information for cancer microarray data
İni̇k et al. MODE-CNN: A fast converging multi-objective optimization algorithm for CNN-based models
Saha et al. Simultaneous feature selection and symmetry based clustering using multiobjective framework
CN116842374A (en) Preprocessing method and device for complex multi-label medical data containing interrelationships
CN113283573A (en) Automatic search method for optimal structure of convolutional neural network
Kulikov et al. Machine learning can be as good as maximum likelihood when reconstructing phylogenetic trees and determining the best evolutionary model on four taxon alignments
CN111782904B (en) Unbalanced data set processing method and system based on improved SMOTE algorithm
CN113571134A (en) Method and device for selecting gene data characteristics based on backbone particle swarm optimization
CN111488903A (en) Decision tree feature selection method based on feature weight
CN113780334B (en) High-dimensional data classification method based on two-stage mixed feature selection
Al-Baity et al. A New Optimized Wrapper Gene Selection Method for Breast Cancer Prediction.
CN115019898A (en) Eutectic prediction method based on deep forest
Liang et al. ASE: Anomaly Scoring Based Ensemble Learning for Imbalanced Datasets
Majeed et al. A comparison between the performance of features selection techniques: survey study
CN112308160A (en) K-means clustering artificial intelligence optimization algorithm
Wali et al. m-CALP–Yet another way of generating handwritten data through evolution for pattern recognition
Nematzadeh et al. Distance-based mutual congestion feature selection with genetic algorithm for high-dimensional medical datasets
Anaraki et al. A Fuzzy-Rough Feature Selection Based on Binary Shuffled Frog Leaping Algorithm
Tran Improving the performance of imputation methods for gene expression classification using feature selection
Kuzudisli et al. RCE-IFE: Recursive Cluster Elimination with Intra-cluster Feature Elimination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant