CN113571134A - Method and device for selecting gene data characteristics based on backbone particle swarm optimization - Google Patents


Info

Publication number
CN113571134A
CN113571134A
Authority
CN
China
Prior art keywords
gene
algorithm
gbest
instance
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110858994.0A
Other languages
Chinese (zh)
Other versions
CN113571134B (en)
Inventor
许镇义
潘凯
程凡
康宇
曹洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Original Assignee
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Artificial Intelligence of Hefei Comprehensive National Science Center filed Critical Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority to CN202110858994.0A priority Critical patent/CN113571134B/en
Publication of CN113571134A publication Critical patent/CN113571134A/en
Application granted granted Critical
Publication of CN113571134B publication Critical patent/CN113571134B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00: ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20: Screening of libraries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • G06F18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147: Distances to closest patterns, e.g. nearest neighbour classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/12: Computing arrangements based on biological models using genetic models
    • G06N3/126: Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The invention relates to a method and a device for selecting gene data features based on the backbone particle swarm algorithm. Starting from a gene disease data set, the population is initialized and the gene features are randomly divided into four groups by a random grouping algorithm; a proxy instance algorithm then deletes part of the instances in the training set to generate a proxy instance set. Each group of gene features undergoes Tmax iterations of the backbone particle swarm algorithm, divided into two stages in which the fitness values of the particles are evaluated on the proxy instance set and on the original instance set respectively. The first stage uses the grouping algorithm to reduce the dimensionality of the data, accelerating optimization and finding the optimization direction in a global search; the second stage uses a local search algorithm so that, once the optimization direction is found, the population converges to the optimal solution, i.e. the gene features with the best effect, which are then output. The invention improves the classification accuracy of gene data and achieves good results on multiple disease gene data sets.

Description

Method and device for selecting gene data characteristics based on backbone particle swarm optimization
Technical Field
The invention relates to the technical field of large-scale feature selection, in particular to a method and a device for selecting gene data features based on a backbone particle swarm algorithm.
Background
Data mining and machine learning typically involve a large number of features, but not all of them are essential: many are redundant or even irrelevant, which can degrade the performance of the algorithm. Feature selection aims to solve this problem by selecting a subset from the original feature set. Feature selection is nevertheless a challenging task, mainly because the search space grows exponentially with the number of dimensions; evolutionary algorithms are well known for their global search capability. For large-scale feature selection, the main challenges are that the large number of instances makes the evaluation stage computationally expensive, and that the high dimensionality lowers classification accuracy.
However, gene expression profile data is characterized by small sample sizes, high dimensionality, high noise and high redundancy, which makes it very difficult to deeply and accurately mine the biomedical knowledge contained in gene expression profiles and to select tumor-informative genes. Gene expression profile data contains the expression levels of all measurable genes in tissue cells, but only a small number of genes are actually associated with the sample class.
Disclosure of Invention
The invention provides a gene data feature selection method based on the backbone particle swarm algorithm, which can solve the above technical problems. It takes the backbone particle swarm algorithm as a framework and combines it with a proxy instance algorithm, a random grouping algorithm and a local search algorithm.
In order to achieve the purpose, the invention adopts the following technical scheme:
a gene data feature selection method based on a backbone particle swarm algorithm comprises the following steps:
based on the gene disease data set, the following steps are carried out by a computer device,
s1, initializing gene disease data set populations, and randomly initializing gene characteristics through a random grouping algorithm to divide the gene characteristics into four groups;
s2, deleting part of examples in the training set by using an agent example algorithm to generate an agent example set;
S3, performing Tmax iterations of the backbone particle swarm algorithm for each group of gene features; the Tmax iterations are divided into two stages, in which the fitness values of the particles are evaluated on the proxy instance set and on the original instance set respectively; the first stage uses the grouping algorithm to reduce the dimensionality of the data, accelerating optimization and finding the optimization direction in a global search; the second stage uses a local search algorithm so that the population, having found the optimization direction, finds the optimal solution, i.e. the gene features with the best effect;
s4, outputting the gene characteristics with the best effect.
Further, the step of the proxy instance algorithm in S2 includes:
firstly, the algorithm deletes noise instances from the training set: if an instance is misclassified by its k nearest neighbors, it is treated as a noise instance and deleted;
then, for each remaining instance, the "enemy" distance is calculated, i.e. the distance from the instance to its closest instance of a different class; the larger the "enemy" distance, the farther the instance is from the class boundary. The instances are sorted by "enemy" distance, and those with larger "enemy" distances are deleted first;
finally, a nearest-neighbor list and an association list are built for each instance; if deleting an instance does not affect the classification of the remaining instances in S, the instance is deleted. When an instance P is deleted, P is removed from the neighbor lists of the instances in its association list, and each of those instances then finds a new neighbor so that it still has k neighbors in its list; when an instance finds a new neighbor N, it is also added to the association list of N;
the reduced S obtained after these deletions is the proxy instance data set.
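The "enemy" distance used to order the deletions can be sketched in Python as follows; this is an illustrative sketch, and the function names are ours rather than the patent's:

```python
import numpy as np

def enemy_distances(X, y):
    """For each instance, the distance to its nearest "enemy",
    i.e. the closest instance belonging to a different class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dists = np.full(len(X), np.inf)
    for i in range(len(X)):
        mask = y != y[i]                      # instances of a different class
        if mask.any():
            diffs = X[mask] - X[i]
            dists[i] = np.sqrt((diffs ** 2).sum(axis=1)).min()
    return dists

def deletion_order(X, y):
    """Instances with a larger enemy distance lie farther from the class
    boundary and are therefore deleted first."""
    return np.argsort(-enemy_distances(X, y))
```
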
Further, the pseudo code of the proxy instance algorithm is as follows:
parameters are as follows: t training set
10)S=T
11) Each instance P in FOR S
(1) If the k neighbor class labels of P are not consistent with the original labels of P, it is considered as a noise example
(2) Deleting P from S
12)ENDFOR
13) Each instance P in FOR S
(1) Find the k+1 nearest neighbors N1, ..., Nk+1 of P
(2) Add P to the association list of each of N1, ..., Nk+1
14)ENDFOR
15) Calculate the "enemy" distance for each instance
16) Each instance P in FOR S
(1) with = the number of instances in the association list of P that are classified correctly if P is kept
(2) without = the number of instances in the association list of P that are classified correctly if P is deleted
(3)IF without>=with
Removing P from S
② A in the FOR P association list
1) Delete P from neighbor list in A
2) Re-finding A's neighbors
3) Add A to the association list of the new neighbor
③ENDFOR
N neighbors in FOR P
1) Deleting P from the association list of N
⑤ENDFOR
(4)ENDIF
17)ENDFOR
18) And returning to the S.
Further, the random grouping algorithm in S1 includes the following steps,
initializing, randomly numbering the features, and randomly splitting the features into four groups;
during the iterative process, if the population falls into a local optimum, the groups are split, and the new number of groups is twice the original number.
Further, in the first stage of S3, initially only a rough direction needs to be found in the search space, so the proxy instance set replaces the original evaluation function; if the population falls into a local optimum, the groups are split and exploration continues; once the groups have been split to the minimum, i.e. the number of groups equals the data dimensionality, the second stage begins;
in the second stage, the original data set in the training set is used for evaluation together with a local search algorithm; all gbest information from this stage is stored and used to improve the current gbest.
Further, the step of the local search algorithm in S3 includes,
suppose S_best is the set of all features selected by previous gbests, where every feature has a score; the local search constructs p/10 local candidate solutions, where p is the current population size, and each local candidate is no larger than the subset selected by the current gbest;
features are drawn from S_best by tournament selection, so features with higher scores are more likely to be chosen; features that appeared more often in past gbests are therefore selected more frequently, and duplicate features may be drawn, so a local candidate may be smaller than the current gbest;
once the p/10 local candidates have been selected, they are evaluated on the proxy instance set so that the best local candidate feature subset can be found quickly; finally, the best candidate feature subset is compared with the current gbest, both are evaluated on the complete training set, and the better feature subset is set as gbest.
Further, the pseudo code of the local search algorithm is as follows,
parameters are as follows: sbestFeature set of previous gbest | features selected by current gbest
i.FOR i=1:p/10
(1) Selecting | gbest | feature composition subset P by tournamenti
(2) Evaluating the subset P with a proxy instancei
ii.ENDFOR
From P1To Pp/10Selecting a subset P of the top showk
iv. mixing PkCompared with the evaluation of the current gbest based on the original training set
v. put the features selected by gbest into Sbest
Set the superior party to gbest
Return gbest.
Further, the local search algorithm also includes calculating the score of each feature appearing in S_best; the score is defined as follows:
score_f = freq_f + gc_f
wherein:
freq_f = the number of times feature f has appeared in past gbests
gc_f = 1/|gbest| if feature f appears in the current gbest, and 0 otherwise
the first part of the score is frequency, measured as the number of occurrences of the feature in the past gbest, if more occurrences, representing the better quality of the feature;
the second part is to set the feature to 1/| gbest |, if it appears in the current gbest, as compared to the current gbest.
In another aspect, the invention also discloses a gene data feature selection system based on the backbone particle swarm algorithm, comprising the following units:
the initialization unit is used for initializing gene disease data set population and randomly initializing gene characteristics through a random grouping algorithm to be divided into four groups;
the agent instance set setting unit is used for deleting part of instances in the training set by using an agent instance algorithm to generate an agent instance set;
the optimal gene feature determining unit is used for performing Tmax iterations of the backbone particle swarm algorithm on each group of gene features; the Tmax iterations are divided into two stages, in which the fitness values of the particles are evaluated on the proxy instance set and on the original instance set respectively; the first stage uses the grouping algorithm to reduce the dimensionality of the data, accelerating optimization and finding the optimization direction in a global search; the second stage uses a local search algorithm so that the population, having found the optimization direction, finds the optimal solution, i.e. the gene features with the best effect;
and the data output unit is used for outputting the gene characteristics with the best effect.
In a third aspect, the present invention also discloses a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method as described above.
According to the above technical scheme, and in view of the drawbacks of current characteristic gene selection algorithms such as long training time and low classification accuracy, the invention designs an efficient characteristic gene selection and classification algorithm that selects fewer characteristic genes while maintaining or even improving the classification accuracy of the associated model. The method is based on backbone particle swarm optimization (BBPSO) and is optimized in two respects: the evaluation function and the search mechanism. For the evaluation function, the first stage uses the proxy instance algorithm to delete part of the instances, reducing the time of the KNN evaluation. For the search mechanism, the first stage randomly splits the features into four groups for iterative evolution; whenever the population falls into a local optimum, the groups are split further and iteration continues, until they can be split no more, at which point the second stage begins. This reduces computational cost early in the search and quickly finds the optimization direction of the particles, so that the population can gradually approach the target in the second stage. The algorithm of the invention achieves good results on multiple disease gene data sets.
With the backbone particle swarm algorithm as a framework, the invention randomly groups the dimensions to reduce the dimensionality, increasing classification accuracy and reducing the time spent by the search mechanism, while deleting part of the instances to reduce the time of the evaluation stage. Experimental results on gene disease data sets of 500 to 10000 dimensions show that the proposed algorithm improves the classification accuracy of gene data, indicating that pathogenic genes are effectively selected.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention.
As shown in fig. 1, the method for selecting gene data features based on the backbone particle swarm algorithm described in this embodiment first removes part of the instances via the proxy instance algorithm to form a proxy instance set, reducing evaluation time. In the first stage, the whole population is still performing a large-scale global search and needs to find a rough optimization direction, so evolutionary computation is carried out on the dimension-reduced data and the proxy instance set is used for evaluation. If the population stays in a local optimum for some time, the groups are split until they can be split no further, and the second stage begins. In the second stage, once the population has found the optimization direction, a more detailed search is carried out: during the evolutionary computation, the information of every particle that has ever been the global best (gbest) is stored and used to guide the current gbest. The original data set is used for evaluation at this stage.
Specifically, the embodiment of the invention comprises the following steps:
s1, initializing gene disease data set populations, and randomly initializing gene characteristics through a random grouping algorithm to divide the gene characteristics into four groups;
s2, deleting part of examples in the training set by using an agent example algorithm to generate an agent example set;
S3, performing Tmax iterations of the backbone particle swarm algorithm for each group of gene features; the Tmax iterations are divided into two stages, in which the fitness values of the particles are evaluated on the proxy instance set and on the original instance set respectively; the first stage uses the grouping algorithm to reduce the dimensionality of the data, accelerating optimization and finding the optimization direction in a global search; the second stage uses a local search algorithm so that the population, having found the optimization direction, finds the optimal solution, i.e. the gene features with the best effect;
s4, outputting the gene characteristics with the best effect.
The following are specifically described:
first, the following proxy instance algorithm is introduced
In most feature selection classification algorithms, the K-NN algorithm is used as the classifier. However, because the distance between every pair of instances must be calculated and all training instances must be stored, the memory requirement is large, and all instances must be searched to classify an input vector, so classification is slow. In addition, noise instances reduce generalization accuracy.
In the embodiment of the invention, an instance selection algorithm selects a few representative instances to form a new instance set, called the proxy training set. The proxy training set preserves the information in the original training set, and when computing fitness it can be used to estimate the fitness value, reducing computation time.
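As a sketch of how the proxy training set plugs into fitness evaluation, the following hedged example (function name and signature are ours, not the patent's) scores a candidate feature mask with k-NN; in the first stage the proxy instance set would be passed as the training data, and in the second stage the full training set:

```python
import numpy as np

def knn_fitness(train_X, train_y, test_X, test_y, mask, k=3):
    """k-NN accuracy on (test_X, test_y) using only the feature columns
    selected by the 0/1 mask; the caller chooses whether train_X/train_y
    is the proxy instance set (stage 1) or the full training set (stage 2)."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0                               # empty subset: worthless
    A = np.asarray(train_X, dtype=float)[:, cols]
    B = np.asarray(test_X, dtype=float)[:, cols]
    ty = np.asarray(train_y)
    correct = 0
    for i, b in enumerate(B):
        d = np.sqrt(((A - b) ** 2).sum(axis=1))  # distances to training set
        nn = np.argsort(d)[:k]                   # k nearest neighbours
        correct += np.bincount(ty[nn]).argmax() == test_y[i]
    return correct / len(B)
```
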
The whole proxy instance algorithm works as follows. First, noise instances are removed, i.e. instances misclassified by their k nearest neighbors; this eliminates noise and also removes some boundary instances, which smooths the decision boundary and helps avoid overfitting the data. Next, central instances are deleted, mainly because removing a central instance affects classification performance less than removing a boundary instance; for each instance, if deleting it does not degrade the correct classification of its neighbor instances, the instance is deleted. The specific pseudo code is as follows:
Proxy instance algorithm (correct_Distance):
parameters are as follows: t training set
19)S=T
20) Each instance P in FOR S
(1) If the k neighbor class labels of P are not consistent with the original labels of P, it is considered as a noise example
(2) Deleting P from S
21)ENDFOR
22) Each instance P in FOR S
(1) Find the k+1 nearest neighbors N1, ..., Nk+1 of P
(2) Add P to the association list of each of N1, ..., Nk+1
23)ENDFOR
24) Calculate the "enemy" distance for each instance
25) Each instance P in FOR S
(1) with = the number of instances in the association list of P that are classified correctly if P is kept
(2) without = the number of instances in the association list of P that are classified correctly if P is deleted
(3)IF without>=with
Removing P from S
② A in the FOR P association list
1) Delete P from neighbor list in A
2) Re-finding A's neighbors
3) Add A to the association list of the new neighbor
③ENDFOR
N neighbors in FOR P
1) Deleting P from the association list of N
⑤ENDFOR
(4)ENDIF
26)ENDFOR
27) And returning to the S.
First, the algorithm deletes the noise instances: if an instance is misclassified by its k nearest neighbors, it is treated as a noise instance and deleted. Then, for each remaining instance, the "enemy" distance is calculated, i.e. the distance to the closest instance of a different class; the larger the "enemy" distance, the farther the instance is from the class boundary. The instances are sorted by "enemy" distance, and those with larger "enemy" distances are deleted first. Finally, a nearest-neighbor list and an association list are built for each instance. An instance is deleted if its deletion does not affect the classification of the remaining instances in S. When an instance P is deleted, P is removed from the neighbor lists of the instances in its association list, and each of those instances then finds a new neighbor so that it still has k neighbors in its list; when such an instance finds a new neighbor N, it is also added to the association list of N. The reduced S obtained after these deletions is the proxy instance data set.
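The noise-removal first pass can be sketched in Python as follows; this is illustrative only, and the helper name and majority-vote tie-breaking are our assumptions:

```python
import numpy as np

def remove_noise(X, y, k=3):
    """Drop every instance whose k nearest neighbours (excluding itself)
    vote for a class different from its own label: the first pass of
    the proxy instance algorithm."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = []
    for i in range(len(X)):
        d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        d[i] = np.inf                       # exclude the instance itself
        nn = np.argsort(d)[:k]              # its k nearest neighbours
        votes = np.bincount(y[nn], minlength=int(y.max()) + 1)
        if votes.argmax() == y[i]:          # neighbours agree: not noise
            keep.append(i)
    return X[keep], y[keep], keep
```
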
In PSO, the population explores the entire search space to locate promising regions. Early in the iteration, the population can estimate where the global optimal solution is likely to be, so it can be evaluated on the surrogate (proxy) instance data set. In the later stages of the iteration, the entire training set is used.
2.1 Random grouping
This subsection describes the optimization of the algorithm at the search stage. The search space for feature selection grows exponentially with the number of features; when feature selection is applied to high-dimensional data, very large memory and computation time are usually required, and the complex search space of high-dimensional data also poses a great challenge to conventional PSO.
The random grouping algorithm rules are as follows:
initializing, randomly numbering the features, and randomly splitting the features into four groups;
in the iterative process, if the population falls into a local optimal situation, the group splitting is carried out, and the new group number is 2 times the original group number.
For example, the following table:
[Table: a 16-dimensional 0/1 feature subset; after random shuffling the feature indexes are split into 4 groups, e.g. features 1, 5, 9 and 13 form one group whose shared bit is 1, and features 3, 7, 11 and 15 form another group whose shared bit is 0]
In the first row, a 0 means the feature subset does not select that feature, and a 1 means it does. After random grouping, the indexes of the dimensions are shuffled; assuming the features are initially divided into 4 groups, features 1, 5, 9 and 13 are all set to 1, and similarly features 3, 7, 11 and 15 are all set to 0, so the original 16-dimensional feature selection problem is reduced to a 4-dimensional one.
Through random grouping, the originally expensive search space is reduced to a lower-dimensional feature selection problem; the direction of the population's evolution can be found early, saving a large amount of search time in the early stage. Once the population falls into a local optimum, the groups are split: as described above, once gbest has not been updated for p generations, the whole population is considered to have possibly fallen into a local optimum, and the 4 groups are split into 8:
[Table: the same 16 features after splitting into 8 groups of two features each]
Splitting into 8 groups turns the problem into an 8-dimensional feature selection problem. If complementary features end up in the same group, good results can be achieved. In addition, random grouping greatly alleviates the premature convergence problem of PSO.
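The grouping rules above admit a compact sketch; the helper names are hypothetical, since the patent specifies only the rules, not code:

```python
import random

def init_groups(n_features, n_groups=4, seed=0):
    """Randomly shuffle the feature indexes and split them into
    n_groups equal-sized groups (rule 1 above)."""
    rng = random.Random(seed)
    idx = list(range(n_features))
    rng.shuffle(idx)
    size = n_features // n_groups
    return [idx[g * size:(g + 1) * size] for g in range(n_groups)]

def expand(group_bits, groups, n_features):
    """Blow a low-dimensional group bit-vector up into the full feature
    mask: every feature in group g receives the bit chosen for g."""
    mask = [0] * n_features
    for bit, members in zip(group_bits, groups):
        for f in members:
            mask[f] = bit
    return mask

def split_groups(groups, seed=0):
    """Rule 2: on stagnation each group splits in two, doubling the
    number of groups (4 to 8, 8 to 16, down to one feature per group)."""
    rng = random.Random(seed)
    new = []
    for g in groups:
        g = list(g)
        rng.shuffle(g)
        new += [g[:len(g) // 2], g[len(g) // 2:]]
    return new
```

So a 16-feature problem searched with 4 groups is a 4-bit problem whose bit-vector `expand`s back to a full 16-feature mask, exactly as in the table above.
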
PSO can quickly detect promising regions, but its local search capability is not strong, so a local search algorithm is proposed here.
2.2 Local search
The main idea of the PSO algorithm is to guide the particles with the current pbest and gbest. However, previous gbests may also contain useful information: some very good features may not appear in the current gbest at the same time, and such features may be complementary. The local search idea proposed here is therefore mainly to keep the information of previous gbests and use it to improve the current gbest.
Suppose S_best is the set of all features previously selected by gbests, where every feature has a score. The local search constructs p/10 local candidate solutions, where p is the current population size, and each local candidate is no larger than the subset selected by the current gbest. Features are drawn from S_best by tournament selection, so features with higher scores are more likely to be chosen; features that appeared more often in past gbests are therefore selected more frequently, and duplicate features may be drawn, so a local candidate may be smaller than the current gbest. Once the p/10 local candidates have been selected, they are evaluated on the proxy instance set so that the best local candidate feature subset can be found quickly. Finally, the best candidate feature subset is compared with the current gbest, both are evaluated on the complete training set, and the better feature subset is set as gbest. The pseudo code is as follows:
Algorithm 4: local search algorithm (local_search)
Parameters: S_best: feature set of previous gbests; |gbest|: the number of features selected by the current gbest
i. FOR i = 1 : p/10
(1) Select |gbest| features by tournament to compose subset P_i
(2) Evaluate subset P_i with the proxy instance set
ii. ENDFOR
iii. From P_1 to P_{p/10}, select the best-performing subset P_k
iv. Compare P_k with the current gbest, evaluating both on the original training set
v. Put the features selected by gbest into S_best
vi. Set the better of the two as gbest
vii. Return gbest.
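A runnable sketch of the local-search pseudocode above; the fitness callables, the size-2 tournament and the tie-breaking are caller-supplied assumptions, not fixed by the patent:

```python
import random

def local_search(gbest, scores, fitness_proxy, fitness_full, p, rng=random):
    """Build p//10 candidate subsets by tournament over the scored
    features of past gbests, rank them with the cheap proxy fitness,
    then confirm the winner against the current gbest on the full set."""
    feats = list(scores)
    best_cand, best_fit = set(), -1.0
    for _ in range(max(1, p // 10)):
        cand = set()
        for _ in range(len(gbest)):              # draw at most |gbest| features
            a, b = rng.choice(feats), rng.choice(feats)
            cand.add(a if scores[a] >= scores[b] else b)  # size-2 tournament
        fit = fitness_proxy(cand)                # cheap proxy evaluation
        if fit > best_fit:
            best_cand, best_fit = cand, fit
    # final comparison on the original (full) training set
    if fitness_full(best_cand) > fitness_full(set(gbest)):
        return best_cand
    return set(gbest)
```

Because `cand` is a set, duplicate tournament winners collapse, which is exactly why a candidate can end up smaller than |gbest|.
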
The next question is how to calculate the score of each feature appearing in S_best. The main idea is that the importance of a feature is determined by the frequency of its occurrence in past gbests and by whether it is present in the current gbest. The score is defined as follows:
score_f = freq_f + gc_f
wherein:
freq_f = the number of times feature f has appeared in past gbests
gc_f = 1/|gbest| if feature f appears in the current gbest, and 0 otherwise
the first part of the score is the frequency, measured as the number of occurrences of the feature in the past gbest, which if more occurrences, indicates better quality of the feature. The second part is to set the feature to 1/| gbest |, if it appears in the current gbest, as compared to the current gbest. This allows the feature that has appeared at the current gbest to be preferentially selected if the feature appears the same number of times, so that the last subset of features that is formed does not differ much from the current gbest.
The local search thus generates a new subset from the features of the best subsets of previous iterations and lets it compete with the current gbest. Moreover, the feature subsets produced by local search are all smaller than |gbest|, which puts the emphasis on removing features and makes it more likely that redundant features are discarded. This benefits the direction of the search and also serves another goal of feature selection: a small feature subset.
Algorithm framework
The flow of the whole algorithm is shown in fig. 1. First, the population is initialized and the features are randomly grouped; then the proxy instance algorithm deletes part of the instances in the training set to generate the proxy instance set. Tmax iterations are then performed, divided into two stages. In the first stage, only a rough search direction has to be found, so the proxy instance set replaces the original evaluation function. If the population appears trapped in a local optimum, for example when gbest has not improved for 3 iterations, the subgroups are split and exploration continues; once the subgroups have been split to the minimum, i.e. the number of subgroups equals the data dimension, the second stage begins. In the second stage, evaluation uses the original training set and local search is applied; all gbest information from this stage is stored and used to improve the current gbest.
Specifically, the first stage comprises the following specific steps:
updating the historical optimal position of each particle in the population and the global optimal position of the population;
updating the position of each particle;
evaluating a function fitness value of the particle using the set of proxy instances;
judging whether the population is likely to fall into the local optimum or not, and if the population falls into the local optimum, performing group splitting;
judging whether the group splitting is finished, if so, entering a second stage, and otherwise, repeating the first stage;
the second stage comprises the following specific steps:
updating the historical optimal position of each particle in the population and the global optimal position of the population;
performing a more refined search using a local search algorithm;
updating the position of each particle;
evaluating a function fitness value of the particle using the original instance set;
and judging whether the iteration times reach Tmax times, and if so, ending the iteration.
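The two-stage flow above can be condensed into a skeleton. This is an illustrative sketch only: `step`, `stuck`, `local_search`, and the two evaluators are placeholders for the BBPSO position update, the stagnation test, Algorithm 4, and the proxy/full fitness functions; lower fitness is better.

```python
def run_two_stage(tmax, dim, groups, evaluate_proxy, evaluate_full,
                  step, stuck, local_search):
    """Skeleton of the two-stage loop: stage 1 evaluates candidates on the
    proxy instance set and splits groups on stagnation; stage 2 switches to
    the full training set and applies local search to gbest."""
    stage, gbest, gfit = 1, None, float("inf")
    for t in range(tmax):
        # reserve at least Tmax * 0.3 iterations for stage 2 (see below)
        if stage == 1 and t >= int(tmax * 0.7):
            stage = 2
        candidate = step(groups)                  # BBPSO position update (stand-in)
        fit = evaluate_proxy(candidate) if stage == 1 else evaluate_full(candidate)
        if fit < gfit:
            gbest, gfit = candidate, fit
        if stage == 1 and stuck(t):
            groups *= 2                           # group splitting
            if groups >= dim:                     # split down to the dimension -> stage 2
                stage = 2
        elif stage == 2:
            gbest = local_search(gbest)           # refine gbest (Algorithm 4 stand-in)
    return gbest, gfit
```

The skeleton collapses the per-particle updates into a single `step` call; in the full algorithm each group of particles is updated and evaluated separately.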
The algorithm is optimized in two respects, evaluation and search. On the evaluation side, proxy instances effectively shorten evaluation time; on the search side, random grouping and local search improve the original backbone particle swarm algorithm for high-dimensional data, so the method can handle most high-dimensional feature selection problems.
Note that the transition condition between the first and second stages is that group splitting is complete. However, several experiments showed that on some high-dimensional data the algorithm may still be in the first stage after Tmax iterations. To solve this problem, we stipulate that the second stage must run for at least t = Tmax × 0.3 iterations, so that the search converges more reliably to an excellent region and achieves better results on the test set.
The following are experimental results:
data set name Number of features Number of examples Categories
Madelon 500 2600 2
colon 2000 62 2
Lung 3312 203 5
prostate 10509 102 2
Four gene disease data sets were selected: Lung (lung cancer), Colon (colon cancer), Madelon, and Prostate (prostate cancer). Their dimensionality lies roughly between 500 and 10000 and their instance counts between 62 and 2600, so the ratio of features to instances varies greatly. Colon, Lung, and Prostate all have far fewer instances than features, which poses a great challenge to classification accuracy, while Madelon has far more instances than features, so computing the evaluation function usually takes a long time. These data sets were chosen for exactly these reasons.
[Experimental result tables (presented as images in the original): classification accuracy and running time of SurBBPSO and the comparison algorithms on the four data sets.]
Comparing the classification accuracy on the test set and the training set gives 8 comparisons over the 4 data sets; our SurBBPSO wins 7 of them and is second in the only remaining one. Compared with BBPSO it improves on all 4 data sets, and it essentially matches the two classical feature selection algorithms PSO and NSGA-II. The results show that in the search phase SurBBPSO searches more effectively than the reference algorithms: through random grouping it finds the approximate direction at an early stage without falling into local optima, and after entering the second stage it tries to escape local optima using the original evaluation together with the newly proposed local search. The local search is dedicated to finding a smaller gbest and leads the particles toward smaller feature subsets, which is itself a goal of feature selection.
In terms of running time, the proxy instance algorithm greatly reduces the time spent on the evaluation function, and random grouping saves considerable time in the early iterations. The reduction is most dramatic on Madelon, about 4-fold, which matches the characteristics of that data set: with 2600 instances, many instances are deleted and the effect is most obvious. For the other three data sets the reduction is smaller, only about ten to twenty percent for Lung and Prostate, while on Colon the running time is actually longer, mainly because the number of instances is small, so deleting instances has little effect.
In summary, analyzing and mining the classification characteristics of samples from gene expression profiles has important biological significance for revealing disease onset and pathological processes. Gene expression profile data contain the expression levels of all measurable genes in tissue cells, but only a small number of genes are actually associated with the sample class. For high-dimensional gene expression profile data, the invention therefore offers stronger exploratory generalization ability, better models feature selection for such data, and selects from thousands of genes the characteristic genes that are effective for sample classification, which is of great exploratory significance and practical value for disease classification and clinical medicine. The embodiment of the invention randomly groups the dimensions to reduce dimensionality, increase classification accuracy, and shorten the time of the search mechanism, while deleting part of the instances to shorten the evaluation stage. Experimental results on gene disease data sets of 500 to 10000 dimensions show that the proposed algorithm improves the classification accuracy of gene data, indicating that pathogenic genes are effectively selected.
On the other hand, the invention also discloses a gene data characteristic selection system based on the backbone particle swarm optimization, which comprises the following units,
the initialization unit is used for initializing gene disease data set population and randomly initializing gene characteristics through a random grouping algorithm to be divided into four groups;
the agent instance set setting unit is used for deleting part of instances in the training set by using an agent instance algorithm to generate an agent instance set;
the optimal gene characteristic determining unit is used for carrying out Tmax iteration on each group of gene characteristics through a backbone particle swarm algorithm, wherein the Tmax iteration is divided into two stages, the function adaptive values of the particles are evaluated through a proxy example set and an original example set respectively, the first stage adopts a grouping algorithm to enable data to be subjected to dimensionality reduction, the optimization speed is accelerated, the optimization direction is found in global search, the second stage utilizes a local search algorithm to enable the population to find the optimal solution after finding the optimization direction, and the optimal solution is the gene characteristic with the best effect;
and the data output unit is used for outputting the gene characteristics with the best effect.
In a third aspect, the present invention also discloses a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method as described above.
It is understood that the system provided by the embodiment of the present invention corresponds to the method provided by the embodiment of the present invention, and the explanation, the example and the beneficial effects of the related contents can refer to the corresponding parts in the method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A gene data feature selection method based on backbone particle swarm optimization is characterized in that: based on the gene disease data set, the following steps are carried out by a computer device,
s1, initializing gene disease data set populations, and randomly initializing gene characteristics through a random grouping algorithm to divide the gene characteristics into four groups;
s2, deleting part of examples in the training set by using an agent example algorithm to generate an agent example set;
s3, performing Tmax iteration by backbone particle swarm optimization aiming at each group of gene features, wherein in the Tmax iteration, the Tmax iteration is divided into two stages, and the function adaptive values of the particles are evaluated by a proxy example set and an original example set respectively, the first stage adopts a grouping algorithm to enable data to be subjected to dimensionality reduction, the optimization speed is accelerated, the optimization direction is found in global search, the second stage adopts a local search algorithm to enable the population to find the optimal solution after finding the optimization direction, and the optimal solution is the gene feature with the best effect;
s4, outputting the gene characteristics with the best effect.
2. The method for selecting gene data features based on backbone particle swarm optimization according to claim 1, wherein: the proxy instance algorithm step in the S2 comprises:
firstly, the algorithm deletes noise instances from the training set: if an instance is misclassified by its k nearest neighbors, it is deleted as a noise instance;
then, for each remaining instance, an "enemy" distance is calculated, i.e. the distance from the instance to its nearest instance of a different class; the larger the "enemy" distance, the farther the instance is from the class boundary, and the instances are sorted by "enemy" distance so that those with larger distances are deleted first;
finally, a nearest-neighbor list and an association list are built for each instance; if deleting an instance does not affect the classification of the remaining instances in S, the instance is deleted; when an instance P is deleted, each instance A in P's association list deletes P from its neighbor list and finds a new neighbor so that it still has k neighbors; when A finds a new neighbor N, A is also added to N's association list;
the S remaining after deletion is the proxy instance data set.
3. The method for selecting gene data features based on backbone particle swarm optimization according to claim 2, wherein: the pseudo code of the proxy instance algorithm is as follows:
parameters: T (the training set)
1)S=T
2) FOR each instance P in S
(1) If the class label given by P's k nearest neighbors disagrees with P's original label, P is considered a noise instance
(2) Delete P from S
3) ENDFOR
4) FOR each instance P in S
(1) Find the k+1 nearest neighbors N1...k+1 of P
(2) Add P to the association list of each of N1...k+1
5) ENDFOR
6) Calculate the "enemy" distance of each instance
7) FOR each instance P in S
(1) with = the number of instances in P's association list classified correctly if P is kept
(2) without = the number of instances in P's association list classified correctly if P is deleted
(3) IF without >= with
① Remove P from S
② FOR each instance A in P's association list
1) Delete P from A's neighbor list
2) Re-find A's neighbors
3) Add A to the association list of the new neighbor
③ ENDFOR
④ FOR each neighbor N of P
1) Delete P from N's association list
⑤ ENDFOR
(4) ENDIF
8) ENDFOR
9) Return S.
4. The method for selecting gene data features based on backbone particle swarm optimization according to claim 1, wherein: the random grouping algorithm in S1 includes the following steps,
initializing, randomly numbering the features, and randomly splitting the features into four groups;
in the iterative process, if the population falls into a local optimal situation, the group splitting is carried out, and the new group number is 2 times the original group number.
5. The method for selecting gene data features based on backbone particle swarm optimization according to claim 1, wherein: the first stage in S3 includes the following steps:
updating the historical optimal position of each particle in the population and the global optimal position of the population;
updating the position of each particle;
evaluating a function fitness value of the particle using the set of proxy instances;
judging whether the population is likely to fall into the local optimum or not, and if the population falls into the local optimum, performing group splitting;
judging whether the group splitting is finished, if so, entering a second stage, and otherwise, repeating the first stage;
the second stage comprises the following specific steps:
updating the historical optimal position of each particle in the population and the global optimal position of the population;
performing a more refined search using a local search algorithm;
updating the position of each particle;
evaluating a function fitness value of the particle using the original instance set;
and judging whether the iteration times reach Tmax times, and if so, ending the iteration.
6. The method for selecting gene data features based on backbone particle swarm optimization according to claim 5, wherein: the step of the local search algorithm in S3 includes,
suppose Sbest is the set of all features selected by previous gbests, and every feature has a score; the local search constructs p/10 local candidate solutions, where p is the current population size, and each local candidate is smaller than the subset selected by the current gbest;
features are selected from Sbest by tournament selection, and features with higher scores have a higher probability of being selected, so features that appeared more often in past gbests appear more frequently; repeated features may be drawn, so a local candidate may be smaller than the current gbest;
once the p/10 local candidates are selected, they are evaluated with the proxy instance set, so the best local candidate feature subset can be found quickly; finally, the best candidate subset is compared with the current gbest, both evaluated on the complete training set, and the better feature subset is set as gbest.
7. The method for selecting gene data features based on backbone particle swarm optimization according to claim 6, wherein: the pseudo code of the local search algorithm is as follows,
parameters: Sbest (feature set of previous gbests), |gbest| (number of features selected by the current gbest)
i. FOR i = 1 : p/10
(1) Select |gbest| features by tournament to form subset Pi
(2) Evaluate subset Pi with the proxy instance set
ii. ENDFOR
iii. From P1 to Pp/10, select the best-performing subset Pk
iv. Compare Pk with the current gbest, both evaluated on the original training set
v. Put the features selected by gbest into Sbest
vi. Set the superior one as gbest
vii. Return gbest.
8. The method for selecting gene data features based on backbone particle swarm optimization according to claim 7, wherein: the local search algorithm further comprises calculating the score of each feature appearing in Sbest, specifically as follows,
the scores are defined as follows:
score_f = freq_f + gc_f

wherein:

freq_f = the number of times feature f has appeared in past gbests (its occurrence count in Sbest)

gc_f = 1/|gbest| if feature f appears in the current gbest, and 0 otherwise
the first part of the score is the frequency, measured as the number of occurrences of the feature in past gbests; more occurrences indicate a higher-quality feature;
the second part compares against the current gbest: a feature receives a bonus of 1/|gbest| if it appears in the current gbest.
9. A gene data feature selection device based on backbone particle swarm optimization, characterized in that: it comprises the following units,
the initialization unit is used for initializing gene disease data set population and randomly initializing gene characteristics through a random grouping algorithm to be divided into four groups;
the agent instance set setting unit is used for deleting part of instances in the training set by using an agent instance algorithm to generate an agent instance set;
the optimal gene characteristic determining unit is used for carrying out Tmax iteration on each group of gene characteristics through a backbone particle swarm algorithm, wherein the Tmax iteration is divided into two stages, the function adaptive values of the particles are evaluated through a proxy example set and an original example set respectively, the first stage adopts a grouping algorithm to enable data to be subjected to dimensionality reduction, the optimization speed is accelerated, the optimization direction is found in global search, the second stage utilizes a local search algorithm to enable the population to find the optimal solution after finding the optimization direction, and the optimal solution is the gene characteristic with the best effect;
and the data output unit is used for outputting the gene characteristics with the best effect.
10. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 8.
CN202110858994.0A 2021-07-28 2021-07-28 Gene data characteristic selection method and device based on backbone particle swarm algorithm Active CN113571134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110858994.0A CN113571134B (en) 2021-07-28 2021-07-28 Gene data characteristic selection method and device based on backbone particle swarm algorithm


Publications (2)

Publication Number Publication Date
CN113571134A true CN113571134A (en) 2021-10-29
CN113571134B CN113571134B (en) 2024-07-02

Family

ID=78168576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110858994.0A Active CN113571134B (en) 2021-07-28 2021-07-28 Gene data characteristic selection method and device based on backbone particle swarm algorithm

Country Status (1)

Country Link
CN (1) CN113571134B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070208677A1 (en) * 2006-01-31 2007-09-06 The Board Of Trustees Of The University Of Illinois Adaptive optimization methods
US20080307399A1 (en) * 2007-06-05 2008-12-11 Motorola, Inc. Gene expression programming based on hidden markov models
US20100040281A1 (en) * 2008-08-12 2010-02-18 Halliburton Energy Services, Inc. Systems and Methods Employing Cooperative Optimization-Based Dimensionality Reduction
US8499001B1 (en) * 2009-11-25 2013-07-30 Quest Software, Inc. Systems and methods for index selection in collections of data
CN103942571A (en) * 2014-03-04 2014-07-23 西安电子科技大学 Graphic image sorting method based on genetic programming algorithm
US8793200B1 (en) * 2009-09-22 2014-07-29 Hrl Laboratories, Llc Method for particle swarm optimization with random walk
US20140257767A1 (en) * 2013-03-09 2014-09-11 Bigwood Technology, Inc. PSO-Guided Trust-Tech Methods for Global Unconstrained Optimization
CN108595499A (en) * 2018-03-18 2018-09-28 西安财经学院 A kind of population cluster High dimensional data analysis method of clone's optimization
CN110188785A (en) * 2019-03-28 2019-08-30 山东浪潮云信息技术有限公司 A kind of data clusters analysis method based on genetic algorithm
CN110334869A (en) * 2019-08-15 2019-10-15 重庆大学 A kind of mangrove forest ecological health forecast training method based on dynamic colony optimization algorithm
CN110674860A (en) * 2019-09-19 2020-01-10 南京邮电大学 Feature selection method based on neighborhood search strategy, storage medium and terminal
CN111723897A (en) * 2020-05-13 2020-09-29 广东工业大学 Multi-modal feature selection method based on particle swarm optimization
CN112070125A (en) * 2020-08-19 2020-12-11 西安理工大学 Prediction method of unbalanced data set based on isolated forest learning
CN112116952A (en) * 2020-08-06 2020-12-22 温州大学 Gene selection method of wolf optimization algorithm based on diffusion and chaotic local search
WO2021022637A1 (en) * 2019-08-06 2021-02-11 南京赛沃夫海洋科技有限公司 Unmanned surface vehicle path planning method and system based on improved genetic algorithm
CN112926837A (en) * 2021-02-04 2021-06-08 郑州轻工业大学 Method for solving job shop scheduling problem based on data-driven improved genetic algorithm
CN113011076A (en) * 2021-03-29 2021-06-22 西安理工大学 Efficient particle swarm optimization method based on RBF proxy model


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
邓秀勤; 李文洲; 武继刚; 刘太亨: "A hybrid feature selection algorithm combining the Shapley value and particle swarm optimization", Journal of Computer Applications, no. 05 *
雷阳; 李树荣; 张强; 张晓东: "A hybrid genetic algorithm based on bare-bones particle swarm and its application", Computer Engineering and Applications, no. 36 *

Also Published As

Publication number Publication date
CN113571134B (en) 2024-07-02

Similar Documents

Publication Publication Date Title
Banerjee et al. Evolutionary rough feature selection in gene expression data
Alomari et al. A hybrid filter-wrapper gene selection method for cancer classification
CN111368891B (en) K-Means text classification method based on immune clone gray wolf optimization algorithm
CN108681660A (en) A kind of non-coding RNA based on association rule mining and disease relationship prediction technique
CN107169983B (en) Multi-threshold image segmentation method based on cross variation artificial fish swarm algorithm
Nagpal et al. A feature selection algorithm based on qualitative mutual information for cancer microarray data
İni̇k et al. MODE-CNN: A fast converging multi-objective optimization algorithm for CNN-based models
Saha et al. Simultaneous feature selection and symmetry based clustering using multiobjective framework
CN116842374A (en) Preprocessing method and device for complex multi-label medical data containing interrelationships
CN113283573A (en) Automatic search method for optimal structure of convolutional neural network
Kulikov et al. Machine learning can be as good as maximum likelihood when reconstructing phylogenetic trees and determining the best evolutionary model on four taxon alignments
CN111782904B (en) Unbalanced data set processing method and system based on improved SMOTE algorithm
CN113571134A (en) Method and device for selecting gene data characteristics based on backbone particle swarm optimization
CN111488903A (en) Decision tree feature selection method based on feature weight
CN113780334B (en) High-dimensional data classification method based on two-stage mixed feature selection
Al-Baity et al. A New Optimized Wrapper Gene Selection Method for Breast Cancer Prediction.
CN115019898A (en) Eutectic prediction method based on deep forest
Liang et al. ASE: Anomaly Scoring Based Ensemble Learning for Imbalanced Datasets
Majeed et al. A comparison between the performance of features selection techniques: survey study
CN112308160A (en) K-means clustering artificial intelligence optimization algorithm
Wali et al. m-CALP–Yet another way of generating handwritten data through evolution for pattern recognition
Nematzadeh et al. Distance-based mutual congestion feature selection with genetic algorithm for high-dimensional medical datasets
Anaraki et al. A Fuzzy-Rough Feature Selection Based on Binary Shuffled Frog Leaping Algorithm
Tran Improving the performance of imputation methods for gene expression classification using feature selection
Kuzudisli et al. RCE-IFE: Recursive Cluster Elimination with Intra-cluster Feature Elimination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant