CN108805162A - Yeast multi-label feature selection method and device based on particle swarm optimization - Google Patents

Yeast multi-label feature selection method and device based on particle swarm optimization

Info

Publication number
CN108805162A
CN108805162A (application CN201810380973.0A)
Authority
CN
China
Prior art keywords
particle
yeast
feature
label
speed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810380973.0A
Other languages
Chinese (zh)
Inventor
孙林
郑瑞丽
张倩倩
申陈海
靳瑞霞
刘艳
王蓝莹
殷腾宇
赵婧
秦小营
王学敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Normal University
Original Assignee
Henan Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Normal University
Priority to CN201810380973.0A
Publication of CN108805162A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The present invention relates to a yeast multi-label feature selection method and device based on particle swarm optimization. An evaluation criterion function for candidate feature subsets is constructed from the correlation between yeast features and labels, the redundancy between features, and the correlation between labels, and this function serves as the fitness function of a discrete particle swarm method, so that an optimal feature subset is selected from the yeast data set. The present invention not only selects an effective feature subset, providing a compact and accurate feature subset for subsequent work, but also reduces the time complexity and computational complexity of the classifier and improves classification performance.

Description

Yeast multi-label feature selection method and device based on particle swarm optimization
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a yeast multi-label feature selection method and device based on particle swarm optimization.
Background art
In the traditional supervised learning framework, each learning object has one and only one class label, and labels are mutually exclusive and independent. For example, in gender classification prediction there is only the "gender" label, whose value is either "male" or "female", with no overlap between values. In real life, however, a single label often cannot accurately describe a complex object: an object may be associated with several class labels, and correlations may also exist between labels. For example, in text classification, a news report entitled "Yang Shuan discusses Olympic preparations" could be filed under the "sports", "transportation", "weather", "economy", and "politics" sections; in image classification, a single picture may carry multiple semantic labels such as "beach", "sea", and "coconut tree"; and in music emotion analysis, one song may simultaneously bear labels such as "cheerful", "sad", and "nostalgic", depending on the emotions it expresses. Objects with multiple labels are seen everywhere in life, so multi-label classification has attracted extensive research and attention in recent years.
Bioinformatics is a field in which multi-label learning is widely applied. The yeast cell-cycle expression data set is a commonly used multi-label data set and a typical bioinformatics task: predicting which of 14 functional categories each yeast gene is associated with. In such applications, a hierarchical structure often exists among the labels and is discovered by domain experts, for example gene topologies whose functional categories form trees or directed acyclic graphs. When applying multi-label learning techniques, the relationships between these labels therefore need to be exploited well.
Yeast function prediction faces a series of challenges. On the one hand, each yeast sample may have many possible class labels, and these labels are correlated to some extent, so multi-label learning must take the correlation between labels into account. On the other hand, since yeast data are described by high-dimensional gene sequences, yeast samples have two major characteristics, large sample size and high vector dimensionality, which make yeast feature selection a machine learning problem with very high running time and space complexity; the excessive dimensionality of these data hampers and constrains our understanding and modeling of them. In the prior art there exist some feature selection methods for yeast data. For example, some are based on embedded feature selection that evaluates each feature against prediction risk to finally obtain an optimal feature subset. Such methods are closely tied to the classifier and the evaluation index, which tends to cause long computation times and low dimensionality-reduction efficiency.
Summary of the invention
The purpose of the present invention is to provide a yeast multi-label feature selection method and device based on particle swarm optimization, so as to solve the problems of long computation time and low efficiency of feature selection methods in the prior art.
In order to solve the above technical problems, the technical scheme of the present invention is as follows:
The present invention provides a yeast multi-label feature selection method based on particle swarm optimization, comprising the following steps:
Extracting a yeast sample data set, the yeast sample data set comprising a plurality of yeast sample feature matrices and a sample label matrix;
Extracting the feature data of the yeast sample data set, and initializing a binary-coded particle swarm, including the positions and velocities of the particles;
Constructing a label-correlation-aware CFS evaluation criterion function by measuring the redundancy between features, the correlation between features and labels, and the correlation between labels;
Calculating the fitness value of each particle according to the label-correlation-aware CFS evaluation function;
For each particle, comparing the calculated fitness value with the best position pbest it has experienced; if it is better than the experienced best position pbest, taking the position with the calculated fitness value as its experienced best position pbest;
Taking the best of the best positions pbest of all particles as the best position gbest of the swarm;
Iteratively updating the positions and velocities of the particles; the features corresponding to value 1 in the finally obtained best position gbest of the swarm are the optimal feature subset of the yeast data set.
Further, updating the positions and velocities of the particles comprises:
Judging whether t < γ·N_iter is satisfied, where γ is a random number in [0, 1] and N_iter is the total number of iterations;
If t < γ·N_iter, the position of the i-th particle in dimension j is updated in the t-th iteration as:
x_ij^(t+1) = 1 if rand() < S(v_ij^(t+1)), and x_ij^(t+1) = 0 otherwise, where S(v) = 1/(1 + e^(−v));
here v_ij^t is the velocity of the i-th particle in dimension j in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function with values between 0 and 1, regenerated at every iteration; and S(·) is the logistic function that derives the position of the particle from its velocity v_ij^(t+1);
Otherwise, the position of the i-th particle in dimension j is updated in the t-th iteration by the locality-enhancing rule, a modified logistic mapping that determines the new position from both the current position x_ij^t and the velocity v_ij^(t+1), with the same definitions of v_ij^t, rand(), and the logistic function.
Global search ability is needed in the early stage and local search ability in the later stage, so different formulas are used to update the positions and velocities of the particles in the different situations.
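The early-stage update rule can be sketched in Python. This is a minimal illustration of the standard logistic (sigmoid) position mapping used in binary PSO, not code from the patent; the numerically stable split of the logistic function is an implementation choice:

```python
import math
import random

def update_position_sigmoid(v_ij):
    """Early-stage (t < gamma * N_iter) binary-PSO position update:
    the new bit is 1 with probability S(v) = 1 / (1 + e^-v)."""
    # numerically stable logistic function
    if v_ij >= 0:
        s = 1.0 / (1.0 + math.exp(-v_ij))
    else:
        e = math.exp(v_ij)
        s = e / (1.0 + e)
    return 1 if random.random() < s else 0
```

For strongly positive velocities the bit is almost surely 1, and for strongly negative velocities almost surely 0, which is why the later text argues for clamping velocities to a suitable v_max.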
Further, the label-correlation-aware CFS evaluation function is:
CFS(S) = k·(r_SL + r_LL) / (2·sqrt(k + k(k−1)·r_SS))
where CFS(S) is the evaluation value of a candidate subset S containing k features; r_SL is the average correlation between the yeast candidate feature subset S and the label set L, r_LL is the average correlation within the yeast label set L, and r_SS is the average redundancy between features in the yeast candidate feature subset S.
Further, before calculating the fitness value of each particle, the method further comprises the step of controlling the number of positions with value 1 in each particle to n:
Counting the number h of positions with value 1 in each particle:
If h > n, randomly resetting h − n positions with value 1 to 0;
If h < n, randomly setting n − h positions with value 0 to 1.
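The population-control step above can be sketched as follows; `repair` is a hypothetical helper name, and the use of `random.sample` to pick which bits to flip is an assumption about how "randomly" would be implemented:

```python
import random

def repair(particle, n):
    """Force exactly n selected features (bits equal to 1) in a particle."""
    ones = [i for i, b in enumerate(particle) if b == 1]
    zeros = [i for i, b in enumerate(particle) if b == 0]
    p = particle[:]
    if len(ones) > n:
        # too many features selected: randomly drop the surplus
        for i in random.sample(ones, len(ones) - n):
            p[i] = 0
    elif len(ones) < n:
        # too few features selected: randomly add the shortfall
        for i in random.sample(zeros, n - len(ones)):
            p[i] = 1
    return p
```

Particles that already contain exactly n ones are returned unchanged.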
The present invention also provides a yeast multi-label feature selection device based on particle swarm optimization, comprising a processor, the processor being configured to execute instructions to implement the following method:
Extracting a yeast sample data set, the yeast sample data set comprising a plurality of yeast sample feature matrices and a sample label matrix;
Extracting the feature data of the yeast sample data set, and initializing a binary-coded particle swarm, including the positions and velocities of the particles;
Constructing a label-correlation-aware CFS evaluation criterion function by measuring the redundancy between features, the correlation between features and labels, and the correlation between labels;
Calculating the fitness value of each particle according to the label-correlation-aware CFS evaluation function;
For each particle, comparing the calculated fitness value with the best position pbest it has experienced; if it is better than the best position pbest, taking the position with the calculated fitness value as its experienced best position pbest;
Taking the best of the best positions pbest of all particles as the best position gbest of the swarm;
Iteratively updating the positions and velocities of the particles; the features corresponding to value 1 in the finally obtained best position gbest of the swarm are the optimal feature subset of the yeast data set.
Further, updating the positions and velocities of the particles comprises:
Judging whether t < γ·N_iter is satisfied, where γ is a random number in [0, 1] and N_iter is the total number of iterations;
If t < γ·N_iter, the position of the i-th particle in dimension j is updated in the t-th iteration as:
x_ij^(t+1) = 1 if rand() < S(v_ij^(t+1)), and x_ij^(t+1) = 0 otherwise, where S(v) = 1/(1 + e^(−v));
here v_ij^t is the velocity of the i-th particle in dimension j in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function with values between 0 and 1, regenerated at every iteration; and S(·) is the logistic function that derives the position of the particle from its velocity v_ij^(t+1);
Otherwise, the position of the i-th particle in dimension j is updated in the t-th iteration by the locality-enhancing rule, a modified logistic mapping that determines the new position from both the current position x_ij^t and the velocity v_ij^(t+1), with the same definitions of v_ij^t, rand(), and the logistic function.
Further, the label-correlation-aware CFS evaluation function is:
CFS(S) = k·(r_SL + r_LL) / (2·sqrt(k + k(k−1)·r_SS))
where CFS(S) is the evaluation value of a candidate subset S containing k features; r_SL is the average correlation between the yeast candidate feature subset S and the label set L, r_LL is the average correlation within the yeast label set L, and r_SS is the average redundancy between features in the yeast candidate feature subset S.
Further, before calculating the fitness value of each particle, the method further comprises the step of controlling the number of positions with value 1 in each particle to n:
Counting the number h of positions with value 1 in each particle:
If h > n, randomly resetting h − n positions with value 1 to 0;
If h < n, randomly setting n − h positions with value 0 to 1.
Beneficial effects of the present invention:
The yeast multi-label feature selection method and device based on particle swarm optimization of the present invention construct an evaluation criterion function for candidate feature subsets from the correlation between features and labels, the correlation between labels, and the redundancy between features, and use it as the fitness function of the discrete particle swarm method, so that an optimal yeast feature subset is selected from the data set. The present invention not only selects an effective feature subset, providing a compact and accurate feature subset for subsequent work, but also reduces the time complexity and computational complexity of the classifier and improves classification performance.
Description of the drawings
Fig. 1 is a flowchart of the method of the present invention.
Detailed description of embodiments
In order to select, from the yeast data set, the features most correlated with the labels and provide a compact, accurate feature subset for subsequent work, the present invention provides a yeast multi-label feature selection device based on particle swarm optimization. The device comprises a processor configured to execute code instructions stored in a memory to implement the yeast multi-label feature selection method based on particle swarm optimization of the present invention. The method is described in detail below with reference to the accompanying drawings.
Overall, the method is based on correlation and a discrete particle swarm method. It constructs the evaluation criterion function of candidate feature subsets from the correlation between yeast features and labels, the redundancy between features, and the correlation between labels, measuring how closely a feature subset relates to the classification problem, and uses this function as the fitness function of the discrete particle swarm method to randomly search the feature space and select an optimal feature subset.
The particle swarm optimization method and the correlation-based feature selection method are first introduced below.
1. Particle swarm optimization
The Particle Swarm Optimization (PSO) method was proposed by Eberhart and Kennedy in 1995 and grew out of the study of the foraging behavior of bird flocks. Compared with other evolutionary methods, its greatest advantages are simple implementation and strong global optimization ability. In this method, each particle adjusts its own direction and velocity through a comprehensive analysis of the individual and the swarm, and finds the optimal solution by iteration.
In the PSO method, each particle has a velocity that determines its direction and position. During each iteration of the optimization, each particle tracks the best position pbest it has found and the global best position gbest found by all particles, and uses them to determine its next move. Every particle has a fitness value determined by the function being optimized.
For a D-dimensional search space, in the t-th iteration, suppose the current position of the i-th particle is X_i = (x_i1, x_i2, …, x_iD), where x_ij is the position of the i-th particle in dimension j, j = 1, 2, …, D; its velocity is V_i = (v_i1, v_i2, …, v_iD), where v_ij is the velocity of the i-th particle in dimension j; the best position pbest found so far by the particle is P_i = (p_i1, p_i2, …, p_iD), where p_ij is its best position in dimension j; and the best position gbest found by the whole swarm is P_g = (g_1, g_2, …, g_D), where g_j is the swarm's best position in dimension j. The swarm updates itself through the individual best and the global best. In the (t+1)-th iteration, the i-th particle completes the update of velocity and position in dimension j according to:
v_ij^(t+1) = w·v_ij^t + c1·rand()·(p_ij − x_ij^t) + c2·rand()·(g_j − x_ij^t)
x_ij^(t+1) = x_ij^t + v_ij^(t+1)
where v_ij^t is the velocity of the i-th particle in dimension j in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function with values between 0 and 1, regenerated at every iteration; w is the inertia factor, determining how much of the velocity of the previous iteration the particle inherits; c1 and c2 are acceleration factors, usually c1 = c2 = 2, embodying the ability of a particle to learn from excellent individuals in the swarm; and x_ij^t is the position of the i-th particle in dimension j in the t-th iteration.
PSO may also specify a maximum velocity v_max to bound the maximum moving distance of a particle in one iteration; the velocity of each particle is limited to the range [−v_max, v_max], and if the velocity of a particle exceeds v_max, it is set to v_max.
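The canonical velocity update with v_max clamping can be sketched as follows; this is an illustrative one-dimensional version, and the parameter defaults (w = 0.7, v_max = 4.0) are assumptions, not values prescribed by the patent:

```python
import random

def update_velocity(v, x, pbest, gbest, w=0.7, c1=2.0, c2=2.0, vmax=4.0):
    """One dimension of the canonical PSO velocity update,
    clamped to the range [-vmax, vmax]."""
    v_new = (w * v
             + c1 * random.random() * (pbest - x)   # cognitive term
             + c2 * random.random() * (gbest - x))  # social term
    return max(-vmax, min(vmax, v_new))
```

Clamping keeps the logistic mapping used later in the discrete variant away from its saturated regions, where positions would otherwise get stuck at 0 or 1.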
2. Correlation-based feature selection
Correlation-Based Feature Selection (CFS) measures the redundancy between features and the correlation between features and labels, constructs an evaluation criterion function for features, and uses it as the fitness function of a heuristic search to assess the value of features. For candidate feature subsets generated by a random search strategy, CFS measures the quality of a feature subset using information gain or the Pearson linear correlation coefficient; its principle is relatively simple, its computational complexity is low, it is easy to implement, and it can select an optimal feature subset effectively and efficiently.
The correlation-based feature selection method CFS evaluates candidate feature subsets in combination with a heuristic search strategy. Since the redundancy between features and the correlation between features and labels serve as the evaluation criterion of the heuristic search strategy, CFS considers both the feature-feature redundancy and the feature-label correlation. Single-label and multi-label CFS share the same evaluation criterion:
CFS(S) = k·r_SL / sqrt(k + k(k−1)·r_SS)
where CFS(S) is the evaluation value of a candidate feature subset S containing k features; the larger the value of CFS(S), the closer the relationship between the candidate feature subset S and the classification problem, i.e., the better the candidate feature subset S; r_SL is the average correlation between the yeast candidate feature subset S and the label set L, and r_SS is the average redundancy between features in the yeast candidate feature subset S.
The correlation-based feature selection method pursues two goals: maximizing the average correlation r_SL between the candidate feature subset S and the labels L, so as to improve prediction accuracy; and minimizing the average redundancy r_SS between the features in the candidate feature subset S, so as to avoid redundant features in the candidate subset that would reduce classification performance and efficiency.
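Given the two averages, the CFS merit described above is a one-line computation. A minimal sketch, assuming the canonical CFS formula with k features, average feature-label correlation r_cf, and average feature-feature redundancy r_ff:

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """CFS merit of a k-feature subset: k * r_cf / sqrt(k + k*(k-1)*r_ff).
    Larger is better: high feature-label correlation and low
    feature-feature redundancy both increase the merit."""
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)
```

For a single feature (k = 1) the merit reduces to r_cf itself, and for fixed r_cf the merit decreases as the subset becomes more internally redundant.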
The average correlation between a feature set and the label set is computed from the correlation r_fl between a single feature f and a single label l. Summing over all labels in the same way and averaging gives the average correlation between a single feature f and the label set L:
r_fL = (1/q)·Σ_{l∈L} r_fl
where q is the number of labels.
Summing over all features in the same way and averaging gives the average correlation between the label set L and the feature set S:
r_SL = (1/(k·q))·Σ_{f∈S} Σ_{l∈L} r_fl
where k is the number of features, q is the number of labels, and r_fl is the correlation between a single feature and a single label.
The average redundancy between features can be obtained by pairing the features in the feature set two by two, computing the redundancy of each pair, then summing and averaging:
r_SS = (1/fp)·Σ_{f_i,f_j∈S, i<j} r_{f_i f_j}
where r_{f_i f_j} is the redundancy between two single features; f_i and f_j are two different features in the feature subset S; and fp = k(k−1)/2 is the number of feature pairs in the subset S.
When a feature is added to or removed from the candidate feature subset, the correlation-based feature selection method computes the average correlation r_SL and the average redundancy r_SS to give the value of CFS(S), which determines whether the feature is added to the optimal feature subset. Information gain measures the change in information content after adding or deleting a feature, i.e., the change in the predictive ability of the candidate feature subset. Therefore, information gain can be used to compute both the average correlation between features and labels and the average redundancy between features.
The information gain measure asks how much information a selected candidate feature adds to the classification problem: the more information it adds, the more relevant the candidate feature is to the classification problem. For a single feature, the amount of information gained or lost after selecting the feature represents the feature's contribution to the classification problem.
For the yeast feature selection problem, the following formulas give the entropy H(L) of the label set L and the conditional entropy H(L|S) of the label set L given a candidate feature subset S:
H(L) = −Σ_l p(l)·log2 p(l)
H(L|S) = −Σ_f p(f) Σ_l p(l|f)·log2 p(l|f)
where p(l) is the probability that the label set L takes value l, and p(l|f) is the conditional probability distribution of the label set L given the feature set S.
Therefore, the information gain measuring the correlation between the candidate feature subset S and the label set L is:
Gain = H(L) − H(L|S) = H(S) − H(S|L) = H(L) + H(S) − H(S, L)
where H(L) is the entropy of the label set L; H(S) is the entropy of the feature set S; H(L|S) is the conditional entropy of the label set L given the candidate feature subset S; and H(S, L) is the joint entropy of the feature subset S and the label set L.
Information gain, however, faces a problem: variables with more values always carry more information than variables with fewer values, regardless of whether they actually provide more information, and this bias can affect the solution of the classification problem. Symmetric Uncertainty (SU) compensates for this bias in value range and normalizes the result to [0, 1]. The symmetric uncertainty formula is:
SU = 2·Gain / (H(L) + H(S))
where H(L) is the entropy of the label set L and H(S) is the entropy of the feature set S.
When the information gain method is used, the redundancy between features in the candidate feature subset S is:
r_SS = (1/fp)·Σ_{i<j} 2·(H(f_i) − H(f_i|f_j)) / (H(f_i) + H(f_j))
where fp is the number of feature pairs; H(f_i) is the entropy of feature f_i; H(f_j) is the entropy of feature f_j; and H(f_i|f_j) is the conditional entropy of feature f_i given feature f_j.
The correlation between the candidate feature subset S and the label set L is:
r_SL = (1/(k·q))·Σ_{f∈S} Σ_{l∈L} 2·(H(f) − H(f|l)) / (H(l) + H(f))
where k is the number of features; q is the number of labels; H(l) is the entropy of label l; H(f) is the entropy of feature f; and H(f|l) is the conditional entropy of feature f given label l.
Therefore, the CFS criterion function is:
CFS(S) = k·r_SL / sqrt(k + k(k−1)·r_SS)
with r_SL and r_SS measured by symmetric uncertainty.
3. The yeast multi-label feature selection method based on particle swarm optimization of the present invention
In multi-label feature selection a feature is either selected or not selected, so the continuous particle swarm method cannot handle the feature selection problem directly; a discrete particle swarm method is needed here.
In the discrete particle swarm method, a particle is expressed as a binary vector of 0s and 1s whose length is the number of all features: 1 indicates that the corresponding feature is selected, and 0 that it is not. Velocity is defined as the probability of each feature of the particle taking 0 or 1. Compared with the continuous particle swarm optimization method, the discrete particle swarm method initializes and updates particle positions in a binary manner.
In the discrete particle swarm method, positions and velocities are initialized by:
x_ij^0 = round(rand())
v_ij^0 = −v_max + 2·v_max·rand()
where x_ij^0 is the initialized position of the i-th particle in dimension j, i = 1, 2, …, m, j = 1, 2, …, D; rand() is a uniformly distributed random function with values between 0 and 1, regenerated at every iteration; v_ij^0 is the initialized velocity of the i-th particle in dimension j; and v_max is the maximum velocity.
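The binary initialization can be sketched as follows; `init_swarm` is a hypothetical helper, and uniform sampling of velocities in [−v_max, v_max] is one reading of the initialization described above:

```python
import random

def init_swarm(m, d, vmax=4.0):
    """Initialize m binary particles over d features: positions are
    random bits, velocities are uniform in [-vmax, vmax]."""
    X = [[random.randint(0, 1) for _ in range(d)] for _ in range(m)]
    V = [[random.uniform(-vmax, vmax) for _ in range(d)] for _ in range(m)]
    return X, V
```

Each particle's bit vector is a candidate feature subset; its velocities are the per-feature propensities that later feed the logistic position update.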
In each subsequent iteration t, the velocity update formula of the discrete particle swarm is unchanged, while the position update formula becomes:
x_ij^(t+1) = 1 if rand() < S(v_ij^(t+1)), and x_ij^(t+1) = 0 otherwise, where S(v) = 1/(1 + e^(−v))
where v_ij^t is the velocity of the i-th particle in dimension j in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function with values between 0 and 1, regenerated at every iteration; and S(·) is the logistic function that derives the position of the particle from its velocity.
When the velocity v_ij^(t+1) is large, the value of S(v_ij^(t+1)) approaches 1, which causes the position of the particle to remain 1 and is unfavorable for the discrete particle method's search for the globally optimal solution. Therefore, a suitable maximum velocity v_max must be chosen to increase the possibility of generating new candidate solutions.
When the discrete particle swarm method is used as a high-dimensional feature selection method with a random search strategy, the size of the selected feature subset must be controlled, so the position values of a particle need to be constrained:
(1) when the number of positions with value 1 in a particle exceeds the size of the feature subset to be selected, randomly reset positions with value 1 to 0;
(2) when the number of positions with value 1 in a particle is smaller than the size of the feature subset to be selected, randomly set positions with value 0 to 1.
Compared with the traditional particle swarm optimization method, the discrete particle swarm merely turns the position information of features into selected/unselected information expressed as 0 or 1. The discrete particle swarm method, however, is a random search method that lacks local exploration: as the iterations proceed, the particles become increasingly random and directionless and fail to converge. To solve this problem, the position update can be enhanced for local search by a modified logistic mapping that determines the new position x_ij^(t+1) from both the current position x_ij^t and the velocity v_ij^(t+1), where x_ij^t is the position of the i-th particle in dimension j in the t-th iteration, v_ij^t is its velocity, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter, and rand() is a uniformly distributed random function with values between 0 and 1, regenerated at every iteration.
According to the general principle of heuristic search, the algorithm needs global search ability in the early stage and local search ability in the later stage, so the method is further modified as follows:
If t < γ·N_iter, the positions and velocities of the particles are updated with the original discrete particle swarm formulas; otherwise they are updated with the new transformation formula.
Here γ is a random number in [0, 1] and N_iter is the total number of iterations; that is, the early stage of the method uses the original discrete particle swarm, and the later stage uses the new transformation formula.
The CFS evaluation function maximizes the correlation between features and labels to improve prediction accuracy, and minimizes the redundancy between features to avoid redundant attributes in the feature subset that would reduce classification performance and efficiency, using information gain or the Pearson correlation coefficient to measure the quality of a feature subset. But the function does not consider the relationship between labels, which makes the computation inaccurate and affects classification precision. Labels are interrelated, and the correlation between labels can provide additional useful information; making full use of this information helps to build a better classification model. Assuming that the class labels a yeast sample possesses contribute to it equally, the correlation between labels is added into the CFS evaluation function to improve it. Following the idea of norm normalization, the present invention sets the sum of the correlation among all labels of a sample and the correlation between features and labels to 1, and on the basis of the original evaluation function introduces the correlation between labels, proposing a label-correlation-aware CFS evaluation function. The new fitness function is defined as follows:
CFS(S) = k·(r_SL + r_LL) / (2·sqrt(k + k(k−1)·r_SS))
where CFS(S) is the evaluation value of a candidate feature subset S containing k features; the larger the value of CFS(S), the closer the relationship between the candidate feature subset S and the classification problem, i.e., the better the candidate feature subset S; r_SL is the average correlation between the yeast candidate feature subset S and the label set L, r_LL is the average correlation within the yeast label set L, and r_SS is the average redundancy between features in the yeast candidate feature subset S.
By maximizing, through their mean, the correlation between features and labels and the correlation between labels, the accuracy of prediction is improved; by minimizing the redundancy between features, redundant features that would reduce classification performance and efficiency are kept out of the feature subset.
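One way to read the improved criterion, with the numerator averaging the feature-label and label-label correlations, is the following sketch; the exact closed form in the patent appears only as an image, so this reconstruction is an assumption:

```python
import math

def cfs_label_merit(k, r_sl, r_ll, r_ff):
    """Label-correlation-aware CFS merit: like the standard CFS merit,
    but the numerator averages the subset-label correlation r_sl and
    the label-label correlation r_ll (assumed reconstruction)."""
    return k * (r_sl + r_ll) / (2 * math.sqrt(k + k * (k - 1) * r_ff))
```

When r_ll equals r_sl this reduces to the standard CFS merit, so the extra term only shifts the ranking of subsets when label-label correlation carries information the feature-label term misses.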
Its detailed process is as follows:
Step 1, data preprocessing: extract the yeast sample data set, including a plurality of yeast sample feature matrices and a sample label matrix. The description of the yeast data set includes the number of samples, the number of features, and the number of labels. In the feature matrix, each row is a sample of the yeast data set and each column vector is a feature; in the label matrix, each row is a sample of the yeast data set and each column vector is a label. For example, the original yeast data set to be dimensionality-reduced is X = {x_1, x_2, …, x_n}, where n is the number of samples, each of which has several features.
Step 2, according to the number of features of the training sample set X obtained by feature extraction, initialize a binary-coded particle swarm: the positions and initial velocities of the swarm are randomly generated as a set of initial values. The initialization formulas for velocity and position are as follows:
where the first quantity is the initialized position of the j-th dimension of the i-th particle, i = 1, 2, …, m, j = 1, 2, …, D; rand() is a uniformly distributed random function on (0, 1), regenerated at every iteration; the second quantity is the initialized velocity of the j-th dimension of the i-th particle, i = 1, 2, …, m, j = 1, 2, …, D; and vmax is the maximum velocity.
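The initialization in Step 2 can be sketched as follows; the function name, the use of NumPy, and the symmetric velocity range [-vmax, vmax] are assumptions rather than the patent's own implementation:

```python
import numpy as np

def init_swarm(m, D, v_max, rng=None):
    """Randomly initialize a binary-coded swarm (illustrative sketch).

    m     -- number of particles
    D     -- number of dimensions (features)
    v_max -- maximum velocity
    """
    rng = np.random.default_rng(rng)
    positions = rng.integers(0, 2, size=(m, D))           # each bit is 0 or 1
    velocities = rng.uniform(-v_max, v_max, size=(m, D))  # |v| <= v_max
    return positions, velocities
```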
Step 3, constrain the number of positions valued 1 in each particle to n: count the number h of positions valued 1 in each particle. If h > n, randomly reset h − n of the positions valued 1 to 0; otherwise, randomly set n − h positions valued 0 to 1.
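The repair rule of Step 3, which forces each particle to select exactly n features, can be sketched as below; the function name and NumPy usage are assumptions:

```python
import numpy as np

def enforce_n_ones(particle, n, rng=None):
    """Force a binary particle to contain exactly n ones (Step 3 sketch)."""
    rng = np.random.default_rng(rng)
    p = particle.copy()
    ones = np.flatnonzero(p == 1)
    h = ones.size
    if h > n:
        # too many features selected: clear h-n randomly chosen ones
        p[rng.choice(ones, h - n, replace=False)] = 0
    elif h < n:
        # too few: set n-h randomly chosen zeros to one
        zeros = np.flatnonzero(p == 0)
        p[rng.choice(zeros, n - h, replace=False)] = 1
    return p
```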
Step 4, compute the fitness of each particle in the swarm according to the CFS evaluation criterion function, so that during feature selection the size of the feature subset is reduced while the classifier performance is maintained or even improved; that is, a particle that yields higher classification precision from the classifier while selecting fewer features obtains a higher fitness. The fitness formula is as follows:
where CFS(S) is the evaluation value of the candidate feature subset S containing k features; the larger CFS(S) is, the more closely the candidate feature subset S relates to the classification problem, i.e., the better the subset. The formula combines the average correlation between the yeast candidate feature subset S and the label set L, the average correlation among the labels within the label set L, and the average redundancy among the features in the yeast candidate feature subset S.
Step 5, compare the fitness pi of the current particle with the particle's local optimum pbest and the swarm's global optimum gbest: if the particle's fitness pi exceeds its local optimum pbest, set the particle's local optimum pbest to pi; if the fitness pi exceeds the global optimum gbest, set the swarm's global optimum gbest to pi.
After initialization, the velocities and positions of the particles form a set of random solutions, and the optimal solution is then found by iteration. In each iteration, a particle updates itself using two optima: the best solution found by the particle itself, i.e., the local optimum pbest, and the best solution found so far by the whole swarm, called the global optimum gbest.
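The pbest/gbest bookkeeping described above might look like the following vectorized sketch; the function name and array layout are assumptions, while the "larger fitness is better" convention follows the text:

```python
import numpy as np

def update_bests(fit, pos, pbest_fit, pbest_pos):
    """Update per-particle bests and the swarm best (Step 5 sketch).

    fit       -- current fitness of each particle, shape (m,)
    pos       -- current binary positions, shape (m, D)
    pbest_*   -- best fitness/position each particle has seen so far
    Returns updated pbest arrays plus the swarm's gbest fitness/position.
    """
    improved = fit > pbest_fit
    pbest_fit = np.where(improved, fit, pbest_fit)
    pbest_pos = np.where(improved[:, None], pos, pbest_pos)
    g = np.argmax(pbest_fit)  # swarm best is the best of all pbest
    return pbest_fit, pbest_pos, pbest_fit[g], pbest_pos[g].copy()
```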
Step 6, according to the local optimum pbest updated in Step 5 and the global optimum gbest of the swarm, compute each particle's velocity of motion and new position.
The particles search for the optimal solution through the continual motion of each individual in the swarm. Each particle determines its direction of motion from two parts: its own current local optimum and the global optimum of all particles. Each particle represents a point in the D-dimensional space, and its next position is determined by its current position and velocity. In the early phase of the algorithm, the velocity and position update formulas are as follows:
where the first quantity is the velocity of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, Niter; rand() is a uniformly distributed random function on (0, 1), regenerated at every iteration; w is the inertia factor, determining how much of the previous iteration's velocity the particle inherits; c1 and c2 are acceleration factors, usually c1 = c2 = 2, embodying the particle's ability to learn from the best individuals of the swarm; the next quantity is the position of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, Niter; and a logistic function gives the position of the particle from its velocity.
In the later phase of the algorithm, the position update formula is as follows:
where the first quantity is the velocity of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, Niter; rand() is a uniformly distributed random function on (0, 1), regenerated at every iteration; and a logistic function gives the position of the particle from its velocity.
Compared with traditional PSO, BPSO merely changes the position information of a feature into 0/1 form, indicating whether the feature is selected or not. However, BPSO is a random search algorithm that lacks local exploitation: as the iterations proceed, the particles move ever more randomly and without direction, and the algorithm fails to converge. Following the general principle of heuristic random search, the algorithm therefore needs global exploration ability in its early phase and local exploitation ability in its later phase.
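One iteration of the two-phase binary update described above could be sketched as follows. The early-phase stochastic sigmoid rule is the standard BPSO update; the late-phase deterministic threshold is only one plausible reading of the patent's omitted second position formula, and all parameter defaults (w, c1, c2, v_max) are assumptions:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def bpso_step(pos, vel, pbest, gbest, t, n_iter,
              w=0.9, c1=2.0, c2=2.0, v_max=4.0, rng=None):
    """One two-phase BPSO iteration (Step 6 sketch, parameters assumed)."""
    rng = np.random.default_rng(rng)
    r1 = rng.random(pos.shape)
    r2 = rng.random(pos.shape)
    # standard velocity update: inertia + cognitive and social learning terms
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    vel = np.clip(vel, -v_max, v_max)
    gamma = rng.random()  # random switch point gamma in [0, 1], per the text
    if t < gamma * n_iter:
        # early phase: stochastic sigmoid rule for global exploration
        pos = (rng.random(pos.shape) < sigmoid(vel)).astype(int)
    else:
        # late phase: deterministic threshold for local exploitation (assumed)
        pos = (sigmoid(vel) > 0.5).astype(int)
    return pos, vel
```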
Step 7, judge whether the termination condition is satisfied; if so, terminate and output the optimal feature set, otherwise increment the iteration counter by 1 and return to Step 3.
A simulation test and experiment on the yeast data set are carried out below to further describe the effect achievable by the method of the present invention. The experiment extracts the yeast sample data set, including the yeast sample feature matrix and the sample label matrix; the description of the data set covers its number of samples, number of features, and number of labels. Each row of the feature matrix is a sample of the yeast data set and each column is a feature; each row of the label matrix is a sample of the yeast data set and each column is a label. The detailed description of the yeast data set is given in Table 1.
Table 1. Detailed description of the yeast sample data set
Following Wang Chenxi et al., "Multi-label feature selection algorithm fusing feature ranking" (Computer Engineering and Applications, 2016, 52(17): 93-100), four evaluation metrics are chosen to analyze and measure the experimental results: AP (Average Precision), CV (Coverage), HL (Hamming Loss), and RL (Ranking Loss).
Let the test set be given. From the function f_l(x), a ranking function rank_f(x, l) ∈ {1, 2, …, L} can be defined.
AP: in the predicted label rankings of all samples, the average probability that a label ranked before a label of the sample also belongs to that sample. It is defined as follows:
where R_i = {l | Y_il = +1} denotes the set of labels relevant to sample x_i, and its complement R̄_i = {l | Y_il = −1} denotes the set of labels irrelevant to sample x_i.
CV: measures, on average over all samples, how many steps down the ranked label list must be searched to cover all labels relevant to the sample. It is defined as follows:
HL: measures the misclassification of samples on individual labels. It is defined as follows:
RL: the average probability, over all samples, that an irrelevant label is ranked before a relevant label. It is defined as follows:
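HL and RL have standard definitions that can be sketched directly. The patent's formula images are omitted, so tie handling, the 0/1 label encoding, and the exact normalization below are assumptions:

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """HL: fraction of sample-label pairs that are misclassified."""
    return float(np.mean(Y_true != Y_pred))

def ranking_loss(Y_true, scores):
    """RL: average fraction of (relevant, irrelevant) label pairs in which
    the irrelevant label is scored at least as high as the relevant one.
    Labels are assumed encoded as 1 (relevant) / 0 (irrelevant)."""
    losses = []
    for y, s in zip(Y_true, scores):
        rel = np.flatnonzero(y == 1)
        irr = np.flatnonzero(y == 0)
        if rel.size == 0 or irr.size == 0:
            continue  # undefined for all-relevant or all-irrelevant samples
        bad = sum(s[i] >= s[r] for r in rel for i in irr)
        losses.append(bad / (rel.size * irr.size))
    return float(np.mean(losses))
```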
To verify the effectiveness of this method, the algorithms of Zhang and Zhou, "Multilabel dimensionality reduction via dependence maximization" (ACM Transactions on Knowledge Discovery from Data (TKDD), 2010, 4(3): 14) (MDDMspc, MDDMproj) and Yu and Wang, "Feature selection for multi-label learning using mutual information and GA" (International Conference on Rough Sets and Knowledge Technology, Springer, Cham, 2014: 454-463) (MLFSIE) are used as comparison experiments, and the method of Zhang and Zhou, "ML-KNN: A lazy learning approach to multi-label learning" (Pattern Recognition, 2007, 40(7): 2038-2048) (ML-kNN) is used to evaluate the data set after selection. The smoothing parameter s of ML-kNN is set to 1 and the number of neighbors k is set to 10. In addition, the MDDMspc and MDDMproj algorithms produce a feature ranking; to compare the classification performance of the feature subsets obtained by each method, the top k features of the ranking are taken as the feature subset in the experiments.
Table 2 below lists the experimental results of the four methods on the yeast data set. For each evaluation metric, the symbol "↑" indicates that a larger value means better classification performance, and "↓" indicates that a smaller value means better performance; the best result among the compared methods is shown in bold.
Table 2. Comparison of classification performance on the Yeast data set
From Table 2 it can be found, for the four evaluation metrics AP, CV, HL, and RL, that the number of features retained by the CFS-NBPSO algorithm reaches 10% of the original data, achieving good dimensionality reduction, and that the classification performance obtained by CFS-NBPSO on all experimental data sets is better than that of the MDDMspc, MDDMproj, and MLFSIE algorithms. Although the MDDMspc and MDDMproj algorithms obtain fewer features, they also eliminate some classification-relevant features while removing redundant ones, causing their classification performance to drop sharply, which conflicts with the purpose of dimensionality reduction.
In summary, among the 16 comparison results in the table (4 evaluation metrics × 4 algorithms), the method of the present invention obtains the optimal value in 100% of cases. The above experimental analysis fully shows that the classification performance induced by the feature subset obtained by the proposed method is substantially better than that of the other three comparison algorithms.
Although the content of the present invention has been described in detail through the preferred embodiments above, it should be understood that the above description is not to be regarded as limiting the present invention. After those skilled in the art have read the above, various modifications and substitutions of the present invention will be apparent. Therefore, the protection scope of the present invention should be limited by the appended claims.

Claims (8)

1. A yeast multi-label feature selection method based on particle swarm optimization, characterized by comprising the following steps:
extracting a yeast sample data set, the yeast sample data set comprising a yeast sample feature matrix and a sample label matrix;
extracting the number of features of the yeast sample data set, initializing a binary-coded particle swarm, and initializing the positions and velocities of the swarm;
constructing a CFS evaluation criterion function combining label correlation by measuring the redundancy between features, the correlation between features and labels, and the correlation between labels;
calculating the fitness of each particle according to the CFS evaluation function combining label correlation;
for each particle, comparing its calculated fitness with the best position pbest it has experienced, and if the calculated fitness is better than the experienced best position pbest, taking the calculated fitness as its experienced best position pbest;
and taking the best among the pbest of all particles as the group best position gbest;
iteratively updating the positions and velocities of the particles, the features corresponding to the value 1 in the finally obtained group best position gbest being the optimal feature subset of the yeast data set.
2. The yeast multi-label feature selection method based on particle swarm optimization according to claim 1, characterized in that updating the positions and velocities of the particles comprises:
judging whether t < γN_iter holds, where γ is a random number in [0, 1] and N_iter is the total number of iterations;
if t < γN_iter, the position of the j-th dimension of the i-th particle in the t-th iteration is updated as:
where the first quantity is the velocity of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function on (0, 1), regenerated at every iteration; and a logistic function gives the position of the particle from its velocity;
otherwise, the position of the j-th dimension of the i-th particle in the t-th iteration is updated as:
where the first quantity is the velocity of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function on (0, 1), regenerated at every iteration; and a logistic function gives the position of the particle from its velocity.
3. The yeast multi-label feature selection method based on particle swarm optimization according to claim 1, characterized in that the CFS evaluation function combining label correlation is:
where CFS(S) is the evaluation value of the candidate subset S containing k features, combining the average correlation between the yeast candidate feature subset S and the label set L, the average correlation among the labels within the label set L, and the average redundancy between the features in the yeast candidate feature subset S.
4. The yeast multi-label feature selection method based on particle swarm optimization according to claim 1, characterized in that, before the fitness of each particle is calculated, the method further comprises the step of constraining the number of positions valued 1 in each particle to n:
counting the number h of positions valued 1 in each particle:
if h > n, randomly resetting h − n positions valued 1 to 0;
if h < n, randomly setting n − h positions valued 0 to 1.
5. A yeast multi-label feature selection device based on particle swarm optimization, characterized by comprising a processor, the processor being configured to execute instructions implementing the following method:
extracting a yeast sample data set, the yeast sample data set comprising a yeast sample feature matrix and a sample label matrix;
extracting the number of features of the yeast sample data set, initializing a binary-coded particle swarm, and initializing the positions and velocities of the swarm;
constructing a CFS evaluation criterion function combining label correlation by measuring the redundancy between features, the correlation between features and labels, and the correlation between labels; calculating the fitness of each particle according to the CFS evaluation function combining label correlation;
for each particle, comparing its calculated fitness with the best position pbest it has experienced, and if the calculated fitness is better than the best position pbest, taking the calculated fitness as its experienced best position pbest;
and taking the best among the pbest of all particles as the group best position gbest;
iteratively updating the positions and velocities of the particles, the features corresponding to the value 1 in the finally obtained group best position gbest being the optimal feature subset of the yeast data set.
6. The yeast multi-label feature selection device based on particle swarm optimization according to claim 5, characterized in that updating the positions and velocities of the particles comprises:
judging whether t < γN_iter holds, where γ is a random number in [0, 1] and N_iter is the total number of iterations;
if t < γN_iter, the position of the j-th dimension of the i-th particle in the t-th iteration is updated as:
where the first quantity is the velocity of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function on (0, 1), regenerated at every iteration; and a logistic function gives the position of the particle from its velocity;
otherwise, the position of the j-th dimension of the i-th particle in the t-th iteration is updated as:
where the first quantity is the velocity of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function on (0, 1), regenerated at every iteration; and a logistic function gives the position of the particle from its velocity.
7. The yeast multi-label feature selection device based on particle swarm optimization according to claim 5, characterized in that the CFS evaluation function combining label correlation is:
where CFS(S) is the evaluation value of the candidate subset S containing k features, combining the average correlation between the yeast candidate feature subset S and the label set L, the average correlation among the labels within the label set L, and the average redundancy between the features in the yeast candidate feature subset S.
8. The yeast multi-label feature selection device based on particle swarm optimization according to claim 5, characterized in that, before the fitness of each particle is calculated, the method further comprises the step of constraining the number of positions valued 1 in each particle to n:
counting the number h of positions valued 1 in each particle:
if h > n, randomly resetting h − n positions valued 1 to 0;
if h < n, randomly setting n − h positions valued 0 to 1.
CN201810380973.0A 2018-04-25 2018-04-25 A kind of saccharomycete multiple labeling feature selection approach and device based on particle group optimizing Pending CN108805162A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810380973.0A CN108805162A (en) 2018-04-25 2018-04-25 A kind of saccharomycete multiple labeling feature selection approach and device based on particle group optimizing

Publications (1)

Publication Number Publication Date
CN108805162A true CN108805162A (en) 2018-11-13

Family

ID=64092989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810380973.0A Pending CN108805162A (en) 2018-04-25 2018-04-25 A kind of saccharomycete multiple labeling feature selection approach and device based on particle group optimizing

Country Status (1)

Country Link
CN (1) CN108805162A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110211638A (en) * 2019-05-28 2019-09-06 河南师范大学 A kind of Gene Selection Method and device considering gene-correlation degree
CN111340741A (en) * 2020-01-03 2020-06-26 中北大学 Particle swarm optimization gray level image enhancement method based on quaternion and L1 norm

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678680A (en) * 2013-12-25 2014-03-26 吉林大学 Image classification method based on region-of-interest multi-element spatial relation model
CN105608004A (en) * 2015-12-17 2016-05-25 云南大学 CS-ANN-based software failure prediction method
CN106991447A (en) * 2017-04-06 2017-07-28 哈尔滨理工大学 A kind of embedded multi-class attribute tags dynamic feature selection algorithm
CN107541544A (en) * 2016-06-27 2018-01-05 卡尤迪生物科技(北京)有限公司 Methods, systems, kits, uses and compositions for determining a microbial profile

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Jianhua et al.: "Analysis of the discrete binary particle swarm optimization algorithm", Journal of Nanjing University (Natural Science) *
Zhao Lei: "Research on multi-label feature selection methods based on a random search strategy", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110211638A (en) * 2019-05-28 2019-09-06 河南师范大学 A kind of Gene Selection Method and device considering gene-correlation degree
CN110211638B (en) * 2019-05-28 2023-03-24 河南师范大学 Gene selection method and device considering gene correlation
CN111340741A (en) * 2020-01-03 2020-06-26 中北大学 Particle swarm optimization gray level image enhancement method based on quaternion and L1 norm
CN111340741B (en) * 2020-01-03 2023-05-09 中北大学 Particle swarm optimization gray image enhancement method based on quaternion and L1 norm

Similar Documents

Publication Publication Date Title
US11803591B2 (en) Method and apparatus for multi-dimensional content search and video identification
CN110363282B (en) Network node label active learning method and system based on graph convolution network
Song et al. A hybrid evolutionary computation approach with its application for optimizing text document clustering
US20180018566A1 (en) Finding k extreme values in constant processing time
CN109359135B (en) Time sequence similarity searching method based on segment weight
CN105095494A (en) Method for testing categorical data set
Sharma et al. Hierarchical maximum likelihood clustering approach
Zeng et al. Learning a mixture model for clustering with the completed likelihood minimum message length criterion
Killamsetty et al. Automata: Gradient based data subset selection for compute-efficient hyper-parameter tuning
CN108805162A (en) A kind of saccharomycete multiple labeling feature selection approach and device based on particle group optimizing
Guo et al. Dual-view ranking with hardness assessment for zero-shot learning
Daniel Loyal et al. A Bayesian nonparametric latent space approach to modeling evolving communities in dynamic networks
CN114897085A (en) Clustering method based on closed subgraph link prediction and computer equipment
CN105205349B (en) The Embedded Gene Selection Method based on encapsulation of Markov blanket
Zhang et al. A new data selection principle for semi-supervised incremental learning
CN116340839B (en) Algorithm selecting method and device based on ant lion algorithm
CN115208651B (en) Flow clustering anomaly detection method and system based on reverse habituation mechanism
CN110796198A (en) High-dimensional feature screening method based on hybrid ant colony optimization algorithm
Landgrebe et al. The ROC skeleton for multiclass ROC estimation
Gertheiss et al. Feature selection and weighting by nearest neighbor ensembles
Devi et al. An efficient document clustering using hybridised harmony search K-means algorithm with multi-view point
Abudalfa et al. Semi-supervised target-dependent sentiment classification for micro-blogs
Xie et al. Churn prediction with linear discriminant boosting algorithm
Ouadfel et al. Bio-inspired algorithms for multilevel image thresholding
Wu et al. Improved prior selection using semantics in maximum a posteriori for few-shot learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181113