CN108805162A - Yeast multi-label feature selection method and device based on particle swarm optimization - Google Patents
Yeast multi-label feature selection method and device based on particle swarm optimization
- Publication number
- CN108805162A CN108805162A CN201810380973.0A CN201810380973A CN108805162A CN 108805162 A CN108805162 A CN 108805162A CN 201810380973 A CN201810380973 A CN 201810380973A CN 108805162 A CN108805162 A CN 108805162A
- Authority
- CN
- China
- Prior art keywords
- particle
- saccharomycete
- feature
- label
- speed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Abstract
The present invention relates to a yeast multi-label feature selection method and device based on particle swarm optimization. An evaluation criterion function for candidate feature subsets is constructed from the correlation between yeast features and labels, the redundancy between features, and the correlation between labels, and is used as the fitness function of a discrete particle swarm method in order to select an optimal feature subset from the yeast data set. The present invention not only selects an effective feature subset, providing a compact and accurate subset for subsequent tasks, but also reduces the time complexity and computational complexity of the classifier and improves classification performance.
Description
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a yeast multi-label feature selection method and device based on particle swarm optimization.
Background art
In the traditional supervised learning framework, each learning object has one and only one class label, and the labels are mutually exclusive and independent. For example, in a gender classification problem there is only the label "gender", whose value is either "male" or "female"; label values never overlap. In real life, however, a single label often cannot accurately describe a complex object: one object may be associated with multiple class labels, and correlations may exist between the labels. For example, in text classification, a news report titled "Yang Shuan discusses Olympic preparations" could be filed under the "sports", "transportation", "weather", "economy", and "politics" sections; in image classification, a single picture may be associated with multiple semantic labels such as "beach", "sea", and "coconut tree"; and in music emotion analysis, depending on the emotions expressed, a single song may simultaneously carry labels such as "cheerful", "sad", and "nostalgic". Objects with multiple labels are seen everywhere in daily life, and multi-label classification has therefore attracted extensive research attention in recent years.
Bioinformatics is a field in which multi-label learning is widely applied. The yeast S cell cycle expression data set is a common multi-label learning data set and a typical bioinformatics task: the task is to predict whether each yeast sample is associated with labels in 14 functional categories. In such applications, a certain hierarchical structure often exists among the labels and has been discovered by domain experts, for example tree- and directed-acyclic-graph-structured functional categories and gene topological structures. Multi-label learning techniques should therefore make good use of the relationships between these labels.
Yeast function prediction faces a series of challenges. On the one hand, each yeast sample may have many possible class labels, and certain correlations exist between these labels, so multi-label learning needs to take the correlation between labels into account. On the other hand, because yeast data are described by high-dimensional gene sequences, yeast samples exhibit two major characteristics, a large sample count and a high vector dimension, which make yeast feature selection a machine learning problem with very high running time and space complexity; the excessive dimensionality of these data restricts our ability to understand and model them. In the prior art there are some feature selection methods for yeast data. For example, some are embedded feature selection methods based on prediction risk, which evaluate each feature and finally obtain an optimal feature subset. Such a method is tightly coupled to the classifier and the evaluation index, and is therefore likely to require a long computation time and to achieve low dimensionality-reduction efficiency.
Summary of the invention
The purpose of the present invention is to provide a yeast multi-label feature selection method and device based on particle swarm optimization, in order to solve the problems of long computation time and low efficiency of feature selection methods in the prior art.
In order to solve the above technical problems, the technical scheme is that:
The present invention provides a yeast multi-label feature selection method based on particle swarm optimization, including the following steps:
extracting a yeast sample data set, the yeast sample data set including multiple yeast sample feature matrices and sample label matrices;
extracting the feature count of the yeast sample data set, initializing a binary-coded particle swarm, and initializing the positions and velocities of the particle swarm;
constructing a CFS evaluation criterion function combined with label correlation by measuring the correlation between features and labels, the redundancy between features, and the correlation between labels;
calculating the fitness value of each particle according to the CFS evaluation function combined with label correlation;
for each particle, comparing the calculated fitness value with the best position pbest it has experienced, and if the calculated fitness value is better than the experienced best position pbest, taking the calculated fitness value as the particle's experienced best position pbest;
taking the best of the pbest positions of all particles as the best position gbest of the swarm;
iteratively updating the positions and velocities of the particles, the features corresponding to the values of 1 in the finally obtained best position gbest of the swarm being the optimal feature subset of the yeast data set.
Further, updating the positions and velocities of the particles includes:
judging whether t < γ·Niter holds, where γ is a random number in [0, 1] and Niter is the total number of iterations;
if t < γ·Niter, the position of the i-th particle in dimension j at iteration t is updated by the original sigmoid rule:
x_ij^(t+1) = 1 if rand() < s(v_ij^(t+1)), and x_ij^(t+1) = 0 otherwise,
where v_ij^t is the velocity of the i-th particle in dimension j at iteration t, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, Niter; rand() is a uniformly distributed random function with values between 0 and 1, regenerated at each iteration; and s(v) = 1/(1 + e^(−v)) is the logistic function, which derives the particle's position from its velocity;
otherwise, the position of the i-th particle in dimension j at iteration t is updated by the modified formula with enhanced local search ability, with the same symbol definitions.
Global search ability is needed in the early stage and local search ability in the later stage; different formulas are therefore used to update the positions and velocities of the particles in the two situations.
Further, in the CFS evaluation function combined with label correlation, CFS(S) is the evaluation value of a candidate subset S containing k features, computed from: the average correlation between the yeast candidate feature subset S and the label set L; the average correlation between labels within the yeast label set L; and the average redundancy between features within the yeast candidate feature subset S.
Further, before the fitness value of each particle is calculated, the method includes a step of controlling the number of 1-valued positions in each particle to n:
counting the number h of positions whose value is 1 in each particle;
if h > n, randomly resetting h − n of the 1-valued positions to 0;
if h < n, randomly setting n − h of the 0-valued positions to 1.
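As a minimal sketch, the repair step above can be written as follows; the helper name and data layout are illustrative, not from the patent:

```python
import random

def repair(particle, n):
    """Force a binary particle to have exactly n ones.

    If the particle has too many selected features (ones), randomly
    reset the surplus to 0; if too few, randomly promote zeros to 1.
    """
    ones = [i for i, bit in enumerate(particle) if bit == 1]
    zeros = [i for i, bit in enumerate(particle) if bit == 0]
    h = len(ones)
    if h > n:
        for i in random.sample(ones, h - n):
            particle[i] = 0
    elif h < n:
        for i in random.sample(zeros, n - h):
            particle[i] = 1
    return particle
```

A particle that already has exactly n ones is returned unchanged, so the repair is idempotent.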
The present invention also provides a yeast multi-label feature selection device based on particle swarm optimization, including a processor, the processor executing instructions to implement the following method:
extracting a yeast sample data set, the yeast sample data set including multiple yeast sample feature matrices and sample label matrices;
extracting the feature count of the yeast sample data set, initializing a binary-coded particle swarm, and initializing the positions and velocities of the particle swarm;
constructing a CFS evaluation criterion function combined with label correlation by measuring the correlation between features and labels, the redundancy between features, and the correlation between labels;
calculating the fitness value of each particle according to the CFS evaluation function combined with label correlation;
for each particle, comparing the calculated fitness value with the best position pbest it has experienced, and if the calculated fitness value is better than the experienced best position pbest, taking the calculated fitness value as the particle's experienced best position pbest;
taking the best of the pbest positions of all particles as the best position gbest of the swarm;
iteratively updating the positions and velocities of the particles, the features corresponding to the values of 1 in the finally obtained best position gbest of the swarm being the optimal feature subset of the yeast data set.
Further, updating the positions and velocities of the particles includes:
judging whether t < γ·Niter holds, where γ is a random number in [0, 1] and Niter is the total number of iterations;
if t < γ·Niter, the position of the i-th particle in dimension j at iteration t is updated by the original sigmoid rule:
x_ij^(t+1) = 1 if rand() < s(v_ij^(t+1)), and x_ij^(t+1) = 0 otherwise,
where v_ij^t is the velocity of the i-th particle in dimension j at iteration t, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, Niter; rand() is a uniformly distributed random function with values between 0 and 1, regenerated at each iteration; and s(v) = 1/(1 + e^(−v)) is the logistic function, which derives the particle's position from its velocity;
otherwise, the position of the i-th particle in dimension j at iteration t is updated by the modified formula with enhanced local search ability, with the same symbol definitions.
Further, in the CFS evaluation function combined with label correlation, CFS(S) is the evaluation value of a candidate subset S containing k features, computed from: the average correlation between the yeast candidate feature subset S and the label set L; the average correlation between labels within the yeast label set L; and the average redundancy between features within the yeast candidate feature subset S.
Further, before the fitness value of each particle is calculated, the method includes a step of controlling the number of 1-valued positions in each particle to n:
counting the number h of positions whose value is 1 in each particle;
if h > n, randomly resetting h − n of the 1-valued positions to 0;
if h < n, randomly setting n − h of the 0-valued positions to 1.
Beneficial effects of the present invention:
The yeast multi-label feature selection method and device based on particle swarm optimization of the present invention construct the evaluation criterion function of candidate feature subsets from the correlation between features and labels, the redundancy between features, and the correlation between labels, and use it as the fitness function of the discrete particle swarm method in order to select the optimal yeast feature subset from the data set. The present invention not only selects an effective feature subset, providing a compact and accurate subset for subsequent tasks, but also reduces the time complexity and computational complexity of the classifier and improves classification performance.
Description of the drawings
Fig. 1 is a flowchart of the method of the present invention.
Detailed description of the embodiments
In order to select the features most relevant to the prediction task from the yeast data set and thereby provide a compact, accurate feature subset for subsequent work, the present invention provides a yeast multi-label feature selection device based on particle swarm optimization. The device includes a processor that executes code instructions stored in a memory to implement the particle-swarm-optimization-based yeast multi-label feature selection method of the present invention. The method is described in detail below with reference to the accompanying drawings.
As a whole, the method is based on correlation and the discrete particle swarm method. It constructs the evaluation criterion function of candidate feature subsets from the correlation between yeast features and labels, the redundancy between features, and the correlation between labels; this function measures the degree of correlation between features and the classification problem and serves as the fitness function of the discrete particle swarm method, which searches the feature space randomly and thereby selects an optimal feature subset.
The particle swarm optimization method and the correlation-based feature selection method are first introduced below.
One, the particle swarm optimization method
The particle swarm optimization (Particle Swarm Optimization, PSO) method was proposed by Eberhart and Kennedy in 1995 and grew out of studies of the foraging behavior of bird flocks. Compared with other evolutionary methods, its greatest advantages are its simple implementation and its strong global optimization ability. In this method, each particle adjusts its own direction and velocity through a comprehensive analysis of the individual and the swarm, and finds the optimal solution by iteration.
In the PSO method, each particle has a velocity that determines its direction and position. During each iteration of the optimization, each particle tracks the best position pbest it has found itself and the global best position gbest found by all particles, and uses these to determine its next move. Every particle has a fitness value determined by the function being optimized.
For a D-dimensional search space, at iteration t, suppose the current position of the i-th particle is Xi = (xi1, xi2, …, xiD), where xij is the position of the i-th particle in dimension j, j = 1, 2, …, D; its velocity is Vi = (vi1, vi2, …, viD), where vij is the velocity of the i-th particle in dimension j; the best position pbest found so far by the particle is Pi = (pi1, pi2, …, piD), where pij is the particle's best position in dimension j; and the best position gbest found by the entire swarm is Pg = (g1, g2, …, gD), where gj is the swarm's best position in dimension j. The swarm updates itself through the individual optimum and the global optimum. At iteration t + 1, the i-th particle updates its velocity and position in dimension j according to:
v_ij^(t+1) = w·v_ij^t + c1·rand()·(p_ij − x_ij^t) + c2·rand()·(g_j − x_ij^t)
x_ij^(t+1) = x_ij^t + v_ij^(t+1)
where v_ij^t is the velocity of the i-th particle in dimension j at iteration t, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, Niter; rand() is a uniformly distributed random function with values between 0 and 1, regenerated at each iteration; w is the inertia factor, which determines how much of the velocity of the previous iteration the particle inherits; c1 and c2 are acceleration factors, usually c1 = c2 = 2, which embody the particle's ability to learn from the best individuals in the swarm; and x_ij^t is the position of the i-th particle in dimension j at iteration t.
PSO may also specify a maximum velocity vmax to bound the maximum distance a particle can move in one iteration; the velocity of each particle is limited to the range [−vmax, vmax], and any velocity exceeding vmax is set to vmax.
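One iteration of the standard continuous PSO update, including the vmax clamp, can be sketched as follows; the function name and list-based data layout are illustrative assumptions, not from the patent:

```python
import random

def pso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0, v_max=4.0):
    """One velocity/position update for a single particle.

    v <- w*v + c1*rand()*(pbest - x) + c2*rand()*(gbest - x),
    clamped to [-v_max, v_max]; then x <- x + v.
    """
    for j in range(len(x)):
        v[j] = (w * v[j]
                + c1 * random.random() * (pbest[j] - x[j])
                + c2 * random.random() * (gbest[j] - x[j]))
        v[j] = max(-v_max, min(v_max, v[j]))  # limit to maximum speed
        x[j] = x[j] + v[j]
    return x, v
```

With pbest and gbest ahead of the current position, the clamped velocity pulls the particle toward both optima in a single step.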
Two, the correlation-based feature selection method
The correlation-based feature selection method (Correlation-based Feature Selection, CFS) measures the redundancy between features and the correlation between features and labels, constructs from these an evaluation criterion function for features, and uses it as the fitness function of a heuristic search; it is a feature selection method that assesses the value of features. For the candidate feature subsets generated by a random search strategy, CFS measures the quality of a feature subset using information gain or the Pearson linear correlation coefficient. Its principle is comparatively simple, its computational complexity is low, it is easy to implement, and it can select an optimal feature subset effectively and efficiently.
The correlation-based feature selection method CFS is a feature selection method that evaluates candidate feature subsets in combination with a heuristic search strategy. Because it uses both the redundancy between features and the correlation between features and labels as the evaluation criterion of the heuristic search, CFS simultaneously considers feature-feature redundancy and feature-label correlation. Single-label CFS and multi-label CFS share the same evaluation criterion:
CFS(S) = k·r̄_cf / sqrt(k + k·(k − 1)·r̄_ff)
where CFS(S) is the evaluation value of a candidate feature subset S containing k features; the larger the value of CFS(S), the closer the relationship between the candidate feature subset S and the classification problem, i.e. the better the subset; r̄_cf is the average correlation between the yeast candidate feature subset S and the label set L; and r̄_ff is the average redundancy between features in the yeast candidate feature subset S.
The correlation-based feature selection method pursues two goals: maximizing the average correlation r̄_cf between the candidate feature subset S and the label set L, in order to improve prediction accuracy; and minimizing the average redundancy r̄_ff between the features in the candidate feature subset S, so as to prevent redundant features from appearing in the candidate subset and degrading classification performance and efficiency.
The average correlation between the feature set and the label set is computed from the correlation r_fl between a single feature f and a single label l. Summing over all labels in the same manner and averaging gives the average correlation r̄_fL between a single feature f and the label set L:
r̄_fL = (1/q)·Σ_{l∈L} r_fl
where q is the number of labels.
Summing over all features in the same manner and averaging then gives the average correlation r̄_cf between the label set L and the feature set S:
r̄_cf = (1/(k·q))·Σ_{f∈S} Σ_{l∈L} r_fl
where k is the number of features, q is the number of labels, and r_fl is the correlation between a single feature and a single label.
The average redundancy r̄_ff between features is obtained by pairing the features in the feature subset two by two, computing the redundancy of each pair, then summing and averaging:
r̄_ff = (1/fp)·Σ_{i<j} r_fifj
where r_fifj is the redundancy between two single features; fi and fj are two different features in the feature subset S; and fp = k·(k − 1)/2 is the number of feature pairs in the feature subset.
When a feature is added to or removed from the candidate feature subset, the correlation-based feature selection method computes the average correlation r̄_cf and the average redundancy r̄_ff to obtain the value of CFS(S), and thereby decides whether the feature is added to the optimal feature subset. Information gain measures the change in information content after a feature is added or deleted, i.e. the change in the predictive ability of the candidate feature subset. Information gain can therefore be used to compute the average correlation between features and labels and the average redundancy between features.
The information gain measure asks how much information the selected candidate feature contributes to the classification problem; the more information it adds, the more relevant the candidate feature is to the classification problem. For a single feature, the amount of information gained or lost after selecting the feature represents the feature's contribution to the classification problem.
For the yeast feature selection problem, the following formulas give the entropy H(L) of the label set L and the conditional entropy H(L|S) of the label set L given the candidate feature subset S:
H(L) = −Σ_l p(l)·log2 p(l)
H(L|S) = −Σ_f p(f)·Σ_l p(l|f)·log2 p(l|f)
where p(l) is the probability that the label set L takes the value l, and p(l|f) is the conditional probability distribution of the label set L given the feature set S.
The information gain of the correlation between the candidate feature subset S and the label set L is therefore:
Gain = H(L) − H(L|S) = H(S) − H(S|L) = H(L) + H(S) − H(S, L)
where H(L) is the entropy of the label set L; H(S) is the entropy of the feature set S; H(L|S) is the conditional entropy of the label set L given the candidate feature subset S; and H(S, L) is the joint entropy of the feature subset S and the label set L.
Information gain, however, faces a problem: whether or not it actually provides more information, a variable with more values always appears to carry more information than a variable with fewer values, which can bias the solution of the classification problem. Symmetric uncertainty (Symmetrical Uncertainty, SU) resolves this problem and normalizes the result to the interval [0, 1]. The symmetric uncertainty formula is:
SU(S, L) = 2·Gain / (H(L) + H(S))
where H(L) is the entropy of the label set L and H(S) is the entropy of the feature set S.
When the information gain method is used, the redundancy between features in the candidate feature subset S is:
r̄_ff = (1/fp)·Σ_{i<j} 2·(H(fi) − H(fi|fj)) / (H(fi) + H(fj))
where fp is the number of feature pairs; H(fi) is the entropy of feature fi; H(fj) is the entropy of feature fj; and H(fi|fj) is the conditional entropy of feature fi given feature fj.
The correlation between the candidate feature subset S and the label set L is:
r̄_cf = (1/(k·q))·Σ_{f∈S} Σ_{l∈L} 2·(H(f) − H(f|l)) / (H(f) + H(l))
where k is the number of features; q is the number of labels; H(l) is the entropy of the label l; H(f) is the entropy of the feature f; and H(f|l) is the conditional entropy of the feature f given the label l.
The CFS criterion function is therefore:
CFS(S) = k·r̄_cf / sqrt(k + k·(k − 1)·r̄_ff)
with the correlation and redundancy terms computed by symmetric uncertainty.
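The entropy, information-gain, and symmetric-uncertainty quantities used above can be sketched for discrete variables as follows; this is a minimal illustration over empirical value sequences, not the patent's implementation:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """H(X) = -sum p(x) * log2 p(x) over the empirical distribution."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def cond_entropy(xs, ys):
    """H(X|Y) computed from the joint empirical distribution."""
    n = len(ys)
    h = 0.0
    for y, cy in Counter(ys).items():
        sub = [x for x, yy in zip(xs, ys) if yy == y]
        h += (cy / n) * entropy(sub)
    return h

def symmetric_uncertainty(xs, ys):
    """SU(X,Y) = 2*Gain / (H(X) + H(Y)), normalized to [0, 1]."""
    gain = entropy(xs) - cond_entropy(xs, ys)
    denom = entropy(xs) + entropy(ys)
    return 0.0 if denom == 0 else 2.0 * gain / denom
```

Identical variables yield SU = 1 and independent ones yield SU = 0, which is the normalization property the text relies on.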
Three, the yeast multi-label feature selection method based on particle swarm optimization of the present invention
In multi-label feature selection, a feature is either selected or not selected, so the continuous particle swarm method cannot handle the feature selection problem directly; the discrete particle swarm method is needed here.
In the discrete particle swarm method, a particle is expressed as a binary vector of 0s and 1s whose length is the total number of features: 1 indicates that the corresponding feature is selected, and 0 indicates that it is not. Velocity is defined as the probability that each feature of the particle takes the value 0 or 1. Compared with the continuous particle swarm method, the discrete particle swarm method initializes and updates particle positions in a binary manner.
In the discrete particle swarm method, the initial positions and velocities are generated randomly: each position x_ij^0 of the i-th particle in dimension j is set to 0 or 1 at random, i = 1, 2, …, m, j = 1, 2, …, D, and each initial velocity v_ij^0 is drawn at random within the range bounded by the maximum velocity vmax, with rand() the uniformly distributed random function with values between 0 and 1, regenerated at each iteration.
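A plausible initialization consistent with the description, random 0/1 positions and velocities drawn uniformly from [-v_max, v_max], is sketched below; the exact initialization formulas are not legible in this text, so treat the details as assumptions:

```python
import random

def init_swarm(m, D, v_max=4.0):
    """Initialize m binary particles of dimension D with random velocities."""
    positions = [[1 if random.random() < 0.5 else 0 for _ in range(D)]
                 for _ in range(m)]
    velocities = [[-v_max + 2 * v_max * random.random() for _ in range(D)]
                  for _ in range(m)]
    return positions, velocities
```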
In each subsequent iteration t, the velocity update formula of the discrete particle swarm is unchanged, while the position update formula becomes:
x_ij^(t+1) = 1 if rand() < s(v_ij^(t+1)), and x_ij^(t+1) = 0 otherwise,
where v_ij^t is the velocity of the i-th particle in dimension j at iteration t, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, Niter; rand() is a uniformly distributed random function with values between 0 and 1, regenerated at each iteration; and s(v) = 1/(1 + e^(−v)) is the logistic function, which derives the particle's position from its velocity.
When the velocity v_ij^(t+1) is large, s(v_ij^(t+1)) is approximately equal to 1, which causes the position of the particle to remain 1 and prevents the discrete particle method from searching for the globally optimal solution. A suitable maximum velocity vmax must therefore be chosen to increase the possibility of generating new candidate solutions.
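The sigmoid position rule above, with the velocity clamped to vmax so that s(v) stays away from saturation at 0 or 1, can be sketched as:

```python
import random
from math import exp

def sigmoid(v):
    """Logistic function s(v) = 1 / (1 + e^-v)."""
    return 1.0 / (1.0 + exp(-v))

def update_position(v_row, v_max=4.0):
    """Binary position update: x_j = 1 iff rand() < s(v_j), v clamped."""
    xs = []
    for v in v_row:
        v = max(-v_max, min(v_max, v))  # keep s(v) away from 0 and 1
        xs.append(1 if random.random() < sigmoid(v) else 0)
    return xs
```

With v_max = 4, s(v) stays within roughly [0.018, 0.982], so every bit retains some chance of flipping and new candidate solutions can still be generated.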
As a high-order feature selection method with a random search strategy, the discrete particle swarm method needs to control the size of the selected feature subset, so the values of the particle's positions must be limited:
(1) when the number of 1-valued positions in a particle exceeds the size of the feature subset to be selected, randomly chosen 1-valued positions are set to 0;
(2) when the number of 1-valued positions in a particle is smaller than the size of the feature subset to be selected, randomly chosen 0-valued positions are set to 1.
Compared with the traditional particle swarm method, the discrete particle swarm merely turns the position information of a feature into 0/1 information indicating whether the feature is selected or not. However, the discrete particle swarm method is a random search method lacking local exploitation: as the iterations proceed, the particles become increasingly random and directionless and fail to converge. To solve this problem, a modified position update formula can be used to enhance local search ability, where x_ij^t is the position of the i-th particle in dimension j at iteration t, v_ij^t is its velocity, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, Niter, rand() is a uniformly distributed random function with values between 0 and 1, regenerated at each iteration, and s(·) is the logistic function that derives the particle's position from its velocity.
According to the general principle of heuristic search, the algorithm needs global search ability in the early stage and local search ability in the later stage, so the method is further modified as follows:
if t < γ·Niter, the positions and velocities of the particles are updated with the original discrete particle swarm formulas; otherwise, they are updated with the new transformed formulas,
where γ is a random number in [0, 1] and Niter is the total number of iterations. The method thus uses the original discrete particle swarm in the early stage and the new transformed formulas in the later stage.
The CFS evaluation function maximizes the correlation between features and labels to improve prediction accuracy, and minimizes the redundancy between features to prevent redundant features from appearing in the feature subset and degrading classification performance and efficiency, using information gain or the Pearson correlation coefficient to measure the quality of the feature subset. However, this function does not consider the relationship between labels, which makes the computation inaccurate and impairs classification precision. Labels are interrelated, and the correlations between labels can provide additional useful information; making full use of this information helps to build a better classification model. Assuming that the class labels a yeast sample possesses contribute to it equally, the correlation between labels is added to the CFS evaluation function to improve it. Following the idea of norm normalization, the present invention sets the sum of the correlations between all labels of a sample and the correlations between features and labels to 1, introduces the correlation between labels on the basis of the original evaluation function, and proposes a CFS evaluation function that integrates label correlation. In the new fitness function, CFS(S) is the evaluation value of a candidate feature subset S containing k features; the larger the value of CFS(S), the closer the relationship between the candidate feature subset S and the classification problem, i.e. the better the subset; r̄_cf is the average correlation between the yeast candidate feature subset S and the label set L; r̄_ll is the average correlation between labels within the yeast label set L; and r̄_ff is the average redundancy between features in the yeast candidate feature subset S.
By maximizing the correlation between features and labels and the correlation between labels, i.e. the mean of the two, the accuracy of prediction is improved; by minimizing the redundancy between features, redundant features are prevented from appearing in the feature subset and degrading classification performance and efficiency.
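The modified evaluation function is not reproduced legibly in this text. One plausible reading, in which the label-label correlation term is added to the feature-label term of the standard CFS merit, is sketched below; the combination and weighting are assumptions, not the patent's exact definition:

```python
from math import sqrt

def cfs_with_label_corr(k, r_cf, r_ll, r_ff):
    """Candidate-subset merit for k features (assumed additive form).

    r_cf: average feature-label correlation of the subset
    r_ll: average label-label correlation (the added term)
    r_ff: average feature-feature redundancy
    Larger values indicate a subset more relevant to the labels
    and less internally redundant.
    """
    return k * (r_cf + r_ll) / sqrt(k + k * (k - 1) * r_ff)
```

With r_ll = 0 this reduces to the standard CFS merit, and for a fixed numerator the value decreases as the redundancy r_ff grows.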
The detailed process is as follows:
Step 1, data preprocessing: extract the yeast sample data set, including multiple yeast sample feature matrices and sample label matrices. The description of the yeast data set includes the sample count, feature count, and label count. In the feature matrix, each row is a sample of the yeast data set and each column vector is a feature; in the label matrix, each row is a sample of the yeast data set and each column vector is a label. For example, the original yeast data set awaiting dimensionality reduction is X = {x1, x2, …, xn}, where n is the number of samples and each sample has several features.
Step 2: according to the feature count of the feature training sample set X obtained by feature extraction, initialize the binary-coded particle swarm: the positions and initial velocities of the swarm are randomly generated as a group of initial values, with each position x_ij^0 of the i-th particle in dimension j set to 0 or 1 at random, i = 1, 2, …, m, j = 1, 2, …, D, and each initial velocity v_ij^0 drawn at random within the range bounded by the maximum velocity vmax.
Step 3, constrain the number of 1-valued positions in each particle to n: count the number h of positions with value 1 in each particle. If h > n, randomly reset h − n of the 1-valued positions to 0; otherwise, randomly set n − h of the 0-valued positions to 1.
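The Step 3 repair rule can be sketched as follows (the function and variable names are illustrative):

```python
import numpy as np

def repair(position, n, rng):
    """Force a binary particle to select exactly n features (Step 3).
    If the particle has h > n ones, h - n randomly chosen ones are
    flipped to 0; if h < n, n - h randomly chosen zeros are flipped to 1."""
    position = position.copy()
    ones = np.flatnonzero(position == 1)
    zeros = np.flatnonzero(position == 0)
    h = ones.size
    if h > n:
        position[rng.choice(ones, size=h - n, replace=False)] = 0
    elif h < n:
        position[rng.choice(zeros, size=n - h, replace=False)] = 1
    return position

rng = np.random.default_rng(1)
p = repair(np.array([1, 1, 1, 0, 0, 1, 0, 1]), 3, rng)
```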
Step 4, calculate the fitness of each particle in the swarm according to the CFS evaluation criterion function, so that during feature selection the size of the feature subset is reduced while classifier performance is maintained or even improved; that is, a particle has higher fitness when it enables the classifier to achieve higher classification accuracy with fewer selected features. The fitness formula is as follows:
where CFS(S) is the evaluation value of the candidate feature subset S containing k features; the larger CFS(S), the closer the relationship between the candidate subset S and the classification problem, i.e., the better the candidate subset S. The first term is the average correlation between the yeast candidate feature subset S and the label set L, the second is the average correlation among the labels in the label set L, and the third is the average redundancy among the features within the candidate subset S.
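The patent's exact CFS formula is not reproduced in this text, so the following is only a hedged sketch of a CFS-style fitness under the assumption that correlations are measured by absolute Pearson correlation: it rewards the mean of feature-label and label-label correlation and penalizes within-subset redundancy.

```python
import numpy as np

def cfs_fitness(X, Y, mask):
    """CFS-style merit of a candidate subset (Step 4), sketched under the
    assumption that correlations are absolute Pearson correlations; the
    patent's exact formula may differ."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return -np.inf          # empty subset: worst possible fitness
    S = X[:, cols]

    def mean_abs_corr(A, B):
        # mean absolute Pearson correlation between columns of A and B
        C = np.corrcoef(A, B, rowvar=False)[:A.shape[1], A.shape[1]:]
        return float(np.nanmean(np.abs(C)))

    r_sl = mean_abs_corr(S, Y)   # subset-to-label relevance
    r_ll = mean_abs_corr(Y, Y)   # label-to-label correlation
    r_ss = mean_abs_corr(S, S)   # within-subset redundancy
    return (r_sl + r_ll) / 2.0 - r_ss
```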
Step 5, compare the fitness p_i of the current particle with the particle's personal best pbest and the swarm's global best gbest: if the fitness p_i is better than that of the personal best pbest, update the personal best pbest to the current position with fitness p_i; if p_i is better than that of the global best gbest, update the swarm's global best gbest likewise.
After initialization, the velocities and positions of the particles form a set of random solutions, and the optimal solution is then sought by iteration. In each iteration, a particle updates itself using two best solutions: the best solution found by the particle itself, i.e., the personal best pbest, and the best solution found so far by the whole swarm, called the global best gbest.
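The Step 5 bookkeeping can be sketched as follows (names are illustrative; fitness is maximized, matching the criterion above):

```python
import numpy as np

def update_bests(fitness, positions, pbest_fit, pbest_pos, gbest_fit, gbest_pos):
    """Step 5: refresh each particle's personal best and the swarm's
    global best whenever a better fitness appears. Per-particle arrays
    are updated in place; the global best is returned."""
    for i, f in enumerate(fitness):
        if f > pbest_fit[i]:
            pbest_fit[i] = f
            pbest_pos[i] = positions[i].copy()
        if f > gbest_fit:
            gbest_fit = f
            gbest_pos = positions[i].copy()
    return gbest_fit, gbest_pos
```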
Step 6, according to the personal bests pbest updated in Step 5 and the swarm's global best gbest, calculate each particle's velocity and new position.
The swarm searches for the optimal solution through the continuous movement of its individual particles. Each particle determines its direction of motion from two components: its own current personal best and the global best of all particles. Each particle represents a point in the D-dimensional space, and its next position is determined by its current position and velocity. In the early phase of the algorithm, the velocity and position update formulas (in the standard binary-PSO form) are:
v_ij^{t+1} = w · v_ij^t + c1 · rand() · (pbest_ij − x_ij^t) + c2 · rand() · (gbest_j − x_ij^t)
x_ij^{t+1} = 1 if rand() < S(v_ij^{t+1}), otherwise 0, with the logistic function S(v) = 1 / (1 + e^{−v})
where v_ij^t is the velocity of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function on (0, 1), regenerated at every use; w is the inertia factor, which determines how much of the previous velocity is inherited; c1 and c2 are acceleration factors, usually c1 = c2 = 2, embodying the particles' ability to learn from the best individuals in the swarm; x_ij^t is the position of the j-th dimension of the i-th particle in the t-th iteration; and the logistic function S(·) maps the particle's velocity to its position.
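The early-phase update can be sketched as follows, assuming the standard binary-PSO rule of Kennedy and Eberhart (parameter values are illustrative):

```python
import numpy as np

def bpso_step(v, x, pbest, gbest, w=0.9, c1=2.0, c2=2.0, v_max=6.0, rng=None):
    """Step 6, early phase: standard PSO velocity update followed by
    resampling each position bit through the logistic function of the
    velocity, as in binary PSO."""
    rng = rng if rng is not None else np.random.default_rng()
    r1, r2 = rng.random(v.shape), rng.random(v.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    v = np.clip(v, -v_max, v_max)          # bound the speed by v_max
    s = 1.0 / (1.0 + np.exp(-v))           # logistic function of velocity
    x = (rng.random(v.shape) < s).astype(int)
    return v, x
```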
In the later phase of the algorithm, the position update formula is as follows:
where v_ij^t is the velocity of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function on (0, 1), regenerated at every use; and the logistic function S(v) = 1 / (1 + e^{−v}) maps the particle's velocity to its position.
Compared with traditional PSO, BPSO merely changes the position information into binary form, where 1 indicates that a feature is selected and 0 that it is not. However, BPSO is a random search algorithm that lacks local exploitation: as the iterations proceed, particle movement becomes increasingly random and directionless, and the algorithm may fail to converge. Following the general principle of heuristic random search, the algorithm therefore needs strong global search ability in its early phase and strong local search ability in its later phase.
Step 7, judge whether the termination condition is met; if so, terminate and output the optimal feature set, otherwise increase the iteration count by 1 and return to Step 3.
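Steps 2 through 7 can be combined into one compact search loop. The sketch below uses illustrative names and parameters, omits the Step 3 repair for brevity, and substitutes a stand-in objective for the patent's CFS criterion; it shows only the overall flow, not the patented method itself.

```python
import numpy as np

def bpso_feature_select(fitness, D, m=20, iters=50, v_max=6.0, seed=0):
    """Compact sketch of the loop: random binary initialization (Step 2),
    fitness evaluation (Step 4), personal/global best updates (Step 5),
    sigmoid position resampling (Step 6), and a fixed iteration budget
    as the stopping condition (Step 7)."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(m, D))            # positions
    V = rng.uniform(-v_max, v_max, size=(m, D))    # velocities
    pbest = X.copy()
    pfit = np.array([fitness(x) for x in X])
    gbest = pbest[np.argmax(pfit)].copy()
    for _ in range(iters):
        r1, r2 = rng.random((m, D)), rng.random((m, D))
        V = 0.9 * V + 2.0 * r1 * (pbest - X) + 2.0 * r2 * (gbest - X)
        V = np.clip(V, -v_max, v_max)
        X = (rng.random((m, D)) < 1.0 / (1.0 + np.exp(-V))).astype(int)
        fit = np.array([fitness(x) for x in X])
        better = fit > pfit
        pbest[better] = X[better]
        pfit[better] = fit[better]
        gbest = pbest[np.argmax(pfit)].copy()
    return gbest

# Toy usage with a stand-in objective (maximize the number of selected bits).
best = bpso_feature_select(lambda x: x.sum(), D=8)
```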
A simulation test and experiment on the yeast data set are presented below to further illustrate the effect attainable by the method of the present invention. For this experiment, the yeast sample data set is extracted, comprising a yeast sample feature matrix and a sample label matrix; the description of the yeast data set includes its sample count, feature count, and label count. In the feature matrix, each row is a sample of the yeast data set and each column is a feature; in the label matrix, each row is a sample and each column is a label. The specific description of the yeast data set is shown in Table 1.
Table 1. Description of the yeast sample data set
Following Wang Chenxi et al., "Multi-label feature selection algorithm fusing feature ranking" (Computer Engineering and Applications, 2016, 52(17): 93-100), four evaluation metrics are chosen to analyze and measure the experimental results: AP (Average Precision), CV (Coverage), HL (Hamming Loss), and RL (Ranking Loss).
Let the test set be given. From the learned real-valued function f(x, l), a ranking function rank_f(x, l) ∈ {1, 2, …, L} can be defined.
AP: examines, in the predicted label rankings of all samples, the average probability that a label ranked before a relevant label of the sample is itself a relevant label of that sample. It is defined as follows:
where R_i = {l | Y_il = +1} denotes the set of labels relevant to sample x_i, and R̄_i = {l | Y_il = −1} denotes the set of labels irrelevant to sample x_i.
CV: measures, averaged over all samples, how many steps down the ranking must be searched to cover all labels relevant to the sample. It is defined as follows:
HL: measures the misclassification of samples on individual labels. It is defined as follows:
RL: examines, averaged over all samples, the probability that an irrelevant label is ranked before a relevant label. It is defined as follows:
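Two of the four metrics can be sketched directly from their definitions (numpy-based, names illustrative):

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """HL: fraction of sample-label entries that are misclassified."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    return float(np.mean(Y_true != Y_pred))

def ranking_loss(Y_true, scores):
    """RL: averaged over samples, the fraction of (relevant, irrelevant)
    label pairs that the score function orders incorrectly."""
    losses = []
    for y, s in zip(np.asarray(Y_true), np.asarray(scores)):
        rel, irr = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
        if rel.size == 0 or irr.size == 0:
            continue  # undefined for this sample
        bad = sum(s[r] <= s[q] for r in rel for q in irr)
        losses.append(bad / (rel.size * irr.size))
    return float(np.mean(losses))
```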
To verify the effectiveness of the method, the following algorithms are used for comparison: MDDMspc and MDDMproj, from Zhang and Zhou, "Multilabel dimensionality reduction via dependence maximization" (ACM Transactions on Knowledge Discovery from Data (TKDD), 2010, 4(3): 14), and MLFSIE, from Yu and Wang, "Feature selection for multi-label learning using mutual information and GA" (International Conference on Rough Sets and Knowledge Technology, Springer, Cham, 2014: 454-463). The selected data sets are assessed with ML-kNN, from Zhang and Zhou, "ML-KNN: A lazy learning approach to multi-label learning" (Pattern Recognition, 2007, 40(7): 2038-2048). The smoothing parameter s of ML-kNN is set to 1 and the number of neighbors k to 10. In addition, the MDDMspc and MDDMproj algorithms produce a feature ranking; to compare the classification performance of the feature subsets obtained by each method, the experiment takes the top k features of the ranking as the feature subset.
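For ranking-producing baselines such as the MDDM variants, taking the top-k features as the subset can be sketched as follows (the scores and k are illustrative):

```python
import numpy as np

# Hypothetical importance scores from a feature-ranking method; the
# experiment's subset is simply the k highest-scoring features.
scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
k = 3
subset = np.sort(np.argsort(scores)[::-1][:k])
```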
The experimental results of the four methods on the yeast data set are listed in Table 2. For each evaluation metric, the symbol "↑" indicates that a larger value means better classification performance, and "↓" indicates that a smaller value means better classification performance; the best result among the compared methods is shown in bold.
Table 2. Comparison of classification performance on the yeast data set
From Table 2 it can be found that, for all four evaluation metrics (AP, CV, HL, and RL), the number of features obtained by the CFS-NBPSO reduction is about 10% of the original data, achieving a good dimensionality-reduction effect, and the classification performance obtained by CFS-NBPSO on all experimental data sets is better than that of the MDDMspc, MDDMproj, and MLFSIE algorithms. Although MDDMspc and MDDMproj obtain fewer features, in removing redundant features they also eliminate some features relevant to classification, so their classification performance drops considerably, which conflicts with the purpose of dimensionality reduction.
In summary, across the 16 comparison results in the table (4 evaluation metrics × 4 algorithms), the method of the present invention obtains the optimal value in 100% of cases. The above experimental analysis fully shows that the classification performance induced by the feature subset obtained by the proposed method is substantially better than that of the other three comparison algorithms.
Although the content of the present invention has been described in detail through the preferred embodiments above, it should be understood that the above description is not to be considered a limitation of the present invention. After those skilled in the art have read the above, various modifications and substitutions of the present invention will be apparent. Therefore, the protection scope of the present invention should be limited by the appended claims.
Claims (8)
1. A yeast multi-label feature selection method based on particle swarm optimization, characterized by comprising the following steps:
extracting a yeast sample data set, the yeast sample data set comprising a yeast sample feature matrix and a sample label matrix;
extracting the feature count of the yeast sample data set, initializing a binary-coded particle swarm, and initializing the positions and velocities of the swarm;
constructing a CFS evaluation criterion function incorporating label correlation, by measuring the redundancy between features, the correlation between features and labels, and the correlation between labels;
calculating the fitness of each particle according to the CFS evaluation function incorporating label correlation;
for each particle, comparing its calculated fitness with the best position pbest it has experienced, and, if the calculated fitness is better than that of the experienced best position pbest, taking the current position as its new personal best pbest;
taking the best of all particles' personal bests pbest as the group's best position gbest;
iteratively updating the positions and velocities of the particles, wherein the features corresponding to the value 1 in the finally obtained group best position gbest constitute the optimal feature subset of the yeast data set.
2. The yeast multi-label feature selection method based on particle swarm optimization according to claim 1, characterized in that updating the positions and velocities of the particles comprises:
judging whether t < γ·N_iter is satisfied, where γ is a random number in [0, 1] and N_iter is the total number of iterations;
if t < γ·N_iter, updating the position of the j-th dimension of the i-th particle in the t-th iteration as:
where v_ij^t is the velocity of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function on (0, 1), regenerated at every iteration; and the logistic function maps the particle's velocity to its position;
otherwise, updating the position of the j-th dimension of the i-th particle in the t-th iteration as:
where v_ij^t is the velocity of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function on (0, 1), regenerated at every iteration; and the logistic function maps the particle's velocity to its position.
3. The yeast multi-label feature selection method based on particle swarm optimization according to claim 1, characterized in that the CFS evaluation function incorporating label correlation is:
where CFS(S) is the evaluation value of the candidate subset S containing k features; the first term is the average correlation between the yeast candidate feature subset S and the label set L, the second is the average correlation among the labels in the label set L, and the third is the average redundancy among the features within the candidate subset S.
4. The yeast multi-label feature selection method based on particle swarm optimization according to claim 1, characterized by further comprising, before calculating the fitness of each particle, the step of constraining the number of 1-valued positions in each particle to n:
counting the number h of 1-valued positions in each particle;
if h > n, randomly resetting h − n of the 1-valued positions to 0;
if h < n, randomly setting n − h of the 0-valued positions to 1.
5. A yeast multi-label feature selection device based on particle swarm optimization, characterized by comprising a processor, the processor being configured to execute instructions implementing the following method:
extracting a yeast sample data set, the yeast sample data set comprising a yeast sample feature matrix and a sample label matrix;
extracting the feature count of the yeast sample data set, initializing a binary-coded particle swarm, and initializing the positions and velocities of the swarm;
constructing a CFS evaluation criterion function incorporating label correlation, by measuring the redundancy between features, the correlation between features and labels, and the correlation between labels; calculating the fitness of each particle according to the CFS evaluation function incorporating label correlation;
for each particle, comparing its calculated fitness with the best position pbest it has experienced, and, if the calculated fitness is better than that of the best position pbest, taking the current position as its new personal best pbest;
taking the best of all particles' personal bests pbest as the group's best position gbest;
iteratively updating the positions and velocities of the particles, wherein the features corresponding to the value 1 in the finally obtained group best position gbest constitute the optimal feature subset of the yeast data set.
6. The yeast multi-label feature selection device based on particle swarm optimization according to claim 5, characterized in that updating the positions and velocities of the particles comprises:
judging whether t < γ·N_iter is satisfied, where γ is a random number in [0, 1] and N_iter is the total number of iterations;
if t < γ·N_iter, updating the position of the j-th dimension of the i-th particle in the t-th iteration as:
where v_ij^t is the velocity of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function on (0, 1), regenerated at every iteration; and the logistic function maps the particle's velocity to its position;
otherwise, updating the position of the j-th dimension of the i-th particle in the t-th iteration as:
where v_ij^t is the velocity of the j-th dimension of the i-th particle in the t-th iteration, i = 1, 2, …, m, j = 1, 2, …, D, t = 1, 2, …, N_iter; rand() is a uniformly distributed random function on (0, 1), regenerated at every iteration; and the logistic function maps the particle's velocity to its position.
7. The yeast multi-label feature selection device based on particle swarm optimization according to claim 5, characterized in that the CFS evaluation function incorporating label correlation is:
where CFS(S) is the evaluation value of the candidate subset S containing k features; the first term is the average correlation between the yeast candidate feature subset S and the label set L, the second is the average correlation among the labels in the label set L, and the third is the average redundancy among the features within the candidate subset S.
8. The yeast multi-label feature selection device based on particle swarm optimization according to claim 5, characterized by further comprising, before calculating the fitness of each particle, the step of constraining the number of 1-valued positions in each particle to n:
counting the number h of 1-valued positions in each particle;
if h > n, randomly resetting h − n of the 1-valued positions to 0;
if h < n, randomly setting n − h of the 0-valued positions to 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810380973.0A CN108805162A (en) | 2018-04-25 | 2018-04-25 | A kind of saccharomycete multiple labeling feature selection approach and device based on particle group optimizing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108805162A true CN108805162A (en) | 2018-11-13 |
Family
ID=64092989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810380973.0A Pending CN108805162A (en) | 2018-04-25 | 2018-04-25 | A kind of saccharomycete multiple labeling feature selection approach and device based on particle group optimizing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108805162A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110211638A (en) * | 2019-05-28 | 2019-09-06 | 河南师范大学 | A kind of Gene Selection Method and device considering gene-correlation degree |
CN111340741A (en) * | 2020-01-03 | 2020-06-26 | 中北大学 | Particle swarm optimization gray level image enhancement method based on quaternion and L1 norm |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103678680A (en) * | 2013-12-25 | 2014-03-26 | 吉林大学 | Image classification method based on region-of-interest multi-element spatial relation model |
CN105608004A (en) * | 2015-12-17 | 2016-05-25 | 云南大学 | CS-ANN-based software failure prediction method |
CN106991447A (en) * | 2017-04-06 | 2017-07-28 | 哈尔滨理工大学 | A kind of embedded multi-class attribute tags dynamic feature selection algorithm |
CN107541544A (en) * | 2016-06-27 | 2018-01-05 | 卡尤迪生物科技(北京)有限公司 | Methods, systems, kits, uses and compositions for determining a microbial profile |
Non-Patent Citations follow.
Non-Patent Citations (2)
Title |
---|
LIU Jianhua et al., "Analysis of the discrete binary particle swarm optimization algorithm", Journal of Nanjing University (Natural Science) *
ZHAO Lei, "Research on multi-label feature selection methods based on a random search strategy", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110211638A (en) * | 2019-05-28 | 2019-09-06 | 河南师范大学 | A kind of Gene Selection Method and device considering gene-correlation degree |
CN110211638B (en) * | 2019-05-28 | 2023-03-24 | 河南师范大学 | Gene selection method and device considering gene correlation |
CN111340741A (en) * | 2020-01-03 | 2020-06-26 | 中北大学 | Particle swarm optimization gray level image enhancement method based on quaternion and L1 norm |
CN111340741B (en) * | 2020-01-03 | 2023-05-09 | 中北大学 | Particle swarm optimization gray image enhancement method based on quaternion and L1 norm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11803591B2 (en) | Method and apparatus for multi-dimensional content search and video identification | |
CN110363282B (en) | Network node label active learning method and system based on graph convolution network | |
Song et al. | A hybrid evolutionary computation approach with its application for optimizing text document clustering | |
US20180018566A1 (en) | Finding k extreme values in constant processing time | |
CN109359135B (en) | Time sequence similarity searching method based on segment weight | |
CN105095494A (en) | Method for testing categorical data set | |
Sharma et al. | Hierarchical maximum likelihood clustering approach | |
Zeng et al. | Learning a mixture model for clustering with the completed likelihood minimum message length criterion | |
Killamsetty et al. | Automata: Gradient based data subset selection for compute-efficient hyper-parameter tuning | |
CN108805162A (en) | A kind of saccharomycete multiple labeling feature selection approach and device based on particle group optimizing | |
Guo et al. | Dual-view ranking with hardness assessment for zero-shot learning | |
Daniel Loyal et al. | A Bayesian nonparametric latent space approach to modeling evolving communities in dynamic networks | |
CN114897085A (en) | Clustering method based on closed subgraph link prediction and computer equipment | |
CN105205349B (en) | The Embedded Gene Selection Method based on encapsulation of Markov blanket | |
Zhang et al. | A new data selection principle for semi-supervised incremental learning | |
CN116340839B (en) | Algorithm selecting method and device based on ant lion algorithm | |
CN115208651B (en) | Flow clustering anomaly detection method and system based on reverse habituation mechanism | |
CN110796198A (en) | High-dimensional feature screening method based on hybrid ant colony optimization algorithm | |
Landgrebe et al. | The ROC skeleton for multiclass ROC estimation | |
Gertheiss et al. | Feature selection and weighting by nearest neighbor ensembles | |
Devi et al. | An efficient document clustering using hybridised harmony search K-means algorithm with multi-view point | |
Abudalfa et al. | Semi-supervised target-dependent sentiment classification for micro-blogs | |
Xie et al. | Churn prediction with linear discriminant boosting algorithm | |
Ouadfel et al. | Bio-inspired algorithms for multilevel image thresholding | |
Wu et al. | Improved prior selection using semantics in maximum a posteriori for few-shot learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181113 |