CN105279520A - Optimal character subclass selecting method based on classification ability structure vector complementation - Google Patents
- Publication number
- CN105279520A CN105279520A CN201510621401.3A CN201510621401A CN105279520A CN 105279520 A CN105279520 A CN 105279520A CN 201510621401 A CN201510621401 A CN 201510621401A CN 105279520 A CN105279520 A CN 105279520A
- Authority
- CN
- China
- Prior art keywords
- feature
- classification
- vector
- capacity
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2113—Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
Abstract
The invention addresses the problem that most existing methods use a single value as the classification-ability evaluation criterion of a feature or feature subset, and provides an optimal feature subset selection method based on classification-ability structure vector complementation. The method defines the binary-form feature classification-ability structure vector V=(V1:V2: ...:Vn) and classification-ability structure vector complementary features, computes by bisection the feature separating-ability threshold of each subclass problem, and on this basis selects the optimal feature subset according to the structure complementation maximization principle among the chosen features and a greedy strategy. The method fully considers each feature's different evaluations of the classification abilities for different classes, and follows the structure complementation maximization principle throughout feature selection. It conforms to the natural law of complementary advantages and maximizes the use of feature classification information, thereby obtaining a better feature subset, effectively reducing redundant features, and improving classification prediction accuracy.
Description
Technical field
The invention belongs to the field of machine learning and pattern recognition, and specifically proposes a rational and effective feature subset selection method.
Background art
Feature selection is one of the two main methods of dimensionality reduction. It plays a vital role in machine learning and pattern recognition, is one of the fundamental problems studied in those fields, and is a crucial data preprocessing step in building classifiers. Feature selection chooses, according to some evaluation criterion, a subset of the original feature set that is significant for classification and removes irrelevant or redundant features, thereby reducing the dimension of the original space to m dimensions, much smaller than the original dimension. With the rapid development of the internet and of high-throughput technologies, the era of big data has arrived: data are enormous in quantity and complex in features, which makes research on feature selection algorithms all the more important. In recent years, one of the principal problems faced by feature selection research has been its application to data sets containing thousands of features. Feature selection can make data easier to understand, reduce measurement and storage requirements, reduce training and execution time, and improve prediction performance. In this research direction, how to evaluate the classification ability of features and how to obtain an effective feature subset are key questions.
In recent years, scholars have carried out a great deal of research on feature selection, and many results have been published in domestic journals. These feature selection algorithms share one point in common: every classification-ability measure assigns a feature or feature subset a single score describing the size of its classification ability. It is generally believed that a feature with a larger score has stronger classification ability than one with a smaller score, so features with larger scores are chosen preferentially. However, some work has shown that certain features with small scores should also be selected, and that combinations of features with high classification-ability scores do not always yield the best classification results. Representing a feature's classification ability with a single value gives only an overall evaluation of that ability, and ignores each feature's different evaluations of the classification abilities for different classes.
Summary of the invention
To solve the problems of the above existing methods, the present invention proposes a new optimal feature subset selection method based on classification-ability structure vector complementation. The invention obtains a vectorized classification ability by evaluating each feature's classification ability on the different subclass problems, i.e., it represents a feature's separating ability for the different subclass problems with multiple values, and then selects features or feature subsets according to the principle of classification-ability structure vector complementation. The invention is suitable for classification prediction on multi-class data sets whose number of samples is much smaller than the number of features, such as cancer data sets. Taking a breast cancer data set as an example, the effectiveness of the invention is illustrated in the specific embodiments.
The present invention defines the binary-form feature classification-ability structure vector and classification-ability structure complementary features, computes the threshold of each subclass problem by bisection, and on this basis selects the optimal feature subset according to the structure complementation maximization principle among the chosen features and a greedy strategy. This method both conforms to the natural law of complementary advantages and exploits feature classification information to the utmost, thereby obtaining a better feature subset. Research on feature selection algorithms that consider classification-ability structure complementation is therefore of great significance.
To achieve the above object, the invention discloses the following technical content:
A method for selecting an optimal feature subset based on classification-ability structure vector complementation, characterized in that the method first defines the binary-form feature classification-ability structure vector and completes the computation of each feature's structure vector. The concrete steps are as follows:
For a classification problem with n features and c classes, first adopt the 1-vs-1 scheme to convert it into K = c(c-1)/2 two-class subproblems, each formed by two of the classes. Then adopt the Fisher discriminant ratio (FDR for short) as the measure of a feature's ability to separate a subproblem, denoted f_ij for feature i on subproblem j, and compute f_ij for every feature i = 1, ..., n and every subproblem j = 1, ..., K. Finally, using the threshold obtained by the threshold computation method below, convert every f_ij into 0 or 1, thereby obtaining each feature's binary classification separating-ability structure vector over the subproblems.
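The structure-vector computation just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the function and variable names are chosen here, the array layout (samples × features) is an assumption, and the thresholds are passed in directly, whereas the patent obtains them by the bisection procedure described below.

```python
import numpy as np
from itertools import combinations

def fdr(x1, x2):
    """Fisher discriminant ratio of one feature on two sample groups."""
    m1, m2 = x1.mean(), x2.mean()
    v1, v2 = x1.var(), x2.var()
    return (m1 - m2) ** 2 / (v1 + v2 + 1e-12)  # small eps guards against 0/0

def structure_vectors(X, y, thresholds):
    """Binary classification-ability structure vectors.

    X: (samples, features) data matrix; y: class labels.
    thresholds: one FDR threshold per 1-vs-1 subproblem.
    Returns a (features, K) 0/1 matrix, K = c*(c-1)/2 subproblems.
    """
    classes = np.unique(y)
    subs = list(combinations(classes, 2))        # the 1-vs-1 subproblems
    V = np.zeros((X.shape[1], len(subs)), dtype=int)
    for j, (a, b) in enumerate(subs):
        Xa, Xb = X[y == a], X[y == b]
        for i in range(X.shape[1]):
            V[i, j] = int(fdr(Xa[:, i], Xb[:, i]) >= thresholds[j])
    return V
```

For a 3-class problem this yields K = 3 subproblems, and each row of the returned matrix is one feature's structure vector.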
The threshold of each subclass problem is computed separately by bisection. The concrete steps are as follows:
Because each feature's classification ability differs across subclass problems, a threshold is computed for each subclass problem, giving K thresholds in total. To reduce the time complexity of threshold calculation, a simple binary search is used. Taking the threshold of one subclass problem, formed by two given classes, as an example, the computation proceeds as follows:
First set the initial threshold to the mean of all features' separating-ability (FDR) values on this subclass problem. Sort all features in descending order of their separating-ability values, and assign the maximum and minimum values to the variables Max and Min.
On this basis, the optimal feature subset is then selected using the greedy strategy and classification-ability structure complementation. The concrete steps are as follows:
After the thresholds are determined, the union of the features whose separating ability exceeds the threshold in some subproblem is taken as the initial feature subset. For each feature in the initial subset, compute its total separating ability from its classification-ability structure vector: sum, as a weighted sum, the FDR values of the subproblems whose structure-vector component is 1; this is the feature's total classification ability. Sort the features of the initial subset in descending order of total classification ability.
Examine the features of the initial subset from front to back, comparing each with all features already in the chosen subset. If a feature is structure-vector complementary to every feature already chosen, add it to the subset directly. Otherwise, for all the features related to it by the structure-vector cover relation, compute the OR of each such feature's sample mishit vector with the total sample mishit vector, and choose the feature that increases the number of 1s in the total mishit vector the most; if no feature changes the total mishit vector, choose none. Repeat this process until the total sample mishit vector is the all-ones vector; the resulting subset is the selected optimal feature subset.
Concepts and definitions relevant to the present invention.
Subclass problem:
Given a classification problem with n features and c classes, with feature set F and class set C, the 1-vs-1 scheme converts it into K = c(c-1)/2 two-class subproblems, each formed by two of the classes. Each such two-class subproblem is called a subclass problem.
Feature classification ability:
A measure of a feature's ability to classify a classification problem. The present invention adopts the Fisher discriminant ratio of a feature, FDR = (μ₁ − μ₂)² / (σ₁² + σ₂²), as the feature's classification-ability value for a subproblem, referred to as the FDR value, where μ₁ and μ₂ are the mean values of the feature on the samples of the two classes of the subproblem, and σ₁², σ₂² are its variances on the two classes of samples.
Feature classification-ability structure vector:
The classification-ability FDR values of a feature on all subproblems form a vector, called the classification-ability structure vector of the feature. To simplify computation, the present invention adopts the binary-form structure vector. A threshold must be set to convert each feature's FDR value on each subproblem into 0 or 1: in the present invention, a feature's structure-vector component for a subproblem is defined as 1 if the feature's FDR value on that subproblem is not less than the subproblem's threshold, and 0 otherwise.
Sample mishit vector:
To compute the threshold of each subclass problem and to carry out feature subset selection, the sample mishit vector is introduced, so that the selected subset can classify all samples.
If a sample of class 1 has a feature value that lies between the minimum and maximum of that feature's values over all samples of class 2, this class-1 sample is considered mishit by the feature; otherwise it is hit.
The sample mishit vector of a feature on a subproblem is then recorded as a binary vector with one component per sample: 0 means the corresponding sample is mishit, and 1 means it is hit. This vector is uniquely determined by the feature and the subproblem.
Concatenating a feature's sample mishit vectors over all subproblems gives the sample mishit vector of that feature.
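The definition above can be sketched for one subproblem as follows. This is an illustrative sketch with invented names; the patent states the rule for a class-1 sample against the class-2 range, and applying the symmetric rule to class-2 samples against the class-1 range is an assumption made here for completeness.

```python
import numpy as np

def mishit_vector(xa, xb):
    """Sample mishit vector of one feature on a two-class subproblem.

    xa, xb: the feature's values on the samples of the two classes.
    A sample is scored 1 (hit) when its value falls outside the other
    class's [min, max] range, and 0 (mishit) when it falls inside it.
    Components list the class-a samples first, then the class-b samples.
    """
    hit_a = [int(v < xb.min() or v > xb.max()) for v in xa]
    hit_b = [int(v < xa.min() or v > xa.max()) for v in xb]
    return np.array(hit_a + hit_b, dtype=int)
```

For example, with class-a values (0, 1, 4) and class-b values (3, 5), the value 4 lies inside the class-b range [3, 5] and is therefore mishit, while 0 and 1 lie outside it and are hit.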
1. Cover:
Suppose two features f_i and f_j have structure vectors V(f_i) and V(f_j). If every component of V(f_j) that equals 1 also equals 1 in V(f_i), then feature f_i is said to cover feature f_j; otherwise f_i does not cover f_j.
2. Classification-ability structure vector complementary features:
For features f_i and f_j, if neither structure vector covers the other, i.e., each feature has separating ability (component 1) on some subproblem where the other does not, the two features are called classification-ability structure vector complementary features.
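Under the componentwise reading of the two definitions above, cover and complementarity reduce to simple comparisons on 0/1 vectors. The sketch below encodes that reading; the function names are illustrative and the "neither covers the other" characterization of complementarity follows the definition as reconstructed here.

```python
import numpy as np

def covers(v, w):
    """v covers w when every 1-component of w is also 1 in v."""
    return bool(np.all(v >= w))

def complementary(v, w):
    """Structure-vector complementary: neither vector covers the other,
    i.e. each feature separates some subproblem the other does not."""
    return (not covers(v, w)) and (not covers(w, v))
```

For instance, (1, 0, 0) and (0, 1, 0) are complementary, whereas (1, 1, 0) covers (1, 0, 0), so that pair is not.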
3. Initial feature subset and optimal feature subset
Initial feature subset: after the thresholds are determined, the union of the features whose separating ability exceeds the threshold in some subproblem is taken as the initial feature subset.
Optimal feature subset: the subset chosen from the initial feature subset according to the feature classification-ability structure vector complementation maximization principle and the greedy strategy is called the optimal feature subset.
Compared with the prior art, the feature subset selection method based on classification-ability structure vector complementation disclosed by the invention has the following beneficial effects:
(1) The selection method of the invention not only fully considers each feature's different evaluations of the classification abilities for different classes, but also follows the classification-ability structure complementation maximization principle during feature selection. The method both conforms to the natural law of complementary advantages and exploits feature classification information to the utmost, thereby obtaining a better feature subset, effectively reducing redundant features and improving classification prediction accuracy.
(2) The selection method of the invention overcomes the problem that the classification-ability measures in existing feature selection algorithms all use a single value as the overall evaluation of the classification ability of a feature or feature subset, ignoring each feature's different evaluations for different classes. Experimental results show that the method effectively reduces redundant features and improves classification prediction accuracy, and is effective.
(3) The invention can be used for classification prediction on cancer data sets, improving prediction accuracy; it helps to find the important genes that cause cancer and hence to better study targeted drugs for cancer treatment.
Description of the drawings
Fig. 1 shows the data set of a classification problem;
Fig. 2 is the flow chart of the bisection-based threshold calculation algorithm.
Embodiment
To explain the implementation of the present invention more fully, the invention is further described below with reference to the drawings and embodiments. These embodiments are only illustrative and do not limit the scope of the invention.
Embodiment 1
1. Read the classification problem data set.
The data set of a classification problem is usually a two-dimensional matrix. For example, the data set of a classification problem with n features, c classes and m samples is shown in Fig. 1, where each entry represents the value of one feature for one sample, and an additional entry represents the class of each sample. Table 1 shows the expression values of some feature genes for some samples of the breast cancer (breast) data set, where the second row gives the sample classes, the third row gives the expression values of the first feature on each sample, and so on for the other rows; each column is one sample, i.e., one person's feature expression values and class. All feature values of each sample in the data set are read into a two-dimensional array, and the class of each sample is read into a one-dimensional array.
Table 1: expression values of some feature genes for some samples of the breast cancer (breast) data set
2. Compute each feature's separating-ability value for each subclass problem, i.e., its FDR value.
First adopt the 1-vs-1 scheme to convert the multi-class problem into K = c(c-1)/2 two-class subproblems, each formed by two of the classes. Then adopt the Fisher discriminant ratio as the feature's separating-ability value for each subproblem. The separating ability of feature i for subproblem j, denoted f_ij, is computed as
f_ij = (μ₁ − μ₂)² / (σ₁² + σ₂²),
where μ₁ and μ₂ are the means of feature i on the samples of the two classes of subproblem j, and σ₁², σ₂² are the corresponding variances on the two classes of samples.
According to the above computation, compute the separating ability f_ij of each feature i = 1, ..., n for each subproblem j = 1, ..., K. The separating abilities of a feature for all subproblems then form a vector (f_i1, f_i2, ..., f_iK), called the feature classification-ability structure vector.
3. Compute the threshold of each subclass problem by bisection.
Because each feature's classification ability differs across subclass problems, a threshold is computed for each of the K subclass problems. To reduce the time complexity of threshold calculation, a simple binary search is used. Taking the threshold of one subclass problem as an example, the computation proceeds as follows; the corresponding algorithm flow chart is shown in Fig. 2.
First sort all features in descending order of their separating-ability (FDR) values on this subproblem, and assign the maximum and minimum values to the variables Max and Min. Take the mean of all features' FDR values on this subclass problem as the initial threshold T, and set Flag = 0.
For every feature whose FDR value is below the threshold, clear the corresponding component of its classification-ability structure vector to 0; for every feature whose FDR value is above the threshold, set the corresponding component to 1.
For all features whose structure-vector component is 1, compute the OR of their mishit vectors, denoted H.
If H is the all-ones vector and Flag = 0, take the mean of the FDR values of the features whose component is 1 as the new threshold T, and clear to 0 the component of every feature whose FDR value is below this new threshold. Otherwise, if H is not the all-ones vector, take the mean of the FDR values of the features whose component is 0 as the new threshold T, update Max to the former T, set to 1 the component of every feature whose FDR value is above this new threshold, and set Flag = 1.
Again compute the OR of the mishit vectors of all features whose component is 1.
Repeat this process until H is the all-ones vector and Flag = 1. The threshold T at that point is recorded as the final threshold.
4. Select the optimal feature subset based on the greedy strategy and classification-ability structure complementation; the procedure is shown as Algorithm 1.
After the threshold of each subproblem is determined, the union of the features whose separating ability exceeds the threshold in some subproblem is taken as the initial feature subset. For each feature in the initial subset, compute its total separating ability from its classification-ability structure vector: sum, as a weighted sum, the FDR values of the subproblems whose structure-vector component is 1; this is the feature's total classification ability. Sort the features of the initial subset in descending order of total classification ability.
From the structure vector, compute each feature's mishit vector over all samples: for a given subproblem, if the feature's structure-vector component is 1, the mishit vector over that subproblem's samples is the mishit vector computed earlier; if the component is 0, the mishit vector of that subproblem is the zero vector. Concatenating the mishit vectors of all subproblems gives the feature's mishit vector.
Examine the features of the initial subset from front to back, comparing each with all features already in the chosen subset. If a feature is structure-vector complementary to every chosen feature, add it to the subset directly. Otherwise, for all the features related to it by the structure-vector cover relation, compute the OR of each such feature's sample mishit vector with the total sample mishit vector, and choose the feature that increases the number of 1s in the total mishit vector the most; if no feature changes the total mishit vector, choose none. Repeat this process until the total sample mishit vector is the all-ones vector; the resulting subset is the selected optimal feature subset.
Algorithm 1: optimal feature subset selection based on the greedy strategy and classification-ability structure complementation
Input: the classification separating-ability threshold of each subproblem;
Output: the optimal feature subset S.
Initialize S to the empty set and the total sample mishit vector Hit to the zero vector;
For each feature f: if its FDR value exceeds the threshold in some subproblem, then add f to the initial subset CF;
For each feature in CF: compute its total separating ability;
Sort the features of CF in descending order of total classification ability;
Compute the mishit vector of each feature in CF over all samples;
do
  For each feature f in CF:
    If f is complementary to every feature in S, then add f to S and set Hit = Hit OR mishit(f);
    else
      max = the number of 1s in Hit;
      For each feature g related to f by the cover relation:
        compute b = the number of 1s in Hit OR mishit(g);
        if b > max, then record g as the best candidate and set max = b;
      If a best candidate was found, then add it to S and set Hit = Hit OR its mishit vector;
while Hit is not the all-ones vector.
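The selection loop of Algorithm 1 can be sketched as follows. This is a deliberately simplified sketch: the names are invented, the candidate handling in the non-complementary branch is condensed to "add the feature if it is complementary to all chosen features or improves sample coverage", and the outer do-while repetition of the patent is folded into a single ordered pass; it illustrates the greedy principle rather than reproducing Algorithm 1 exactly.

```python
import numpy as np

def greedy_select(V, hits, order):
    """Greedy subset selection by structure complementation.

    V: (features, K) binary structure vectors.
    hits: (features, samples) concatenated mishit vectors.
    order: feature indices sorted by descending total classification ability.
    Returns the indices of the chosen features.
    """
    chosen = []
    total = np.zeros(hits.shape[1], dtype=int)    # total sample mishit vector
    for f in order:
        # complementary to all chosen features: neither covers the other
        comp = all(not np.all(V[f] >= V[g]) and not np.all(V[g] >= V[f])
                   for g in chosen)
        # does the feature hit at least one sample not yet hit?
        gain = np.bitwise_or(total, hits[f]).sum() > total.sum()
        if comp or gain:
            chosen.append(f)
            total |= hits[f]
        if total.all():                            # every sample is hit
            break
    return chosen
```

In the example below, the second feature is covered by the first and adds no new hit samples, so it is skipped, while the third feature is complementary and completes the coverage.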
Embodiment 2
Experimental results and data of the present invention:
The experimental data set of the present invention is the breast cancer (breast) data set, downloaded from http://www.ccbm.jhu.edu/ (see the list of references). The breast data set contains 5 classes, 9216 features and 54 samples. Traditional objective evaluation indices are used to test the performance of the algorithm, chiefly the number of selected features and the classification prediction accuracy, where the number of selected features is the number of features chosen by the feature selection algorithm, and the classification prediction accuracy is the accuracy obtained by using the chosen feature subset as the input of a classifier. To verify the effectiveness of the proposed method, it is compared with existing feature selection methods such as FCBF, CFS, mRMR and Relief. Because mRMR and Relief only evaluate features and produce a ranking rather than a feature subset, the feature evaluation methods in CFS, mRMR and Relief are combined with the FCBF feature subset selection method to obtain the CFS_FCBF, mRMR_FCBF and Relief_FCBF feature subset selection methods, so that their subsets can be compared with those chosen by the present method. To show the necessity of feature selection, the proposed method is also compared with classification prediction using all features directly (Orig). The classifiers used are naive Bayes (NB), support vector machine (SVM), k-nearest neighbour (KNN), decision tree (C4.5), random forest (RF) and simple classification and regression tree (SCart).
Table 2 shows, for the features in the optimal feature subset chosen by the proposed method on the breast data set, their ranks within their subproblems and by total FDR value, together with their ranks in the other comparison methods and whether they were selected. As can be seen from Table 2, the features chosen by the proposed method all rank near the top within their subproblems, although some rank low in the final ranking and in the rankings of the other existing methods. For example, features 8715_A8715 and 9063_A9063 rank low in the final ranking but high within their subproblems; they are therefore chosen by the proposed method but not by the other methods.
Table 3 compares the number of features selected by the different methods on the breast data set. Table 4 compares the classification prediction accuracy of the proposed method with that of the other methods.
From Tables 3 and 4 it can be seen that the proposed method is superior to the other four methods: it selects relatively fewer features and obtains the highest classification accuracy on every classifier. It can also be seen that classifying after selecting a feature subset with a feature selection algorithm outperforms classification without feature selection.
All this shows that the proposed method is effective and can obtain a good feature subset.
List of references:
A. C. Tan, D. Q. Naiman, L. Xu, R. L. Winslow, D. Geman. Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics, 2005, 21(20): 3896–3904.
Table 2: the optimal feature subset chosen by the proposed method on the breast data set
Table 3: the number of features selected by the different methods on the breast data set
Table 4: comparison of the classification prediction accuracy of the proposed method with the other methods
Claims (1)
1. A method for selecting an optimal feature subset based on classification-ability structure vector complementation, characterized in that the concrete steps of the method are as follows:
Step 1: define the binary-form feature classification-ability structure vector and classification-ability structure complementary features, and compute the structure vector of each feature;
Step 2: compute the feature classification-ability threshold of each subclass problem by bisection;
Step 3: on the basis of the above steps, select the optimal feature subset according to the structure complementation maximization principle among the chosen features and a greedy strategy;
wherein the computation of the feature classification-ability structure vector proceeds as follows: for a classification problem with n features and c classes, first adopt the 1-vs-1 scheme to convert it into K = c(c-1)/2 two-class subproblems, each formed by two of the classes; then adopt the Fisher discriminant ratio as the feature's separating-ability value for each subproblem, referred to as the FDR value and denoted f_ij for feature i on subproblem j, and compute f_ij for every feature i = 1, ..., n and every subproblem j = 1, ..., K; finally, using the threshold obtained by the threshold computation method below, convert every f_ij into 0 or 1, thereby obtaining each feature's classification separating-ability structure vector over the subproblems;
the threshold of a subclass problem is computed as follows: because each feature's classification ability differs across subclass problems, a threshold is computed separately for each of the K subclass problems; to reduce the time complexity of threshold calculation, a simple binary search is used; taking the threshold of the subclass problem formed by two given classes as an example, the computation proceeds as follows:
first sort all features in descending order of their separating-ability (FDR) values on this subproblem, and assign the maximum and minimum values to the variables Max and Min; take the mean of all features' FDR values on this subclass problem as the initial threshold T, and set Flag = 0;
for every feature whose FDR value is below the threshold, clear the corresponding component of its classification-ability structure vector to 0, and for every feature whose FDR value is above the threshold, set the corresponding component to 1; for all features whose component is 1, compute the OR of their mishit vectors, denoted H;
if H is the all-ones vector and Flag = 0, take the mean of the FDR values of the features whose component is 1 as the new threshold T, and clear to 0 the component of every feature whose FDR value is below this new threshold; otherwise, if H is not the all-ones vector, take the mean of the FDR values of the features whose component is 0 as the new threshold T, update Max to the former T, set to 1 the component of every feature whose FDR value is above this new threshold, and set Flag = 1; again compute the OR of the mishit vectors of all features whose component is 1; repeat this process until H is the all-ones vector and Flag = 1, and record the threshold T at that point as the final threshold;
The described optimal feature subset selection method proceeds as follows:

After the threshold is determined, the union of the features whose classification discrimination capacity exceeds the threshold in any subproblem is taken as the initial feature subset. For each feature in the initial feature subset, its total discrimination capacity is computed from its classification capacity structure vector as the weighted sum of the FDR values of the subproblems whose structure-vector components are 1, and the features of the initial subset are sorted in descending order of this total classification capacity. The features are then examined in order from front to back and compared with all features already chosen: if a feature's classification capacity structure vector is complementary to those of all features in the selected subset, the feature is added to the subset directly. Otherwise, for every feature whose classification capacity structure vector is covered, the OR of that feature's sample mistakenly-hit vector with the total sample mistakenly-hit vector is computed, and the feature that increases the number of 1s in the total vector the most is added to the subset; if no feature can change the total vector, none is added. This process is repeated until the total sample mistakenly-hit vector is the all-ones vector, at which point the selected subset is the chosen optimal feature subset.
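The greedy, complementarity-maximising step above amounts to a greedy set cover over sample vectors. The sketch below is an illustration under stated assumptions, not the patent's implementation: features are assumed already sorted by total classification capacity (best first), and each round keeps the feature that turns the most remaining 0s of the running OR vector into 1s (function and variable names are hypothetical):

```python
def greedy_complementary_subset(hit):
    """Greedy sketch of the complementarity-based subset selection.

    hit[f][s] == 1 when feature f correctly handles sample s.  Repeatedly
    pick the feature that adds the most newly-covered samples to the total
    vector; stop when the total vector is all ones or no feature helps.
    """
    n = len(hit[0])
    total = [0] * n            # running OR over the chosen features
    chosen = []
    remaining = list(range(len(hit)))
    while remaining and not all(total):
        # gain = how many 0-components of `total` this feature flips to 1
        gains = [(sum(1 for s in range(n) if hit[f][s] and not total[s]), f)
                 for f in remaining]
        best_gain, best = max(gains, key=lambda g: g[0])  # earliest on ties
        if best_gain == 0:     # no feature changes the total vector
            break
        chosen.append(best)
        remaining.remove(best)
        total = [t | hit[best][s] for s, t in enumerate(total)]
    return chosen
```

Because ties are broken in favour of the earlier (higher-capacity) feature, a feature whose hit vector is fully covered by an already-chosen one is never selected, which is how redundant features are dropped.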
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510621401.3A CN105279520B (en) | 2015-09-25 | 2015-09-25 | Optimal feature subset choosing method based on classification capacity structure vector complementation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105279520A true CN105279520A (en) | 2016-01-27 |
CN105279520B CN105279520B (en) | 2018-07-24 |
Family
ID=55148501
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510621401.3A Expired - Fee Related CN105279520B (en) | 2015-09-25 | 2015-09-25 | Optimal feature subset choosing method based on classification capacity structure vector complementation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105279520B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106991296A (en) * | 2017-04-01 | 2017-07-28 | 大连理工大学 | Ensemble classifier method based on the greedy feature selecting of randomization |
CN109117956A (en) * | 2018-07-05 | 2019-01-01 | 浙江大学 | A kind of determination method of optimal feature subset |
CN109523056A (en) * | 2018-10-12 | 2019-03-26 | 中国平安人寿保险股份有限公司 | Object ability classification prediction technique and device, electronic equipment, storage medium |
CN112802555A (en) * | 2021-02-03 | 2021-05-14 | 南开大学 | Complementary differential expression gene selection method based on mvAUC |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101136141A (en) * | 2007-10-12 | 2008-03-05 | 清华大学 | Vehicle type classification method based on single frequency continuous-wave radar |
CN101154266A (en) * | 2006-09-25 | 2008-04-02 | 郝红卫 | Dynamic selection and circulating integration method for categorizer |
US20090326923A1 (en) * | 2006-05-15 | 2009-12-31 | Panasonic Corporation | Method and apparatus for named entity recognition in natural language
2015-09-25: application CN201510621401.3A filed; granted as CN105279520B; status: not in force (Expired - Fee Related)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090326923A1 (en) * | 2006-05-15 | 2009-12-31 | Panasonic Corporation | Method and apparatus for named entity recognition in natural language |
CN101154266A (en) * | 2006-09-25 | 2008-04-02 | 郝红卫 | Dynamic selection and circulating integration method for categorizer |
CN101136141A (en) * | 2007-10-12 | 2008-03-05 | 清华大学 | Vehicle type classification method based on single frequency continuous-wave radar |
Non-Patent Citations (2)
Title |
---|
SOMOL et al.: "Evaluating stability and comparing output of feature selectors that optimize feature subset cardinality", IEEE * |
Xie Juanying et al.: "Feature selection algorithm based on feature subset discernibility and support vector machines", 《计算机学报》 (Chinese Journal of Computers) * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106991296A (en) * | 2017-04-01 | 2017-07-28 | 大连理工大学 | Ensemble classifier method based on the greedy feature selecting of randomization |
CN106991296B (en) * | 2017-04-01 | 2019-12-27 | 大连理工大学 | Integrated classification method based on randomized greedy feature selection |
CN109117956A (en) * | 2018-07-05 | 2019-01-01 | 浙江大学 | A kind of determination method of optimal feature subset |
CN109117956B (en) * | 2018-07-05 | 2021-08-24 | 浙江大学 | Method for determining optimal feature subset |
CN109523056A (en) * | 2018-10-12 | 2019-03-26 | 中国平安人寿保险股份有限公司 | Object ability classification prediction technique and device, electronic equipment, storage medium |
CN109523056B (en) * | 2018-10-12 | 2023-11-07 | 中国平安人寿保险股份有限公司 | Object capability classification prediction method and device, electronic equipment and storage medium |
CN112802555A (en) * | 2021-02-03 | 2021-05-14 | 南开大学 | Complementary differential expression gene selection method based on mvAUC |
CN112802555B (en) * | 2021-02-03 | 2022-04-19 | 南开大学 | Complementary differential expression gene selection method based on mvAUC |
Also Published As
Publication number | Publication date |
---|---|
CN105279520B (en) | 2018-07-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zheng et al. | A novel hybrid algorithm for feature selection based on whale optimization algorithm | |
CN105279520A (en) | Optimal character subclass selecting method based on classification ability structure vector complementation | |
Allahverdipour et al. | An improved k-nearest neighbor with crow search algorithm for feature selection in text documents classification | |
CN116959725A (en) | Disease risk prediction method based on multi-mode data fusion | |
Nunthanid et al. | Parameter-free motif discovery for time series data | |
Li et al. | Nonlinear semi-supervised metric learning via multiple kernels and local topology | |
Khaleel et al. | An automatic text classification system based on genetic algorithm | |
Morovvat et al. | An ensemble of filters and wrappers for microarray data classification | |
Liping | Feature selection algorithm based on conditional dynamic mutual information | |
Garcıa et al. | On the suitability of numerical performance measures for class imbalance problems | |
Abd-el Fattah et al. | A TOPSIS based method for gene selection for cancer classification | |
Kumar et al. | Review of gene subset selection using modified k-nearest neighbor clustering algorithm | |
Ghaderi Zefrehi et al. | Threshold prediction for detecting rare positive samples using a meta-learner | |
CN106021929A (en) | Filter characteristic selection method based on subclass problem classification ability measurement | |
CN113936246A (en) | Unsupervised target pedestrian re-identification method based on joint local feature discriminant learning | |
Wang et al. | Entropic feature discrimination ability for pattern classification based on neural IAL | |
Sheikhi et al. | A Novel Scheme for Improving Accuracy of KNN Classification Algorithm Based on the New Weighting Technique and Stepwise Feature Selection | |
Chen | Research on Cost-sensitive Classification Methods for Imbalanced Data | |
Khanchouch et al. | A comparative study of multi-SOM algorithms for determining the optimal number of clusters | |
Zhang et al. | Source-Free Domain Adaptation for Rotating Machinery Cross-Domain Fault Diagnosis with Neighborhood Reciprocity Clustering | |
Georgiev et al. | Feature selection using Gustafson-Kessel fuzzy algorithm in high dimension data clustering | |
Kianmehr et al. | Effective classification by integrating support vector machine and association rule mining | |
Li et al. | A novel LASSO-based feature weighting selection method for microarray data classification | |
Tewari et al. | Soccer Analytics using Machine Learning | |
Cebron et al. | Active learning in parallel universes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 2018-07-24; Termination date: 2019-09-25