CN101840516A - Feature selection method based on sparse fraction - Google Patents

Feature selection method based on sparse fraction

Info

Publication number
CN101840516A
CN101840516A (application CN201010157827A)
Authority
CN
China
Prior art keywords
data
feature
data set
sparse
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201010157827A
Other languages
Chinese (zh)
Inventor
杨杰 (Yang Jie)
朱林 (Zhu Lin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201010157827A priority Critical patent/CN101840516A/en
Publication of CN101840516A publication Critical patent/CN101840516A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a feature selection method based on the sparse score, in the technical field of information processing. The method comprises the following steps: extracting the data set to be processed; obtaining, by L1-norm minimization, the sparse-representation reconstruction coefficients of each data item in the data set; accumulating, for every feature dimension, the reconstruction error between each data item and its sparse reconstruction formed from the reconstruction coefficients of the corresponding data item, thereby obtaining the sparse score of each feature dimension; and arranging the feature dimensions of the data set in ascending order of their sparse scores, the feature with the smallest sparse score being the most important feature of the data set. The method is robust to noise and outliers, requires no prior information, and is widely applicable, thereby effectively improving classification prediction performance. The invention can be widely used in pattern recognition, machine learning, and data mining, for all kinds of classification, clustering, and data visualization problems.

Description

Feature selection method based on sparse score
Technical field
The present invention relates to a method in the technical field of information processing, specifically a feature selection method based on the sparse score (rendered as "sparse fraction" in the published English title).
Background technology
Feature selection is the task of picking out the most effective features from a given set, i.e., selecting an optimal subset of k features from a set of d features (k < d) so as to reduce the dimensionality of the feature space. According to the evaluation strategy, feature selection methods fall roughly into two classes. In the first class, feature selection is separated from the classification process and is independent of the particular classifier; these are called filter methods (filter approach). The other class comprises wrapper methods (wrapper approach), which merge feature selection with classifier design, so that feature selection performance is tightly coupled to the classifier adopted. The biggest difference between the two classes is the evaluation strategy used, and each has strengths and weaknesses: filter methods are simple to implement and highly efficient, but when applied to classification their accuracy falls short of wrapper methods; wrapper methods achieve higher classification performance, but their feature selection is less efficient and transfers less well across different classifiers than filter methods.
Early filter methods based on scoring criteria mainly include feature selection by Variance Score and by Fisher Score, the two simplest and most widely used feature selection methods. The Variance Score method keeps the features of largest variance in the data set and discards those of small variance, on the grounds that the variance of a feature dimension reflects how representative that dimension is of the whole data set; by computing the variance of each feature dimension, sorting by variance, and choosing the features with larger variance, feature selection is achieved. Because the Variance Score uses only the variance information of the data, it is comparatively crude, and the feature subsets it produces perform poorly on complex data sets. Moreover, the method is unsupervised in nature and ignores the class information of the samples. The Fisher Score method was later proposed to seek features that are effective for classification: if, on some feature dimension, the within-class similarity is high and the between-class similarity is low, that feature is considered good; conversely, if the within-class similarity is low and the between-class similarity is high, the feature is considered poor. The Fisher Score uses the class information of the data and is an effective supervised feature selection method. Experimental results show, however, that under certain conditions the Fisher Score handles multimodality (samples of one class consisting of several independent clusters) and outlier classes unsatisfactorily.
A search of the existing literature finds that Xiaofei He et al. (Xiaofei He, Deng Cai, and Partha Niyogi, "Laplacian Score for Feature Selection", Advances in Neural Information Processing Systems 18 (NIPS 2005), Vancouver, Canada, 2005) proposed, at the 2005 international conference on neural information processing systems, a feature selection method using the Laplacian Score. The method is based on comparing the local preserving power of features: a good feature should be such that if two data points are close to each other, they are also close on that feature, and features satisfying this condition better represent the raw data. By computing the Laplacian Score of every feature dimension, sorting by score, and selecting the features with smaller score values, feature selection is carried out. Experimental results show that the Laplacian Score method is rather sensitive to noisy data and is easily affected by noise points.
A further search finds that, domestically, Zhang Daoqiang et al. (Zhang D.Q., Chen S.C., Zhou Z.H., "Constraint Score: A new filter method for feature selection with pairwise constraints", Pattern Recognition 41(5): 1440-1451, 2008) proposed a new feature selection method using the Constraint Score criterion. The method exploits pairwise supervision information between data points: if two data points belong to the same class, a must-link constraint holds between them, and a good feature should keep them close on that dimension; conversely, if they do not belong to the same class, a cannot-link constraint holds, and a good feature should keep them far apart. Like the Laplacian Score, the Constraint Score judges features by neighbor relations to perform effective feature selection. But because the Constraint Score uses supervision information between samples, the constraints must be specified in advance, and where such prior information is lacking, its applicability is limited.
Summary of the invention
The objective of the present invention is to overcome the above deficiencies of the prior art by providing a feature selection method based on the sparse score. The invention uses the sparse reconstruction matrix between samples to obtain the sparse-representation preserving power of every feature dimension of the data, and thereby proposes a new feature selection method. Because it exploits the invariance of the sparse-representation coefficients under rotation and scaling of the data, the method has the advantage of better preserving the discriminant information of features, and can be readily applied to prediction problems such as classification and clustering in pattern recognition and machine learning.
The present invention is achieved through the following technical solution, comprising the following steps:
In the first step, the data set to be processed, $\{x_i\}_{i=1}^{n}$, is extracted; the data set contains n data items, each comprising m feature dimensions.
In the second step, L1-norm minimization is used to obtain the sparse-representation reconstruction coefficients of each data item in $\{x_i\}_{i=1}^{n}$.
The L1-norm minimization is specifically:

$$\min_{s_i} \|s_i\|_1, \quad \text{s.t. } x_i = X s_i,$$

where X is the data set to be processed, $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$; $s_i = [s_{i1}, \ldots, s_{i,i-1}, 0, s_{i,i+1}, \ldots, s_{in}]^T$ is the sparse-representation reconstruction coefficient vector of $x_i$ in $\{x_i\}_{i=1}^{n}$, whose i-th entry is fixed to zero so that a data item is never reconstructed from itself; and s.t. denotes the constraint.
In the third step, for every feature dimension of the data set to be processed, the reconstruction error between each data item and its sparse reconstruction (formed from the reconstruction coefficients of the corresponding data item) is accumulated, yielding the sparse score of each feature dimension.
The reconstruction-error accumulation is specifically:

$$S(r) = \frac{\sum_{i=1}^{n} \big( x_{ir} - (X s_i)_r \big)^2}{\operatorname{Var}(X(r,:))},$$

where

$$\operatorname{Var}(X(r,:)) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ir} - m_r)^2.$$

Here S(r) is the sparse score of the r-th feature dimension of the data set to be processed, $x_{ir}$ is the r-th feature of the i-th data item, X is the data set to be processed, $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, and $m_r$ is the mean of the r-th feature dimension.
In the fourth step, the feature dimensions of the data set to be processed are arranged in ascending order of their sparse scores; the feature with the smallest sparse score is the most important feature of the data set.
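Steps 3 and 4 can likewise be sketched in Python (illustrative only; this reuses the assumed `sparse_codes` helper above, and the function names are assumptions):

```python
import numpy as np

def sparse_scores(X, S):
    """Sparse score S(r) of every feature dimension (row) of the m-by-n data
    matrix X, given the n-by-n reconstruction coefficient matrix S: the
    accumulated reconstruction error of dimension r over all samples, divided
    by the unbiased variance of that dimension.
    """
    residual = X - X @ S                  # column i is x_i - X s_i
    numerator = (residual ** 2).sum(axis=1)
    variance = X.var(axis=1, ddof=1)      # Var(X(r, :)) with the 1/(n-1) factor
    return numerator / variance

def rank_features(X):
    """Feature indices in ascending order of sparse score, most important first."""
    S = sparse_codes(X)
    return np.argsort(sparse_scores(X, S))
```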
Compared with the prior art, the beneficial effects of the invention are as follows: by exploiting the invariance of the sparse-representation reconstruction coefficients under rotation and scaling, the proposed method is more robust to noise and outlier data and obtains better feature selection results; at the same time, the sparse-score method is an unsupervised feature selection method, so it requires no prior information whatsoever, is widely applicable, and can effectively improve classification prediction performance with high accuracy.
The present invention can be widely used in pattern recognition, machine learning, and data mining, for all kinds of classification, clustering, and data visualization problems.
Description of drawings
Fig. 1 is the two-dimensional visualization of the wine data set in the embodiment;
Fig. 2 shows the classification prediction accuracy on the wine data set in the embodiment;
Fig. 3 compares the 10-fold cross-validated classification prediction accuracy on the wine data set, at different feature dimensionalities, of the existing Variance Score method, the existing Laplacian Score method, and the sparse-score method of the embodiment.
Detailed description of the embodiments
An embodiment of the invention is elaborated below in conjunction with the drawings. The embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation and concrete operating process are given, but the scope of protection of the invention is not limited to the following embodiment.
Embodiment
The embodiment is carried out on the wine data set from the UCI repository. The wine data set is a standard data set in the UCI database; its samples derive from chemical analyses of wines brewed from three different cultivars grown in the same region of Italy. The data set has 178 samples, each a 13-dimensional feature vector. The embodiment ranks these 13 feature dimensions and selects the most important features, specifically comprising the following steps:
In the first step, the data set to be processed, $\{x_i\}_{i=1}^{178}$, is extracted; the data set contains 178 data items, each comprising 13 feature dimensions.
Because the feature dimensions of this data set have different attributes and units and express different meanings, the data must first be normalized. The embodiment normalizes the norm of each sample to 1:

$$\tilde{x}_i = \frac{x_i}{\|x_i\|},$$

where $x_i$ denotes the i-th sample vector of the data set and $\tilde{x}_i$ the normalized vector.
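As an illustrative aside (the helper name is an assumption, in the same sketch style as above), this normalization is a one-liner:

```python
import numpy as np

def normalize_samples(X):
    """Scale every sample (column of the m-by-n matrix X) to unit L2 norm."""
    return X / np.linalg.norm(X, axis=0, keepdims=True)
```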
In the second step, L1-norm minimization is used to obtain the sparse-representation reconstruction coefficients of each data item in the normalized data set $\{\tilde{x}_i\}_{i=1}^{178}$. Specifically:

$$\min_{\tilde{s}_i} \|\tilde{s}_i\|_1, \quad \text{s.t. } \tilde{x}_i = \tilde{X} \tilde{s}_i,$$

where $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{178}]$ is the normalized data set to be processed, $\tilde{s}_i$ is the sparse-representation reconstruction coefficient vector of $\tilde{x}_i$ (with its i-th entry fixed to zero), and s.t. denotes the constraint.
The embodiment thereby obtains the reconstruction coefficient matrix $\tilde{S} = [\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_{178}]$, whose i-th column holds the sparse-representation reconstruction coefficients of $\tilde{x}_i$.
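In terms of the illustrative helpers sketched earlier (assumed names, not part of the patent), this step would read:

```python
# Hypothetical usage; X holds the wine data as a 13 x 178 matrix whose
# columns are the samples.
X_tilde = normalize_samples(X)
S_tilde = sparse_codes(X_tilde)   # 178 x 178 coefficient matrix, zero diagonal
```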
In the third step, for every feature dimension, the reconstruction error between each data item and its sparse reconstruction (formed from the reconstruction coefficients of the corresponding data item) is accumulated, yielding the sparse score of each of the 13 feature dimensions. Specifically:

$$S(r) = \frac{\sum_{i=1}^{178} \big( \tilde{x}_{ir} - (\tilde{X} \tilde{s}_i)_r \big)^2}{\operatorname{Var}(\tilde{X}(r,:))}, \qquad \operatorname{Var}(\tilde{X}(r,:)) = \frac{1}{177} \sum_{i=1}^{178} (\tilde{x}_{ir} - m_r)^2,$$

where S(r) is the sparse score of the r-th feature dimension, $\tilde{x}_{ir}$ is the r-th feature of the i-th normalized data item, and $m_r$ is the mean of the r-th feature dimension.
In the fourth step, the feature dimensions of the data set are arranged in ascending order of their sparse scores; the feature with the smallest sparse score is the most important feature of the data set.
In the embodiment, the feature dimensions arranged in ascending order of sparse score are: 13, 10, 2, 7, 8, 9, 12, 6, 11, 4, 5, 3, 1; that is, the 13th feature dimension is the most important feature and the 10th feature dimension the second most important.
The embodiment extracts the 13th and 10th feature dimensions of the wine data set and plots the two-dimensional visualization of the data set, shown in Fig. 1. As the figure shows, the three different cultivars are well separated by the method of the embodiment.
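A sketch of the Fig. 1 plot, using the copy of the wine data shipped with scikit-learn (which may differ in feature ordering or preprocessing from the UCI copy used in the patent, so the picture need not match Fig. 1 exactly):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)      # 178 samples, 13 features, 3 cultivars
plt.scatter(X[:, 12], X[:, 9], c=y)    # 13th and 10th features (0-indexed)
plt.xlabel("feature 13")
plt.ylabel("feature 10")
plt.title("Wine data on the two top-ranked features")
plt.show()
```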
Further, the feature selection result is applied to classification prediction of the data. Feature subsets of different sizes are extracted in order of feature importance, and a nearest-neighbor classifier predicts the class under 10-fold cross-validation. The resulting prediction accuracies are shown in Fig. 2: with the 8 most important feature dimensions the highest prediction accuracy, 78.18%, is obtained, whereas with the full feature set the accuracy is only 76.02%, showing that the feature selection method of the embodiment can effectively improve classification prediction performance.
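The evaluation protocol can be sketched as follows (illustrative only; the scikit-learn copy of the data, the absence of per-sample normalization here, and the default fold assignment mean the exact 78.18% and 76.02% figures need not be reproduced):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Feature order reported in the embodiment (1-indexed), converted to 0-indexed.
ranking = np.array([13, 10, 2, 7, 8, 9, 12, 6, 11, 4, 5, 3, 1]) - 1

X, y = load_wine(return_X_y=True)
for k in range(1, 14):
    clf = KNeighborsClassifier(n_neighbors=1)   # nearest-neighbor classifier
    acc = cross_val_score(clf, X[:, ranking[:k]], y, cv=10).mean()
    print(f"top {k:2d} features: accuracy = {acc:.4f}")
```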
Fig. 3 and Table 1 show the classification prediction accuracy on the wine data set, using a nearest-neighbor classifier under 10-fold cross-validation, of the existing Variance Score method, the existing Laplacian Score method, and the sparse-score method of the embodiment. It is evident that the feature selection result obtained by the embodiment from the sparse score yields the highest classification prediction accuracy.
Table 1. Classification prediction accuracy of the three feature selection methods on the wine data set (table rendered as an image in the original document).

Claims (3)

1. A feature selection method based on the sparse score, characterized in that it comprises the following steps:
in the first step, extracting the data set to be processed, $\{x_i\}_{i=1}^{n}$, the data set containing n data items, each comprising m feature dimensions;
in the second step, using L1-norm minimization to obtain the sparse-representation reconstruction coefficients of each data item in $\{x_i\}_{i=1}^{n}$;
in the third step, accumulating, for every feature dimension, the reconstruction error between each data item and its sparse reconstruction formed from the reconstruction coefficients of the corresponding data item, to obtain the sparse score of each feature dimension of the data set;
in the fourth step, arranging the feature dimensions of the data set in ascending order of their sparse scores, the feature with the smallest sparse score being the most important feature of the data set.
2. The feature selection method based on the sparse score according to claim 1, characterized in that the L1-norm minimization in the second step is:

$$\min_{s_i} \|s_i\|_1, \quad \text{s.t. } x_i = X s_i,$$

where X is the data set to be processed, $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$; $s_i = [s_{i1}, \ldots, s_{i,i-1}, 0, s_{i,i+1}, \ldots, s_{in}]^T$ is the sparse-representation reconstruction coefficient vector of $x_i$ in $\{x_i\}_{i=1}^{n}$; and s.t. denotes the constraint.
3. The feature selection method based on the sparse score according to claim 1, characterized in that the reconstruction-error accumulation in the third step is:

$$S(r) = \frac{\sum_{i=1}^{n} \big( x_{ir} - (X s_i)_r \big)^2}{\operatorname{Var}(X(r,:))}, \qquad \operatorname{Var}(X(r,:)) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ir} - m_r)^2,$$

where S(r) is the sparse score of the r-th feature dimension, $x_{ir}$ is the r-th feature of the i-th data item of the data set to be processed, $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, and $m_r$ is the mean of the r-th feature dimension.
CN201010157827A 2010-04-27 2010-04-27 Feature selection method based on sparse fraction Pending CN101840516A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010157827A CN101840516A (en) 2010-04-27 2010-04-27 Feature selection method based on sparse fraction

Publications (1)

Publication Number Publication Date
CN101840516A true CN101840516A (en) 2010-09-22

Family

ID=42743878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010157827A Pending CN101840516A (en) 2010-04-27 2010-04-27 Feature selection method based on sparse fraction

Country Status (1)

Country Link
CN (1) CN101840516A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521382A (en) * 2011-12-21 2012-06-27 中国科学院自动化研究所 Method for compressing video dictionary
CN102722578A (en) * 2012-05-31 2012-10-10 浙江大学 Unsupervised cluster characteristic selection method based on Laplace regularization
CN102789490A (en) * 2012-07-04 2012-11-21 苏州大学 Data visualization method and system
CN102789490B (en) * 2012-07-04 2014-11-05 苏州大学 Data visualization method and system
CN103678436B (en) * 2012-09-18 2017-04-12 株式会社日立制作所 Information processing system and information processing method
CN104408480A (en) * 2014-11-28 2015-03-11 安徽师范大学 Feature selection method based on Laplacian operator
CN104408480B (en) * 2014-11-28 2018-05-04 安徽师范大学 A kind of feature selection approach based on Laplacian operators
CN107133643A (en) * 2017-04-29 2017-09-05 天津大学 Note signal sorting technique based on multiple features fusion and feature selecting
CN110210559A (en) * 2019-05-31 2019-09-06 北京小米移动软件有限公司 Object screening technique and device, storage medium
CN110210559B (en) * 2019-05-31 2021-10-08 北京小米移动软件有限公司 Object screening method and device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20100922