CN101840516A - Feature selection method based on sparse fraction - Google Patents

Feature selection method based on sparse fraction

Info

Publication number
CN101840516A
CN101840516A (application CN201010157827A)
Authority
CN
China
Prior art keywords
data
feature
data set
sparse
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201010157827A
Other languages
Chinese (zh)
Inventor
杨杰 (Yang Jie)
朱林 (Zhu Lin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201010157827A priority Critical patent/CN101840516A/en
Publication of CN101840516A publication Critical patent/CN101840516A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a feature selection method based on the sparse score, in the technical field of information processing. The method comprises the following steps: extracting the data set to be processed; obtaining, by L1-norm minimization, the sparse-representation reconstruction coefficients of each data item in the data set; accumulating, for every feature dimension, the reconstruction error between each data item and its sparse reconstruction formed from the reconstruction coefficients of the corresponding data item, thereby obtaining the sparse score of each feature dimension; and arranging the feature dimensions of the data set in ascending order of their sparse scores, the feature with the smallest sparse score being the most important feature of the data set. The method is robust to noise and outliers, requires no prior information, and is widely applicable, thereby effectively improving classification prediction performance. The invention can be widely used in pattern recognition, machine learning, and data mining, for all kinds of classification, clustering, and data visualization problems.

Description

Feature selection method based on sparse score
Technical field
The present invention relates to a method in the technical field of information processing, specifically a feature selection method based on the sparse score (rendered as "sparse fraction" in the published English title).
Background technology
Feature selection is the task of picking out the most effective features from a given set, i.e., selecting an optimal subset of k features from a set of d features (k < d) so as to reduce the dimensionality of the feature space. According to the evaluation strategy, feature selection methods fall roughly into two classes. In the first class, feature selection is separated from the classification process and is independent of the particular classifier; these are called filter methods (filter approach). The other class comprises wrapper methods (wrapper approach), which merge feature selection with classifier design, so that feature selection performance is tightly coupled to the classifier adopted. The biggest difference between the two classes is the evaluation strategy used, and each has strengths and weaknesses: filter methods are simple to implement and highly efficient, but when applied to classification their accuracy falls short of wrapper methods; wrapper methods achieve higher classification performance, but their feature selection is less efficient and transfers less well across different classifiers than filter methods.
Early filter methods based on scoring criteria mainly include feature selection by Variance Score and by Fisher Score, the two simplest and most widely used feature selection methods. The Variance Score method keeps the features of largest variance in the data set and discards those of small variance, on the grounds that the variance of a feature dimension reflects how representative that dimension is of the whole data set; by computing the variance of each feature dimension, sorting by variance, and choosing the features with larger variance, feature selection is achieved. Because the Variance Score uses only the variance information of the data, it is comparatively crude, and the feature subsets it produces perform poorly on complex data sets. Moreover, the method is unsupervised in nature and ignores the class information of the samples. The Fisher Score method was later proposed to seek features that are effective for classification: if, on some feature dimension, the within-class similarity is high and the between-class similarity is low, that feature is considered good; conversely, if the within-class similarity is low and the between-class similarity is high, the feature is considered poor. The Fisher Score uses the class information of the data and is an effective supervised feature selection method. Experimental results show, however, that under certain conditions the Fisher Score handles multimodality (samples of one class consisting of several independent clusters) and outlier classes unsatisfactorily.
A search of the existing literature finds that Xiaofei He et al. (Xiaofei He, Deng Cai, and Partha Niyogi, "Laplacian Score for Feature Selection", Advances in Neural Information Processing Systems 18 (NIPS 2005), Vancouver, Canada, 2005) proposed, at the 2005 international conference on neural information processing systems, a feature selection method using the Laplacian Score. The method is based on comparing the local preserving power of features: a good feature should be such that if two data points are close to each other, they are also close on that feature, and features satisfying this condition better represent the raw data. By computing the Laplacian Score of every feature dimension, sorting by score, and selecting the features with smaller score values, feature selection is carried out. Experimental results show that the Laplacian Score method is rather sensitive to noisy data and is easily affected by noise points.
A further search finds that, domestically, Zhang Daoqiang et al. (Zhang D.Q., Chen S.C., Zhou Z.H., "Constraint Score: A new filter method for feature selection with pairwise constraints", Pattern Recognition 41(5): 1440-1451, 2008) proposed a new feature selection method using the Constraint Score criterion. The method exploits pairwise supervision information between data points: if two data points belong to the same class, a must-link constraint holds between them, and a good feature should keep them close on that dimension; conversely, if they do not belong to the same class, a cannot-link constraint holds, and a good feature should keep them far apart. Like the Laplacian Score, the Constraint Score judges features by neighbor relations to perform effective feature selection. But because the Constraint Score uses supervision information between samples, the constraints must be specified in advance, and where such prior information is lacking, its applicability is limited.
Summary of the invention
The objective of the present invention is to overcome the above deficiencies of the prior art by providing a feature selection method based on the sparse score. The invention uses the sparse reconstruction matrix between samples to obtain the sparse-representation preserving power of every feature dimension of the data, and thereby proposes a new feature selection method. Because it exploits the invariance of the sparse-representation coefficients under rotation and scaling of the data, the method has the advantage of better preserving the discriminant information of features, and can be readily applied to prediction problems such as classification and clustering in pattern recognition and machine learning.
The present invention is achieved through the following technical solution, comprising the following steps:
In the first step, the data set to be processed, $\{x_i\}_{i=1}^{n}$, is extracted; the data set contains n data items, each comprising m feature dimensions.
In the second step, L1-norm minimization is used to obtain the sparse-representation reconstruction coefficients of each data item in $\{x_i\}_{i=1}^{n}$.
The L1-norm minimization is specifically:

$$\min_{s_i} \|s_i\|_1, \quad \text{s.t. } x_i = X s_i,$$

where X is the data set to be processed, $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$; $s_i = [s_{i1}, \ldots, s_{i,i-1}, 0, s_{i,i+1}, \ldots, s_{in}]^T$ is the sparse-representation reconstruction coefficient vector of $x_i$ in $\{x_i\}_{i=1}^{n}$, whose i-th entry is fixed to zero so that a data item is never reconstructed from itself; and s.t. denotes the constraint.
In the third step, for every feature dimension of the data set to be processed, the reconstruction error between each data item and its sparse reconstruction (formed from the reconstruction coefficients of the corresponding data item) is accumulated, yielding the sparse score of each feature dimension.
The reconstruction-error accumulation is specifically:

$$S(r) = \frac{\sum_{i=1}^{n} \big( x_{ir} - (X s_i)_r \big)^2}{\operatorname{Var}(X(r,:))},$$

where

$$\operatorname{Var}(X(r,:)) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ir} - m_r)^2.$$

Here S(r) is the sparse score of the r-th feature dimension of the data set to be processed, $x_{ir}$ is the r-th feature of the i-th data item, X is the data set to be processed, $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, and $m_r$ is the mean of the r-th feature dimension.
In the fourth step, the feature dimensions of the data set to be processed are arranged in ascending order of their sparse scores; the feature with the smallest sparse score is the most important feature of the data set.
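Steps 3 and 4 can likewise be sketched in Python (illustrative only; this reuses the assumed `sparse_codes` helper above, and the function names are assumptions):

```python
import numpy as np

def sparse_scores(X, S):
    """Sparse score S(r) of every feature dimension (row) of the m-by-n data
    matrix X, given the n-by-n reconstruction coefficient matrix S: the
    accumulated reconstruction error of dimension r over all samples, divided
    by the unbiased variance of that dimension.
    """
    residual = X - X @ S                  # column i is x_i - X s_i
    numerator = (residual ** 2).sum(axis=1)
    variance = X.var(axis=1, ddof=1)      # Var(X(r, :)) with the 1/(n-1) factor
    return numerator / variance

def rank_features(X):
    """Feature indices in ascending order of sparse score, most important first."""
    S = sparse_codes(X)
    return np.argsort(sparse_scores(X, S))
```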
Compared with the prior art, the beneficial effects of the invention are as follows: by exploiting the invariance of the sparse-representation reconstruction coefficients under rotation and scaling, the proposed method is more robust to noise and outlier data and obtains better feature selection results; at the same time, the sparse-score method is an unsupervised feature selection method, so it requires no prior information whatsoever, is widely applicable, and can effectively improve classification prediction performance with high accuracy.
The present invention can be widely used in pattern recognition, machine learning, and data mining, for all kinds of classification, clustering, and data visualization problems.
Description of drawings
Fig. 1 is the two-dimensional visualization of the wine data set in the embodiment;
Fig. 2 shows the classification prediction accuracy on the wine data set in the embodiment;
Fig. 3 compares the 10-fold cross-validated classification prediction accuracy on the wine data set, at different feature dimensionalities, of the existing Variance Score method, the existing Laplacian Score method, and the sparse-score method of the embodiment.
Detailed description of the embodiments
An embodiment of the invention is elaborated below in conjunction with the drawings. The embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation and concrete operating process are given, but the scope of protection of the invention is not limited to the following embodiment.
Embodiment
The embodiment is carried out on the wine data set from the UCI repository. The wine data set is a standard data set in the UCI database; its samples derive from chemical analyses of wines brewed from three different cultivars grown in the same region of Italy. The data set has 178 samples, each a 13-dimensional feature vector. The embodiment ranks these 13 feature dimensions and selects the most important features, specifically comprising the following steps:
In the first step, the data set to be processed, $\{x_i\}_{i=1}^{178}$, is extracted; the data set contains 178 data items, each comprising 13 feature dimensions.
Because the feature dimensions of this data set have different attributes and units and express different meanings, the data must first be normalized. The embodiment normalizes the norm of each sample to 1:

$$\tilde{x}_i = \frac{x_i}{\|x_i\|},$$

where $x_i$ denotes the i-th sample vector of the data set and $\tilde{x}_i$ the normalized vector.
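As an illustrative aside (the helper name is an assumption, in the same sketch style as above), this normalization is a one-liner:

```python
import numpy as np

def normalize_samples(X):
    """Scale every sample (column of the m-by-n matrix X) to unit L2 norm."""
    return X / np.linalg.norm(X, axis=0, keepdims=True)
```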
In the second step, L1-norm minimization is used to obtain the sparse-representation reconstruction coefficients of each data item in the normalized data set $\{\tilde{x}_i\}_{i=1}^{178}$. Specifically:

$$\min_{\tilde{s}_i} \|\tilde{s}_i\|_1, \quad \text{s.t. } \tilde{x}_i = \tilde{X} \tilde{s}_i,$$

where $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{178}]$ is the normalized data set to be processed, $\tilde{s}_i$ is the sparse-representation reconstruction coefficient vector of $\tilde{x}_i$ (with its i-th entry fixed to zero), and s.t. denotes the constraint.
The embodiment thereby obtains the reconstruction coefficient matrix $\tilde{S} = [\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_{178}]$, whose i-th column holds the sparse-representation reconstruction coefficients of $\tilde{x}_i$.
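In terms of the illustrative helpers sketched earlier (assumed names, not part of the patent), this step would read:

```python
# Hypothetical usage; X holds the wine data as a 13 x 178 matrix whose
# columns are the samples.
X_tilde = normalize_samples(X)
S_tilde = sparse_codes(X_tilde)   # 178 x 178 coefficient matrix, zero diagonal
```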
In the third step, for every feature dimension, the reconstruction error between each data item and its sparse reconstruction (formed from the reconstruction coefficients of the corresponding data item) is accumulated, yielding the sparse score of each of the 13 feature dimensions. Specifically:

$$S(r) = \frac{\sum_{i=1}^{178} \big( \tilde{x}_{ir} - (\tilde{X} \tilde{s}_i)_r \big)^2}{\operatorname{Var}(\tilde{X}(r,:))}, \qquad \operatorname{Var}(\tilde{X}(r,:)) = \frac{1}{177} \sum_{i=1}^{178} (\tilde{x}_{ir} - m_r)^2,$$

where S(r) is the sparse score of the r-th feature dimension, $\tilde{x}_{ir}$ is the r-th feature of the i-th normalized data item, and $m_r$ is the mean of the r-th feature dimension.
In the fourth step, the feature dimensions of the data set are arranged in ascending order of their sparse scores; the feature with the smallest sparse score is the most important feature of the data set.
In the embodiment, the feature dimensions arranged in ascending order of sparse score are: 13, 10, 2, 7, 8, 9, 12, 6, 11, 4, 5, 3, 1; that is, the 13th feature dimension is the most important feature and the 10th feature dimension the second most important.
The embodiment extracts the 13th and 10th feature dimensions of the wine data set and plots the two-dimensional visualization of the data set, shown in Fig. 1. As the figure shows, the three different cultivars are well separated by the method of the embodiment.
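A sketch of the Fig. 1 plot, using the copy of the wine data shipped with scikit-learn (which may differ in feature ordering or preprocessing from the UCI copy used in the patent, so the picture need not match Fig. 1 exactly):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)      # 178 samples, 13 features, 3 cultivars
plt.scatter(X[:, 12], X[:, 9], c=y)    # 13th and 10th features (0-indexed)
plt.xlabel("feature 13")
plt.ylabel("feature 10")
plt.title("Wine data on the two top-ranked features")
plt.show()
```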
Further, the feature selection result is applied to classification prediction of the data. Feature subsets of different sizes are extracted in order of feature importance, and a nearest-neighbor classifier predicts the class under 10-fold cross-validation. The resulting prediction accuracies are shown in Fig. 2: with the 8 most important feature dimensions the highest prediction accuracy, 78.18%, is obtained, whereas with the full feature set the accuracy is only 76.02%, showing that the feature selection method of the embodiment can effectively improve classification prediction performance.
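The evaluation protocol can be sketched as follows (illustrative only; the scikit-learn copy of the data, the absence of per-sample normalization here, and the default fold assignment mean the exact 78.18% and 76.02% figures need not be reproduced):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Feature order reported in the embodiment (1-indexed), converted to 0-indexed.
ranking = np.array([13, 10, 2, 7, 8, 9, 12, 6, 11, 4, 5, 3, 1]) - 1

X, y = load_wine(return_X_y=True)
for k in range(1, 14):
    clf = KNeighborsClassifier(n_neighbors=1)   # nearest-neighbor classifier
    acc = cross_val_score(clf, X[:, ranking[:k]], y, cv=10).mean()
    print(f"top {k:2d} features: accuracy = {acc:.4f}")
```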
Fig. 3 and Table 1 show the classification prediction accuracy on the wine data set, using a nearest-neighbor classifier under 10-fold cross-validation, of the existing Variance Score method, the existing Laplacian Score method, and the sparse-score method of the embodiment. It is evident that the feature selection result obtained by the embodiment from the sparse score yields the highest classification prediction accuracy.
Table 1. Classification prediction accuracy of the three feature selection methods on the wine data set (table rendered as an image in the original document).

Claims (3)

1. A feature selection method based on the sparse score, characterized in that it comprises the following steps:
in the first step, extracting the data set to be processed, $\{x_i\}_{i=1}^{n}$, the data set containing n data items, each comprising m feature dimensions;
in the second step, using L1-norm minimization to obtain the sparse-representation reconstruction coefficients of each data item in $\{x_i\}_{i=1}^{n}$;
in the third step, accumulating, for every feature dimension, the reconstruction error between each data item and its sparse reconstruction formed from the reconstruction coefficients of the corresponding data item, to obtain the sparse score of each feature dimension of the data set;
in the fourth step, arranging the feature dimensions of the data set in ascending order of their sparse scores, the feature with the smallest sparse score being the most important feature of the data set.
2. The feature selection method based on the sparse score according to claim 1, characterized in that the L1-norm minimization in the second step is:

$$\min_{s_i} \|s_i\|_1, \quad \text{s.t. } x_i = X s_i,$$

where X is the data set to be processed, $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$; $s_i = [s_{i1}, \ldots, s_{i,i-1}, 0, s_{i,i+1}, \ldots, s_{in}]^T$ is the sparse-representation reconstruction coefficient vector of $x_i$ in $\{x_i\}_{i=1}^{n}$; and s.t. denotes the constraint.
3. The feature selection method based on the sparse score according to claim 1, characterized in that the reconstruction-error accumulation in the third step is:

$$S(r) = \frac{\sum_{i=1}^{n} \big( x_{ir} - (X s_i)_r \big)^2}{\operatorname{Var}(X(r,:))}, \qquad \operatorname{Var}(X(r,:)) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ir} - m_r)^2,$$

where S(r) is the sparse score of the r-th feature dimension, $x_{ir}$ is the r-th feature of the i-th data item of the data set to be processed, $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, and $m_r$ is the mean of the r-th feature dimension.
CN201010157827A 2010-04-27 2010-04-27 Feature selection method based on sparse fraction Pending CN101840516A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010157827A CN101840516A (en) 2010-04-27 2010-04-27 Feature selection method based on sparse fraction

Publications (1)

Publication Number Publication Date
CN101840516A true CN101840516A (en) 2010-09-22

Family

ID=42743878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010157827A Pending CN101840516A (en) 2010-04-27 2010-04-27 Feature selection method based on sparse fraction

Country Status (1)

Country Link
CN (1) CN101840516A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521382A (en) * 2011-12-21 2012-06-27 中国科学院自动化研究所 Method for compressing video dictionary
CN102722578A (en) * 2012-05-31 2012-10-10 浙江大学 Unsupervised cluster characteristic selection method based on Laplace regularization
CN102789490A (en) * 2012-07-04 2012-11-21 苏州大学 Data visualization method and system
CN102789490B (en) * 2012-07-04 2014-11-05 苏州大学 Data visualization method and system
CN103678436B (en) * 2012-09-18 2017-04-12 株式会社日立制作所 Information processing system and information processing method
CN104408480A (en) * 2014-11-28 2015-03-11 安徽师范大学 Feature selection method based on Laplacian operator
CN104408480B (en) * 2014-11-28 2018-05-04 安徽师范大学 A kind of feature selection approach based on Laplacian operators
CN107133643A (en) * 2017-04-29 2017-09-05 天津大学 Note signal sorting technique based on multiple features fusion and feature selecting
CN110210559A (en) * 2019-05-31 2019-09-06 北京小米移动软件有限公司 Object screening technique and device, storage medium
CN110210559B (en) * 2019-05-31 2021-10-08 北京小米移动软件有限公司 Object screening method and device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20100922