CN101840516A - Feature selection method based on sparse fraction - Google Patents
Feature selection method based on sparse fraction
- Publication number: CN101840516A
- Application number: CN201010157827A
- Authority
- CN
- China
- Prior art keywords
- data
- feature
- data set
- sparse
- processed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a feature selection method based on a sparse score, in the technical field of information processing. The method comprises the following steps: extracting the data set to be processed; using L1-norm minimization to obtain the sparse-representation reconstruction coefficients of each data item in the data set; accumulating, for each feature dimension, the reconstruction error between each data item and its sparse reconstruction, which yields the sparse score of each feature dimension of the data set; and arranging the feature dimensions by sparse score in ascending order, the feature with the smallest sparse score being the most important feature of the data set. The method is robust to noise and outliers, requires no prior information, and is widely applicable, so it effectively improves classification performance. The invention can be used broadly in pattern recognition, machine learning, and data mining, for classification, clustering, and data-visualization problems.
Description
Technical field
The present invention relates to a method in the technical field of information processing, specifically a feature selection method based on a sparse score.
Background technology
Feature selection (feature selection) means picking the most effective features from a given set: from a group of d features, select an optimal subset of k features (k < d) so as to reduce the dimensionality of the feature space. By evaluation strategy, feature selection methods can be roughly divided into two classes. In the first class, feature selection is separated from the classification process, and the selection is independent of any particular classifier; these are called filter methods (filter approach). The second class, wrapper methods (wrapper approach), merges feature selection with classifier design, so the feature selection performance is closely tied to the classifier used. The main difference between them is the evaluation strategy employed. Each class has strengths and weaknesses: filter methods are simple to implement and highly efficient, but when used for classification and recognition their accuracy falls short of wrapper methods; wrapper methods achieve higher classification performance, but feature selection is less efficient, and they transfer less well across different classifiers than filter methods.
Early filter methods based on scoring functions mainly include feature selection based on the variance score (Variance Score) and on the Fisher score (Fisher Score); both are among the simplest and most widely used feature selection methods. The variance-score method keeps the features with the largest variance in the data set and discards those with small variance. Since the variance of a feature reflects how representative that dimension is of the whole data set, computing the variance of each feature, ranking by variance, and selecting the features with larger variance achieves feature selection. Because the variance score uses only the variance information of the data, it is comparatively simple, and the feature subsets it yields perform poorly on complex data sets. Moreover, the variance score is unsupervised in nature and does not use the class information of the samples. The Fisher-score feature selection method was proposed later. Its idea is to find features that are effective for classification: if, on some feature dimension, within-class similarity is large and between-class similarity is small, the feature is considered good; conversely, if within-class similarity is small and between-class similarity is large, the feature is considered poor. The Fisher score uses the class information of the data and is an effective supervised feature selection method. Experiments show, however, that under certain conditions the Fisher score handles multimodality (a class consisting of several separate clusters) and outlier-class problems poorly.
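As an illustrative sketch (not part of the patent), the variance score described above amounts to ranking features by their variance over the data set; the function name below is an assumption:

```python
import numpy as np

def variance_score(X):
    """Variance score of each feature, where X is the m x n data matrix
    (rows = features, columns = samples); larger variance = more representative."""
    return X.var(axis=1)

def rank_by_variance(X):
    # descending variance: the first index is the feature to keep first
    return list(np.argsort(variance_score(X))[::-1])
```

A feature that is constant across all samples gets variance 0 and is ranked last, matching the "discard low-variance features" idea in the text.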
A search of the existing literature finds that Xiaofei He et al. (Xiaofei He, Deng Cai, and Partha Niyogi, "Laplacian Score for Feature Selection", Advances in Neural Information Processing Systems 18 (NIPS 2005), Vancouver, Canada, 2005) proposed a feature selection method using the Laplacian score at the 2005 international conference on neural information processing systems. The method is based on comparing the locality-preserving power of features: the assumption is that for a good feature, if two data points are close, they should also be close on that feature, and a feature satisfying this condition has stronger power to represent the raw data. Computing the Laplacian score of each feature, ranking by score, and selecting the features with smaller score values performs feature selection. Experimental results show that the Laplacian-score method is rather sensitive to noisy data and easily affected by noise points.
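The Laplacian score cited above can be sketched as follows. This is an illustrative simplification using a fully connected heat-kernel affinity graph rather than the k-nearest-neighbor graph of the original paper; the function name and the parameter t are assumptions:

```python
import numpy as np

def laplacian_score(X, t=1.0):
    """Laplacian score of each feature, where X is the m x n data matrix
    (rows = features, columns = samples). Smaller score = better locality
    preservation. Uses a fully connected heat-kernel graph W_ij = exp(-d_ij^2/t)."""
    pts = X.T                                              # n samples
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    W = np.exp(-d2 / t)
    D = W.sum(axis=1)                                      # degree vector
    L = np.diag(D) - W                                     # graph Laplacian
    scores = []
    for f in X:                                            # one feature over all samples
        f_t = f - (f @ D) / D.sum()                        # remove D-weighted mean
        scores.append((f_t @ L @ f_t) / (f_t @ (D * f_t)))
    return np.array(scores)
```

Since L is positive semidefinite, each score is nonnegative; ranking features ascending by this score mirrors the selection rule described in the text.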
A further search finds that, domestically, Zhang Daoqiang et al. (Zhang DQ, Chen SC, Zhou ZH, "Constraint Score: A new filter method for feature selection with pairwise constraints", Pattern Recognition, vol. 41, no. 5, pp. 1440-1451, 2008) proposed a new feature selection method using a constraint-score criterion. The method uses supervision information between data points (pairwise constraints) to perform feature selection: if two data points belong to the same class, there is a must-link constraint between them, and a good feature should keep them close; conversely, if they do not belong to the same class, there is a cannot-link constraint, and a good feature should keep them far apart. Like the Laplacian score, the constraint score uses neighborhood relations to judge features and perform effective feature selection. But because the constraint-score method uses supervision information between samples, the constraints between specific data points must be given in advance; when little prior information is available, its applicability is limited.
Summary of the invention
The objective of the invention is to overcome the above shortcomings of the prior art by providing a feature selection method based on a sparse score. The invention uses the sparse reconstruction matrix between samples to obtain the sparse-representation preserving power of each feature dimension of the data, thereby yielding a new feature selection method. Because the method exploits the invariance of the sparse-representation coefficients between data points to rotation and scale changes, it better preserves the discriminative information of the features, and can be applied well to prediction problems such as classification and clustering in pattern recognition and machine learning.
The present invention is achieved by the following technical solution and comprises the following steps:
In the first step, the data set to be processed {x_i}_{i=1}^n is extracted; the data set contains n data items, each comprising m feature dimensions.
In the second step, L1-norm minimization is used to obtain the sparse-representation reconstruction coefficients of each data item in {x_i}_{i=1}^n.
The L1-norm minimization is specifically:

min_{s_i} ||s_i||_1   s.t.   x_i = X s_i

where X is the data set to be processed, X = [x_1, x_2, …, x_n] ∈ R^{m×n}; s_i = [s_{i1}, …, s_{i,i-1}, 0, s_{i,i+1}, …, s_{in}]^T is the sparse-representation reconstruction coefficient vector of x_i in {x_i}_{i=1}^n, its i-th entry fixed to 0 so that x_i is not used to reconstruct itself; and s.t. denotes the constraint.
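The minimization above can be cast as a standard linear program by splitting s_i into nonnegative parts u − v. The sketch below is illustrative, not part of the patent: the function name and the choice of SciPy's `linprog` solver are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def sparse_coefficients(X):
    """For each column x_i of the m x n matrix X, solve
         min ||s_i||_1  s.t.  x_i = X s_i,  s_ii = 0,
    via the LP reformulation s_i = u - v with u, v >= 0."""
    m, n = X.shape
    S = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]   # exclude x_i itself (s_ii = 0)
        A = X[:, idx]                            # m x (n-1)
        A_eq = np.hstack([A, -A])                # variables: [u; v]
        c = np.ones(2 * (n - 1))                 # objective: sum(u) + sum(v) = ||s_i||_1
        res = linprog(c, A_eq=A_eq, b_eq=X[:, i], bounds=(0, None), method="highs")
        S[idx, i] = res.x[: n - 1] - res.x[n - 1:]
    return S                                     # column i holds s_i
```

With more data items than feature dimensions (n > m), the equality constraint is generically feasible and each column of X is exactly reconstructed by the others.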
In the third step, for every feature dimension of the data set to be processed, the reconstruction error between each data item and its sparse-representation reconstruction (given by the reconstruction coefficients of the corresponding data item) is accumulated, giving the sparse score of each feature dimension of the data set.
The reconstruction-error accumulation is specifically:

S(r) = Σ_{i=1}^n (x_{ir} − (X s_i)_r)² / Σ_{i=1}^n (x_{ir} − m_r)²

where S(r) is the sparse score of the r-th feature dimension of the data set to be processed; x_{ir} is the r-th feature of the i-th data item of the data set; X is the data set to be processed, X = [x_1, x_2, …, x_n] ∈ R^{m×n}; and m_r is the mean of the r-th feature over the data set.
In the fourth step, the feature dimensions of the data set to be processed are arranged by sparse score in ascending order; the feature with the smallest sparse score is the most important feature of the data set.
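The final step then reduces to an ascending sort of the scores (a minimal sketch; `rank_features` is an illustrative name):

```python
import numpy as np

def rank_features(scores):
    # ascending sparse score: the first index is the most important feature
    return list(np.argsort(scores))
```

Indices here are zero-based, unlike the 1-based feature numbering used in the embodiment below.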
Compared with the prior art, the beneficial effects of the invention are: by exploiting the invariance of the sparse-representation reconstruction coefficients to rotation and scale changes, the proposed method is more robust to noise and outlier data and obtains better feature selection results. At the same time, the sparse-score feature selection method is unsupervised, so it requires no prior information of any kind, is very widely applicable, and effectively improves the accuracy of classification prediction.
The invention can be widely applied to classification, clustering, and data-visualization problems in pattern recognition, machine learning, and data mining.
Description of drawings
Fig. 1 is the two-dimensional visualization of the wine data set in the embodiment;
Fig. 2 shows the classification prediction accuracy on the wine data set in the embodiment;
Fig. 3 compares the 10-fold cross-validation classification accuracy on the wine data set, over different feature-subset sizes, of the existing variance-score method, the existing Laplacian-score method, and the sparse-score method of the embodiment.
Embodiment
An embodiment of the invention is described in detail below with reference to the drawings. The embodiment is implemented on the premise of the technical solution of the invention, and detailed implementation and concrete operating procedures are given, but the protection scope of the invention is not limited to the following embodiment.
Embodiment
This embodiment uses the wine data set from the UCI repository, a standard data set in the UCI database; the samples are chemical analyses of wines brewed from three different cultivars grown in the same region of Italy. The data set has 178 samples, each a 13-dimensional feature vector. The embodiment ranks these 13 features, selects the most important ones, and performs feature selection through the following steps:
In the first step, the data set to be processed {x_i}_{i=1}^178 is extracted; the data set contains 178 data items, each comprising 13 feature dimensions.
Because the feature dimensions in this embodiment have different attributes and units and express different meanings, the data must be normalized. The embodiment normalizes the norm of each sample to 1, specifically:

x̃_i = x_i / ||x_i||

where x_i is the i-th sample vector of the data set and x̃_i is the normalized vector.
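The per-sample normalization can be sketched as follows (assuming, as the text states, each sample's norm is scaled to 1; the function name is an assumption):

```python
import numpy as np

def normalize_samples(X):
    """Scale each sample (column of the m x n data matrix) to unit L2 norm."""
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    return X / norms
```

This removes unit and scale differences between samples before the sparse coefficients are computed.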
In the second step, L1-norm minimization is used to obtain the sparse-representation reconstruction coefficients of each data item in the normalized data set, specifically:

min_{s_i} ||s_i||_1   s.t.   x̃_i = X̃ s_i

where X̃ is the normalized data set to be processed, s_i is the sparse-representation reconstruction coefficient vector of x̃_i, and s.t. denotes the constraint. The embodiment thereby obtains the reconstruction-coefficient matrix S = [s_1, s_2, …, s_178], in which s_i is the sparse-representation reconstruction coefficient vector of x̃_i in the data set.
In the third step, for every feature dimension, the reconstruction error between each data item and its sparse-representation reconstruction is accumulated, giving the sparse score of each of the 13 feature dimensions, where S(r) is the sparse score of the r-th feature dimension of the data set to be processed, x̃_{ir} is the r-th feature of the i-th normalized data item, and m_r is the mean of the r-th feature.
In the fourth step, the feature dimensions of the data set are arranged by sparse score in ascending order; the feature with the smallest sparse score is the most important feature of the data set.
In this embodiment the features, ordered by sparse score from smallest to largest, are: 13, 10, 2, 7, 8, 9, 12, 6, 11, 4, 5, 3, 1; that is, feature 13 is the most important feature and feature 10 the next most important.
Extracting features 13 and 10 of the wine data set and drawing the two-dimensional visualization of the data set, shown in Fig. 1, shows that the three different cultivars are well separated by the method of this embodiment.
Further, the feature selection result is applied to classification prediction: feature subsets of different sizes are extracted in order of feature importance, and a nearest-neighbor classifier predicts the class labels; the embodiment uses 10-fold cross-validation. The resulting prediction accuracies are shown in Fig. 2: using the 8 most important features yields the highest prediction accuracy, 78.18%, whereas using the whole data set yields only 76.02%, showing that the feature selection method of this embodiment effectively improves classification performance.
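The evaluation protocol described here (1-nearest-neighbor classification under 10-fold cross-validation) can be sketched as follows; the random fold split and function names are illustrative assumptions, not from the patent:

```python
import numpy as np

def nn_predict(Xtr, ytr, Xte):
    # 1-nearest-neighbor prediction by squared Euclidean distance
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return ytr[np.argmin(d, axis=1)]

def cv_accuracy(X, y, n_folds=10, seed=0):
    """Mean 1-NN accuracy under n_folds-fold cross-validation.
    X is samples x features, restricted beforehand to the selected columns."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    accs = []
    for fold in np.array_split(idx, n_folds):
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False                       # hold the fold out as a test set
        pred = nn_predict(X[mask], y[mask], X[fold])
        accs.append((pred == y[fold]).mean())
    return float(np.mean(accs))
```

Running this once per feature-subset size, in order of importance, reproduces the kind of accuracy-versus-dimension curve shown in Fig. 2.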
Fig. 3 and Table 1 show the classification prediction accuracies on the wine data set of the existing variance-score method, the existing Laplacian-score method, and the sparse-score method of this embodiment, each using a nearest-neighbor classifier under 10-fold cross-validation. The feature selection result based on the sparse score clearly achieves the highest classification accuracy.
Table 1. Classification prediction accuracy of the three feature selection methods on the wine data set
Claims (3)
1. A feature selection method based on a sparse score, characterized by comprising the following steps:
in the first step, extracting the data set to be processed {x_i}_{i=1}^n, the data set containing n data items, each comprising m feature dimensions;
in the second step, using L1-norm minimization to obtain the sparse-representation reconstruction coefficients of each data item in {x_i}_{i=1}^n;
in the third step, for every feature dimension of the data set to be processed, accumulating the reconstruction error between each data item and its sparse-representation reconstruction, giving the sparse score of each feature dimension of the data set;
in the fourth step, arranging the feature dimensions of the data set by sparse score in ascending order, the feature with the smallest sparse score being the most important feature of the data set to be processed.
2. The feature selection method based on a sparse score according to claim 1, characterized in that the L1-norm minimization in the second step is:

min_{s_i} ||s_i||_1   s.t.   x_i = X s_i

where X is the data set to be processed, X = [x_1, x_2, …, x_n] ∈ R^{m×n}; s_i = [s_{i1}, …, s_{i,i-1}, 0, s_{i,i+1}, …, s_{in}]^T is the sparse-representation reconstruction coefficient vector of x_i in {x_i}_{i=1}^n; and s.t. denotes the constraint.
3. The feature selection method based on a sparse score according to claim 1, characterized in that the reconstruction-error accumulation in the third step is:

S(r) = Σ_{i=1}^n (x_{ir} − (X s_i)_r)² / Σ_{i=1}^n (x_{ir} − m_r)²

where S(r) is the sparse score of the r-th feature dimension of the data set to be processed; x_{ir} is the r-th feature of the i-th data item of the data set; X is the data set to be processed, X = [x_1, x_2, …, x_n] ∈ R^{m×n}; and m_r is the mean of the r-th feature over the data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201010157827A CN101840516A (en) | 2010-04-27 | 2010-04-27 | Feature selection method based on sparse fraction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101840516A true CN101840516A (en) | 2010-09-22 |
Family
ID=42743878
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201010157827A Pending CN101840516A (en) | 2010-04-27 | 2010-04-27 | Feature selection method based on sparse fraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101840516A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102521382A (en) * | 2011-12-21 | 2012-06-27 | 中国科学院自动化研究所 | Method for compressing video dictionary |
CN102722578A (en) * | 2012-05-31 | 2012-10-10 | 浙江大学 | Unsupervised cluster characteristic selection method based on Laplace regularization |
CN102789490A (en) * | 2012-07-04 | 2012-11-21 | 苏州大学 | Data visualization method and system |
CN104408480A (en) * | 2014-11-28 | 2015-03-11 | 安徽师范大学 | Feature selection method based on Laplacian operator |
CN103678436B (en) * | 2012-09-18 | 2017-04-12 | 株式会社日立制作所 | Information processing system and information processing method |
CN107133643A (en) * | 2017-04-29 | 2017-09-05 | 天津大学 | Note signal sorting technique based on multiple features fusion and feature selecting |
CN110210559A (en) * | 2019-05-31 | 2019-09-06 | 北京小米移动软件有限公司 | Object screening technique and device, storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20100922 |