CN106503731A - Unsupervised feature selection method based on conditional mutual information and K-means - Google Patents


Info

Publication number
CN106503731A
Authority
CN
China
Prior art keywords
feature
cluster
feature subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610888945.0A
Other languages
Chinese (zh)
Inventor
马廷淮
邵文晔
曹杰
薛羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN201610888945.0A
Publication of CN106503731A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an unsupervised feature selection method based on conditional mutual information and K-means. Data without class labels are first clustered by multiple runs of the K-means algorithm under different initial conditions; then, on the basis of each clustering, the modularity value of each feature and the conditional mutual information between different features are jointly considered, and a relevance-independence index between features is used to select feature subsets with high relevance and low redundancy. The feature subsets obtained from the different K-means clustering results are aggregated to give the final feature subset. The invention can be effectively applied to unlabeled and imbalanced data sets, and the feature subset it obtains has high relevance and low redundancy.

Description

Unsupervised feature selection method based on conditional mutual information and K-means
Technical field
The invention belongs to the field of feature selection in machine learning, and in particular relates to a method for performing unsupervised feature selection on unlabeled data sets using conditional mutual information and the K-means algorithm.
Background technology
In practical machine learning applications, the number of features is often large; some features may be irrelevant, and features may also depend on one another. The more features there are, the longer it takes to analyze them and train a model, and the more easily the "curse of dimensionality" arises, making the model more complex and degrading its generalization ability. Feature selection is therefore particularly important.
Feature selection, also called feature subset selection or attribute selection, refers to selecting a subset of all features so that the constructed model performs better. Feature selection can remove irrelevant or redundant features, thereby reducing the number of features, improving model accuracy, and shortening running time. On the other hand, selecting only truly relevant features simplifies the model and helps researchers understand the process by which the data were generated.
Depending on whether the search for an optimal feature subset is combined with building a learning model, feature selection methods can be roughly divided into two classes: wrapper feature selection (Wrapper) and filter feature selection (Filter). Wrapper feature selection repeatedly runs a learning algorithm to evaluate the quality of candidate feature sets; it is more accurate than filter feature selection, but its generalization to other classifiers is poor. On high-dimensional data sets, because wrapper feature selection must be tightly coupled with a specific learning algorithm, its computational complexity is very high. Filter feature selection requires no specific learning algorithm; it quickly evaluates feature quality with a suitable criterion and is therefore a computationally more efficient method.
Most existing traditional feature selection methods take improving classification accuracy as the optimization objective, do not fully consider the distribution of the data samples, and generally pursue learning performance on majority classes while easily ignoring that on minority classes. To address data imbalance at the data level, the majority-class samples of the training set can be subsampled before training so that the positive and negative classes are balanced, and learning is then carried out accordingly (Exploratory under-sampling for class-imbalance learning. Liu X Y, Wu J, Zhou Z H); however, this cannot make use of all the data and can reduce classification accuracy. At the algorithm level, traditional feature selection algorithms can be modified according to the imbalance of the class distribution so that they adapt to imbalanced samples (a new feature selection algorithm for the class-imbalance problem: IM-IG. You Mingyu, Chen Yan, Li Guozheng); however, this approach is limited to two-class imbalance and does not apply to multi-class imbalance.
For filter feature selection, many supervised feature selection methods have been proposed, such as evaluating candidate features with mutual information and selecting the top-ranked features as the input of a neural network classifier (Using mutual information for selecting features in supervised neural net learning. R. Battiti); however, this method ignores the redundancy between features, so many redundant features are selected, which does not help improve the performance of the subsequent classifier. Moreover, this method only applies to data with class label information and does not apply to unsupervised feature selection.
In the field of unsupervised feature selection, many unsupervised feature selection methods for text have been proposed, but these methods cannot be directly applied to numeric data. Some methods do apply to numeric data, such as an unsupervised filter feature selection algorithm oriented to categorical features, which is based on a one-pass clustering algorithm, takes the importance exhibited by each feature across different clusters as the basis for judgment, and finally selects a feature subset according to the variation pattern of that importance (Research on unsupervised feature selection methods oriented to categorical features. Wang Lianxi, Jiang Shengyi); however, this method partitions the data with only a single one-pass clustering, so the clustering result is random and the accuracy of feature selection cannot be guaranteed.
The present invention first clusters the data without class labels by multiple K-means runs under different initial conditions; then, on the basis of each clustering, it jointly considers the modularity value of each feature and the conditional mutual information between different features to obtain feature subsets with high relevance and low redundancy, and finally aggregates the feature subsets obtained from the different K-means clustering results.
Content of the invention
Purpose: The technical problem to be solved by the invention is feature selection on unlabeled data sets, for which an unsupervised feature selection method based on conditional mutual information and K-means is proposed. Data without class labels are clustered by multiple K-means runs with different initial conditions, which removes the randomness of selecting features from a single clustering result and reduces the impact of data imbalance on feature selection. On the basis of each clustering, the modularity value of each feature and the conditional mutual information between different features are jointly considered, and a relevance-independence index between features is used to select feature combinations with high relevance and low redundancy. The feature subsets obtained from the different K-means clustering results are aggregated into the final feature subset. The invention can be effectively applied to unlabeled and imbalanced data sets, and the feature subset it obtains has high relevance and low redundancy.
The technical scheme of the invention is as follows:
An unsupervised feature selection method based on conditional mutual information and K-means, comprising the following steps:
Step 1), perform multiple K-means clusterings with different K values and different cluster centers on the unlabeled data set, and obtain each clustering result;
Step 2), according to the different clustering results obtained in step 1), construct the feature vector graph of each feature for each clustering result in turn;
Step 3), according to the feature vector graphs constructed in step 2), calculate the modularity value of each feature, and put the feature with the largest modularity value into the feature subset;
Step 4), according to the initial feature subset obtained in step 3), calculate the conditional mutual information of each remaining feature relative to each feature in the feature subset, and from it calculate the relevance-independence value of each remaining feature relative to the feature subset;
Step 5), add the modularity value of each remaining feature obtained in step 3) and the relevance-independence value obtained in step 4) with a certain weight, and take the result as the score of each remaining feature;
Step 6), put the feature with the highest score obtained in step 5) into the feature subset, then iterate steps 4), 5), and 6) until the number of features in the feature subset reaches the required number;
Step 7), aggregate the feature subsets formed in step 6) from the different K-means clustering results to obtain the final feature subset.
Further, in the unsupervised feature selection method of the invention based on conditional mutual information and K-means, step 1) performs multiple K-means clusterings with different K values and different cluster centers on the unlabeled data set and obtains each clustering result. The invention first applies the K-means clustering algorithm to the unlabeled data set multiple times with different initial values. At initialization, the maximum cluster number and minimum cluster number of the K-means clustering algorithm, as well as the number of clustering runs, are specified manually. For each clustering run, the K-means algorithm randomly chooses a number between the minimum and maximum cluster numbers as the number of clusters k and randomly selects k points in the data set as initial centroids; the K-means clustering algorithm then yields, run by run, the result of each clustering, i.e., the class labels C.
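As an illustration of this step, the following Python sketch runs K-means several times with a randomly drawn k and random initial centroids. It assumes NumPy and scikit-learn are available; the function name multiple_kmeans and its parameters (min_k, max_k, n_runs) are illustrative choices, not names fixed by the patent.

```python
import numpy as np
from sklearn.cluster import KMeans

def multiple_kmeans(X, min_k, max_k, n_runs, seed=0):
    """Cluster X n_runs times, each time with a random k in [min_k, max_k]
    and k randomly chosen data points as initial centroids; return the
    class-label vector C of every run."""
    rng = np.random.default_rng(seed)
    labelings = []
    for _ in range(n_runs):
        k = int(rng.integers(min_k, max_k + 1))
        # init='random' draws k observations from X as the initial
        # centroids, matching the random initialisation described above
        km = KMeans(n_clusters=k, init='random', n_init=1,
                    random_state=int(rng.integers(1 << 31)))
        labelings.append(km.fit_predict(X))
    return labelings
```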
Further, in the unsupervised feature selection method of the invention based on conditional mutual information and K-means, step 2) constructs the feature vector graph of each feature for each clustering result in turn, according to the different clustering results obtained in step 1). To construct the feature vector graph of a given feature in the data set, with the feature values under this feature and the class labels known, each sample is treated as a point; if the class in which a sample lies contains x samples, the point corresponding to that sample is connected to the x-1 sample points whose feature values are closest to its own. Performing this operation on all samples in the data set under the same feature yields the feature vector graph of that feature.
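The construction can be sketched as follows, under the assumption that "closest" means smallest absolute difference in feature value (ties broken by sort order); feature_vector_graph is a hypothetical helper name, not one from the patent.

```python
import numpy as np

def feature_vector_graph(f, labels):
    """Adjacency matrix of the feature vector graph of one feature column
    f (a NumPy array) under the class labels of a single clustering run:
    each sample in a cluster of size x is linked to the x-1 samples whose
    values of f are closest to its own."""
    n = len(f)
    A = np.zeros((n, n), dtype=int)
    values, counts = np.unique(labels, return_counts=True)
    cluster_size = dict(zip(values, counts))
    for i in range(n):
        x = cluster_size[labels[i]]
        order = np.argsort(np.abs(f - f[i]))               # closest values first
        neighbours = [j for j in order if j != i][:x - 1]
        A[i, neighbours] = 1                               # undirected edges
        A[neighbours, i] = 1
    return A
```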
Further, in the unsupervised feature selection method of the invention based on conditional mutual information and K-means, step 3) calculates the modularity value of each feature according to the feature vector graphs constructed in step 2). The computing formula is:

$$Q = \sum_{ij}\left[\frac{A_{ij}}{2M} - \frac{k_i k_j}{(2M)(2M)}\right]\delta(C_i, C_j)$$

In the formula, i and j are two points in the feature vector graph constructed in step 2); $A_{ij}$ is the adjacency matrix of the feature vector graph, with $A_{ij}=1$ if there is an edge from i to j and 0 otherwise; M is the total number of edges in the feature vector graph; $k_i$ and $k_j$ are the degrees of nodes i and j, respectively; the binary function $\delta(C_i,C_j)$ is 1 if nodes i and j belong to the same cluster and 0 otherwise. After the modularity value of each feature is calculated from its feature vector graph, all modularity values are normalized to obtain Q', and the feature corresponding to the largest Q' is put into the feature subset.
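Given such an adjacency matrix, Q can be computed directly, as in this minimal sketch (NumPy arrays assumed; it pairs with the hypothetical feature_vector_graph above):

```python
import numpy as np

def modularity(A, labels):
    """Q of the formula above: the extent to which edges of the feature
    vector graph A fall within the clusters given by 'labels'."""
    M = A.sum() / 2.0                          # total number of edges
    k = A.sum(axis=1)                          # node degrees
    same = labels[:, None] == labels[None, :]  # delta(C_i, C_j)
    Q = A / (2 * M) - np.outer(k, k) / (2 * M) ** 2
    return float((Q * same).sum())
```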
Further, in the unsupervised feature selection method of the invention based on conditional mutual information and K-means, step 4), according to the initial feature subset obtained in step 3), calculates the conditional mutual information of each remaining feature relative to each feature in the feature subset, and from it the relevance-independence value of each remaining feature relative to the feature subset. The computing formula is:

$$I_{ri}(f_r; C \mid S) = \sum_{f_j \in S} RI(f_r, f_j)$$

In the formula, $f_r$ is a remaining feature not yet selected into the feature subset, $f_j$ is a feature in the feature subset, and S is the feature subset. $RI(f_r, f_j)$ denotes the relevance independence of the remaining feature $f_r$ relative to one feature $f_j$ in the feature subset, with the computing formula:

$$RI(f_r, f_j) = \frac{I(f_r; C \mid f_j) + I(f_j; C \mid f_r)}{2H(C)}$$

In the formula, H(C) is the entropy of the target variable C, and $I(f_r; C \mid f_j)$ and $I(f_j; C \mid f_r)$ are the conditional mutual informations between features $f_r$ and $f_j$, computed as:

$$I(X_i; Y \mid X_j) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=1}^{C} p(x_i, x_j, y_k)\log\frac{p(x_i, y_k \mid x_j)}{p(x_i \mid x_j)\,p(y_k \mid x_j)}$$

In the formula, N is the number of samples in the data set and C is the number of classes. After the relevance-independence value of each remaining feature relative to the feature subset is calculated, all relevance-independence values are normalized to obtain $I_{ri}'$.
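For discrete (e.g. binned) feature columns, both quantities can be estimated from empirical counts, as in the sketch below; the patent does not fix an estimator for the probabilities, so the discretisation is an assumption.

```python
import numpy as np

def conditional_mutual_information(fr, fj, C, eps=1e-12):
    """Empirical I(fr; C | fj); fr and fj are discretised feature
    columns, C the class labels of one clustering run."""
    cmi = 0.0
    for b in np.unique(fj):
        mask_j = fj == b
        p_j = mask_j.mean()
        for a in np.unique(fr):
            for c in np.unique(C):
                p_abc = np.mean(mask_j & (fr == a) & (C == c))
                if p_abc > eps:
                    p_ab = np.mean(mask_j & (fr == a))
                    p_cb = np.mean(mask_j & (C == c))
                    # p(x,y|z)/(p(x|z)p(y|z)) = p(x,y,z)p(z)/(p(x,z)p(y,z))
                    cmi += p_abc * np.log(p_abc * p_j / (p_ab * p_cb))
    return cmi

def relevance_independence(fr, fj, C):
    """RI(fr, fj): symmetrised conditional mutual information normalised
    by the entropy H(C); C is assumed to hold non-negative integers."""
    p_c = np.bincount(C) / len(C)
    H_C = -np.sum(p_c[p_c > 0] * np.log(p_c[p_c > 0]))
    return (conditional_mutual_information(fr, fj, C) +
            conditional_mutual_information(fj, fr, C)) / (2 * H_C)
```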
Further, in the unsupervised feature selection method of the invention based on conditional mutual information and K-means, step 5) adds the normalized modularity value of each remaining feature obtained in step 3) and the normalized relevance-independence value of each remaining feature obtained in step 4) with a certain weight, i.e., $s = wQ' + (1-w)I_{ri}'$, where w is specified manually with a value range of [0,1]; the result is taken as the score of each remaining feature.
Further, in the unsupervised feature selection method of the invention based on conditional mutual information and K-means, step 6) puts the feature corresponding to the largest s obtained in step 5) into the feature subset, then iterates steps 4), 5), and 6) until the number of features in the feature subset reaches the required number; the required number of features is specified manually.
Further, in the unsupervised feature selection method of the invention based on conditional mutual information and K-means, step 7) aggregates the feature subsets formed in step 6) from the different K-means clustering results and, according to the required number of features, selects the features that occur most often to constitute the final feature subset.
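This aggregation amounts to frequency counting over the per-run subsets, for example:

```python
from collections import Counter

def aggregate_subsets(subsets, n_features):
    """Merge the feature subsets of all clustering runs into the final
    subset by keeping the n_features most frequently selected features."""
    counts = Counter(f for subset in subsets for f in subset)
    return [f for f, _ in counts.most_common(n_features)]
```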
Beneficial effects
Aimed at the problem of feature selection on unlabeled data sets in machine learning, the invention combines the K-means algorithm with the conditional mutual information between features, which helps to select the most important features in unlabeled data sets. By clustering the unlabeled data set with multiple K-means runs under different initial conditions, the method eliminates the randomness of selecting features from a single clustering result, reduces the impact of data imbalance on feature selection, and remedies the defect that conventional feature selection methods either perform poorly on imbalanced data sets or apply only to labeled data sets. Meanwhile, to obtain a feature subset with high relevance and low redundancy, the method jointly considers, on the basis of each clustering, the modularity value of each feature and the conditional mutual information between different features, uses a relevance-independence index between features to select feature combinations with high relevance and low redundancy, and obtains the final feature subset by aggregating the repeatedly extracted feature subsets. The combination of the K-means algorithm and conditional mutual information makes this feature selection algorithm applicable to both balanced and imbalanced unlabeled data sets, improves the relevance of the feature subset, and reduces its redundancy, thereby selecting the most important feature set.
Description of the drawings
Fig. 1 is the flow chart of the unsupervised feature selection method based on conditional mutual information and K-means.
Fig. 2 is an example of constructing feature vector graphs for a data set.
Specific embodiment
The implementation of the technical scheme of the invention is described in further detail below in conjunction with the accompanying drawings:
The unsupervised feature selection method of the invention based on conditional mutual information and K-means is described in further detail in conjunction with the flow chart and an implementation case.
The implementation case performs feature selection on an unlabeled data set using conditional mutual information and the K-means algorithm. As shown in Fig. 1, the method comprises the following steps:
Step 10, perform multiple K-means clusterings with different K values and different cluster centers on the unlabeled data set, and obtain each clustering result;
Step 101, the maximum cluster number MAX and the minimum cluster number MIN of the K-means algorithm are given in advance at the input stage; before each clustering run, a number is randomly chosen in the range [MIN, MAX] as the number of clusters k, and k points are randomly selected in the data set as initial centroids;
Step 102, the total number T of K-means clustering runs is given in advance at the input stage; each execution of the K-means algorithm yields one set of clustering results, i.e., class labels C; the K-means clustering is repeated until the number of runs reaches the preset total, finally yielding T different sets of clustering results;
Step 20, according to the clustering results obtained in the previous step, construct the feature vector graph of each feature for each clustering result in turn;
Step 201, to construct the feature vector graph of a given feature in the data set, with the feature values of the samples under this feature and the class labels known, each sample is first treated as a point; in the data set containing two features shown in Fig. 2, each round dot and square point on the right represents one sample, and the number beside a point indicates the magnitude of the feature value corresponding to that point;
Step 202, if the class in which a sample lies contains x samples in total, the point corresponding to that sample is connected to the x-1 sample points whose feature values are closest to its own; as shown in Fig. 2, sample 1 lies in class C1, which contains 4 samples in total, so the point corresponding to sample 1 is connected to the 3 sample points whose feature values are closest to it, namely samples 2, 7, and 6;
Step 203, performing the operation of step 202 on all samples in the data set under the same feature yields the feature vector graph of that feature;
Step 204, performing the operations of steps 201-203 on all features in the data set yields the feature vector graphs of all features; as shown in Fig. 2, the left side is a data set containing 2 features, the class labels C1 and C2 are obtained after one K-means clustering of step 10, and the right side shows the feature vector graphs corresponding to feature 1 and feature 2, respectively;
Step 30, according to the feature vector graphs constructed in the previous step, calculate the modularity value of each feature, and put the feature with the largest modularity value into the feature subset;
Step 301, calculate the modularity value of each feature according to the formula $Q = \sum_{ij}\left[\frac{A_{ij}}{2M} - \frac{k_i k_j}{(2M)(2M)}\right]\delta(C_i, C_j)$;
Step 302, normalize the modularity values of all features to obtain Q';
Step 303, put the feature corresponding to the largest Q' into the feature subset, and delete it from the remaining features;
Step 40, according to the feature subset obtained in the previous step, calculate the relevance-independence value of each remaining feature relative to the feature subset;
Step 401, calculate the values of $I(f_r; C \mid f_j)$ and $I(f_j; C \mid f_r)$, i.e., the conditional mutual information between each remaining feature and the selected features, according to the conditional mutual information formula $I(X_i; Y \mid X_j) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=1}^{C} p(x_i, x_j, y_k)\log\frac{p(x_i, y_k \mid x_j)}{p(x_i \mid x_j)\,p(y_k \mid x_j)}$;
Step 402, calculate the relevance independence of each remaining feature relative to a given feature in the feature subset according to the formula $RI(f_r, f_j) = \frac{I(f_r; C \mid f_j) + I(f_j; C \mid f_r)}{2H(C)}$;
Step 403, calculate the relevance-independence value of each remaining feature relative to the feature subset according to the formula $I_{ri}(f_r; C \mid S) = \sum_{f_j \in S} RI(f_r, f_j)$;
Step 404, normalize the relevance-independence values of all remaining features to obtain $I_{ri}'$;
Step 50, add the modularity value Q' of each remaining feature obtained in step 30 and the relevance-independence value $I_{ri}'$ of each feature obtained in step 40 with a certain weight, and take the result as the score of each remaining feature;
Step 501, the weight w between the modularity value and the relevance-independence value is preset at the input stage, with a value range of [0,1] and a default setting of 0.3;
Step 502, calculate the score of each remaining feature according to the formula $s = wQ' + (1-w)I_{ri}'$;
Step 60, put the feature with the highest score from the previous step into the feature subset and delete it from the remaining features; repeat steps 40, 50, and 60 until the number of features in the feature subset reaches the required number; the required number of features is preset at the input stage;
Step 70, aggregate the feature subsets formed from the different K-means clustering results obtained in the previous step, select the features that occur most often according to the required number of features, and constitute and output the final feature subset.
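Tying the embodiment together, a hypothetical end-to-end driver built from the sketches above (multiple_kmeans, feature_vector_graph, modularity, relevance_independence, aggregate_subsets) might look as follows; the quartile binning for the probability estimates and the small constants guarding the normalisations are illustrative choices, not values fixed by the patent.

```python
import numpy as np

def select_features(X, n_select, min_k=2, max_k=10, n_runs=10, w=0.3):
    """Steps 10-70 end to end: returns the indices of the final subset."""
    d = X.shape[1]
    # discretise each column into quartile bins for the CMI estimates
    Xd = np.column_stack([
        np.digitize(X[:, j], np.unique(np.quantile(X[:, j], [0.25, 0.5, 0.75])))
        for j in range(d)])
    subsets = []
    for labels in multiple_kmeans(X, min_k, max_k, n_runs):    # step 10
        graphs = [feature_vector_graph(X[:, j], labels) for j in range(d)]
        Q = np.array([modularity(g, labels) for g in graphs])  # steps 20-301
        Qn = (Q - Q.min()) / (Q.max() - Q.min() + 1e-12)       # step 302: Q'
        S = [int(np.argmax(Qn))]                               # step 303
        while len(S) < n_select:                               # steps 40-60
            rest = [j for j in range(d) if j not in S]
            Iri = np.array([sum(relevance_independence(Xd[:, r], Xd[:, s], labels)
                                for s in S) for r in rest])
            Irin = (Iri - Iri.min()) / (Iri.max() - Iri.min() + 1e-12)
            scores = w * Qn[rest] + (1 - w) * Irin             # s = wQ' + (1-w)Iri'
            S.append(rest[int(np.argmax(scores))])
        subsets.append(S)
    return aggregate_subsets(subsets, n_select)                # step 70
```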

Claims (8)

1. An unsupervised feature selection method based on conditional mutual information and K-means, characterized in that it comprises the following steps:
Step 1), perform multiple K-means clusterings with different K values and different cluster centers on the unlabeled data set, and obtain each clustering result;
Step 2), according to the different clustering results obtained in step 1), construct the feature vector graph of each feature for each clustering result in turn;
Step 3), according to the feature vector graphs constructed in step 2), calculate the modularity value of each feature, and put the feature with the largest modularity value into the feature subset;
Step 4), according to the initial feature subset obtained in step 3), calculate the conditional mutual information of each remaining feature relative to each feature in the feature subset, and from it calculate the relevance-independence value of each remaining feature relative to the feature subset;
Step 5), add the modularity value of each remaining feature obtained in step 3) and the relevance-independence value obtained in step 4) with a certain weight, and take the result as the score of each remaining feature;
Step 6), put the feature with the highest score obtained in step 5) into the feature subset, then iterate steps 4), 5), and 6) until the number of features in the feature subset reaches the required number;
Step 7), aggregate the feature subsets formed in step 6) from the different K-means clustering results to obtain the final feature subset.
2. The method of claim 1, characterized in that step 1) performs multiple K-means clusterings with different K values and different cluster centers on the unlabeled data set and obtains each clustering result; at initialization, the maximum cluster number and minimum cluster number of the K-means clustering algorithm, as well as the number of clustering runs, are specified manually; for each clustering run, the K-means algorithm randomly chooses a number between the minimum and maximum cluster numbers as the number of clusters k and randomly selects k points in the data set as initial centroids; the K-means clustering algorithm then yields, run by run, the result of each clustering, i.e., the class labels C.
3. The method of claim 1, characterized in that step 2) constructs the feature vector graph of each feature for each clustering result in turn, according to the different clustering results obtained in step 1); to construct the feature vector graph of a given feature in the data set, with the feature values under this feature and the class labels known, each sample is treated as a point; if the class in which a sample lies contains x samples, the point corresponding to that sample is connected to the x-1 sample points whose feature values are closest to its own; performing the above operation on all samples in the data set under the same feature yields the feature vector graph of that feature.
4. The method of claim 1, characterized in that step 3) calculates the modularity value of each feature according to the feature vector graphs constructed in step 2), with the computing formula:

$$Q = \sum_{ij}\left[\frac{A_{ij}}{2M} - \frac{k_i k_j}{(2M)(2M)}\right]\delta(C_i, C_j)$$

In the formula, i and j are two points in the feature vector graph constructed in step 2); $A_{ij}$ is the adjacency matrix of the feature vector graph, with $A_{ij}=1$ if there is an edge from i to j and 0 otherwise; M is the total number of edges in the feature vector graph; $k_i$ and $k_j$ are the degrees of nodes i and j, respectively; the binary function $\delta(C_i,C_j)$ is 1 if nodes i and j belong to the same cluster and 0 otherwise; after the modularity value of each feature is calculated from its feature vector graph, all modularity values are normalized to obtain Q', and the feature corresponding to the largest Q' is put into the feature subset.
5. The method of claim 1, characterized in that step 4), according to the initial feature subset obtained in step 3), calculates the conditional mutual information of each remaining feature relative to each feature in the feature subset, and from it the relevance-independence value of each remaining feature relative to the feature subset, with the computing formula:

$$I_{ri}(f_r; C \mid S) = \sum_{f_j \in S} RI(f_r, f_j)$$

In the formula, $f_r$ is a remaining feature not yet selected into the feature subset, $f_j$ is a feature in the feature subset, and S is the feature subset; $RI(f_r, f_j)$ denotes the relevance independence of the remaining feature $f_r$ relative to one feature $f_j$ in the feature subset, with the computing formula:

$$RI(f_r, f_j) = \frac{I(f_r; C \mid f_j) + I(f_j; C \mid f_r)}{2H(C)}$$

In the formula, H(C) is the entropy of the target variable C, and $I(f_r; C \mid f_j)$ and $I(f_j; C \mid f_r)$ are the conditional mutual informations between features $f_r$ and $f_j$, with the computing formula:

$$I(X_i; Y \mid X_j) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=1}^{C} p(x_i, x_j, y_k)\log\frac{p(x_i, y_k \mid x_j)}{p(x_i \mid x_j)\,p(y_k \mid x_j)}$$

In the formula, N is the number of samples in the data set and C is the number of classes; after the relevance-independence value of each remaining feature relative to the feature subset is calculated, all relevance-independence values are normalized to obtain $I_{ri}'$.
6. The method of claim 1, characterized in that step 5) adds the normalized modularity value of each remaining feature obtained in step 3) and the normalized relevance-independence value of each remaining feature obtained in step 4) with a certain weight, i.e., $s = wQ' + (1-w)I_{ri}'$, where w is specified manually with a value range of [0,1], and the result is taken as the score of each remaining feature.
7. The method of claim 1, characterized in that step 6) puts the feature corresponding to the largest s obtained in step 5) into the feature subset, then iterates steps 4), 5), and 6) until the number of features in the feature subset reaches the required number, the required number of features being specified manually.
8. The method of claim 1, characterized in that step 7) aggregates the feature subsets formed in step 6) from the different K-means clustering results and, according to the required number of features, selects the features that occur most often to constitute the final feature subset.
CN201610888945.0A 2016-10-11 2016-10-11 Unsupervised feature selection method based on conditional mutual information and K-means Pending CN106503731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610888945.0A CN106503731A (en) Unsupervised feature selection method based on conditional mutual information and K-means

Publications (1)

Publication Number Publication Date
CN106503731A true CN106503731A (en) 2017-03-15

Family

ID=58293652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610888945.0A Pending CN106503731A (en) Unsupervised feature selection method based on conditional mutual information and K-means

Country Status (1)

Country Link
CN (1) CN106503731A (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239798B * 2017-05-24 2020-06-09 武汉大学 Feature selection method for predicting number of software defects
CN107239798A * 2017-05-24 2017-10-10 武汉大学 Feature selection method for software defect number prediction
EP3456673A1 * 2017-08-07 2019-03-20 Otis Elevator Company Predictive elevator condition monitoring using qualitative and quantitative informations
US10737904B2 2017-08-07 2020-08-11 Otis Elevator Company Elevator condition monitoring using heterogeneous sources
CN108363784A * 2018-01-20 2018-08-03 西北工业大学 Public opinion trend estimation method based on text machine learning
CN109506761A * 2018-06-12 2019-03-22 国网四川省电力公司乐山供电公司 Transformer surface vibration feature extraction method
CN109506761B * 2018-06-12 2021-08-27 国网四川省电力公司乐山供电公司 Transformer surface vibration feature extraction method
CN109255368A * 2018-08-07 2019-01-22 平安科技(深圳)有限公司 Method, apparatus, electronic device and storage medium for randomly selecting features
CN109255368B * 2018-08-07 2023-12-22 平安科技(深圳)有限公司 Method, device, electronic equipment and storage medium for randomly selecting characteristics
CN109493929B * 2018-09-20 2022-03-15 北京工业大学 Low redundancy feature selection method based on grouping variables
CN109493929A * 2018-09-20 2019-03-19 北京工业大学 Low redundancy feature selection method based on grouping variables
CN109068180B * 2018-09-28 2021-02-02 武汉斗鱼网络科技有限公司 Method for determining video fine selection set and related equipment
CN109068180A * 2018-09-28 2018-12-21 武汉斗鱼网络科技有限公司 Method and related device for determining a video selection set
CN109816034B * 2019-01-31 2021-08-27 清华大学 Signal characteristic combination selection method and device, computer equipment and storage medium
CN109816034A * 2019-01-31 2019-05-28 清华大学 Signal feature combination selection method and device, computer equipment and storage medium
CN110298398B * 2019-06-25 2021-08-03 大连大学 Wireless protocol frame characteristic selection method based on improved mutual information
CN110298398A * 2019-06-25 2019-10-01 大连大学 Wireless protocol frame feature selection method based on improved mutual information
CN110426612A * 2019-08-17 2019-11-08 福州大学 Two-stage transformer oil-paper insulation time-domain dielectric response feature selection method
CN110942149B * 2019-10-31 2020-09-22 河海大学 Feature variable selection method based on information change rate and conditional mutual information
CN110942149A * 2019-10-31 2020-03-31 河海大学 Feature variable selection method based on information change rate and conditional mutual information
CN117076962A * 2023-10-13 2023-11-17 腾讯科技(深圳)有限公司 Data analysis method, device and equipment applied to artificial intelligence field
CN117076962B * 2023-10-13 2024-01-26 腾讯科技(深圳)有限公司 Data analysis method, device and equipment applied to artificial intelligence field
CN117454314A * 2023-12-19 2024-01-26 深圳航天科创泛在电气有限公司 Wind turbine component running state prediction method, device, equipment and storage medium
CN117454314B * 2023-12-19 2024-03-05 深圳航天科创泛在电气有限公司 Wind turbine component running state prediction method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20170315)