CN105975589A - Feature selection method and device of high-dimension data - Google Patents

Feature selection method and device of high-dimension data

Info

Publication number
CN105975589A
CN105975589A (application CN201610298079.XA)
Authority
CN
China
Prior art keywords
feature
value
mic
maximum information
information coefficient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610298079.XA
Other languages
Chinese (zh)
Inventor
孙广路
宋智超
陈腾
何勇军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN201610298079.XA
Publication of CN105975589A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465: Query processing support for facilitating data mining operations in structured databases
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/283: Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G06F16/284: Relational databases
    • G06F16/285: Clustering or classification

Abstract

The invention discloses a feature selection method and device for high-dimensional data. The method comprises the following steps: obtaining an original data set to be processed, wherein the original data set comprises a feature set, a plurality of samples and a category set, and the category set comprises the category of each sample; calculating the maximum information coefficient (MIC) between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset; and obtaining the effective value of each feature from the MIC and the redundancy value, and selecting the feature subset from the feature set according to the effective value. The MIC is introduced into feature selection, and each feature is evaluated on the basis of the MIC so that features are selected according to the effective value produced by the evaluation. Compared with the prior art, the method can effectively improve the accuracy of feature selection for high-dimensional data.

Description

Feature selection method and device for high-dimensional data
Technical field
The present invention relates to the field of data mining technology, and in particular to a feature selection method and device for high-dimensional data.
Background art
The rapidly developing information society produces massive amounts of data every day, and quickly mining useful information from these data has become an urgent problem. Researchers have addressed this problem from the perspective of machine learning models and have achieved remarkable breakthroughs. However, highly complex models and high-dimensional feature spaces are increasingly unable to meet the pressing demands of big data applications, and feature spaces often contain a large amount of useless information. Only with a suitable feature selection method can effective features be obtained from massive data, thereby improving the efficiency and accuracy of machine learning models in processing the data; feature selection can also prevent model over-fitting and remove noise. Accordingly, as an important preprocessing step in machine learning and data mining, feature selection has always been a research hotspot in the field of machine learning.
In feature selection, the choice of the evaluation measure and of the search algorithm is crucial. Commonly used measures are based on distance, information theory and consistency. Distance-based measures such as the Pearson coefficient can only capture linear relationships between variables, whereas measures such as information gain and mutual information can also capture non-linear relationships. Generating a feature subset usually requires a corresponding search algorithm, and many search strategies based on the approximate Markov blanket condition perform quite well in terms of computational complexity and the classification accuracy of the selected features. However, they also have an obvious shortcoming: they cannot take into account the redundancy between a feature and the feature subset.
Summary of the invention
In view of the defects in the prior art, the present invention provides a feature selection method and device for high-dimensional data. Whereas the measures in current techniques can only quantify limited linear and non-linear relationships between variables, the invention introduces the MIC into feature selection: the MIC can measure a much wider range of linear and non-linear relationships between variables, and can even measure non-functional relationships that cannot be expressed by a single function. Although the MIC is very effective for measuring variables, it can only measure the correlation and redundancy between individual variables. A new measure, mMIC (the effective value), is therefore proposed herein and applied to the Markov blanket condition, in order to solve the problem in the prior art that feature selection accuracy is low because the redundancy between a feature and the feature subset in a high-dimensional data set is difficult to handle.
The present invention proposes a feature selection method for high-dimensional data, comprising:
obtaining an original data set to be processed, the original data set comprising a feature set, a number of samples and a category set, the category set comprising the category of each sample;
calculating the maximum information coefficient MIC between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset;
obtaining the effective value of each feature from the MIC and the redundancy value, and selecting the feature subset from the feature set according to the effective value.
Preferably, the step of calculating the maximum information coefficient MIC between each feature in the feature set and the category set specifically comprises:
calculating, by formula (1), the maximum information coefficient MIC between each feature in the feature set and the category set:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }    (1)
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of samples, x is the number of segments into which the feature values are divided, y is the number of segments into which the sample categories are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the categories under an x*y grid partition.
Preferably, the step of obtaining the effective value of each feature from the MIC and the redundancy value specifically comprises:
obtaining, by formula (2), the effective value of each feature from the MIC and the redundancy value:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue    (2)
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote the features f_i and f_j respectively, c is the category set, and (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i) is the redundancy value.
Preferably, before the step of obtaining the effective value of each feature from the MIC and the redundancy value, the method further comprises:
defining the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, the step of obtaining the effective value of each feature from the MIC and the redundancy value and selecting the feature subset from the feature set according to the effective value specifically comprises:
choosing features from the feature set one by one in descending order of the MIC, and deleting each chosen feature from the feature set;
obtaining the effective value of the chosen feature from its maximum information coefficient MIC and redundancy value, and judging whether the effective value is greater than or equal to a preset threshold; if so, adding the feature to the optimal subset.
Preferably, the step of obtaining the effective value of each feature from the MIC and the redundancy value and selecting the feature subset from the feature set according to the effective value further comprises:
screening out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and obtaining the effective value of each screened-out feature according to formula (2);
judging whether the effective value of each screened-out feature is greater than or equal to the preset threshold; if not, deleting the screened-out feature from the feature set, and choosing the next feature from the feature set.
The present invention also proposes a feature selection device for high-dimensional data, characterized by comprising:
an acquisition module, configured to obtain an original data set to be processed, the original data set comprising a feature set, a number of samples and a category set, the category set comprising the category of each sample;
a processing module, configured to calculate the maximum information coefficient MIC between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset;
a selection module, configured to obtain the effective value of each feature from the MIC and the redundancy value, and to select the feature subset from the feature set according to the effective value.
Preferably, the processing module is specifically configured to calculate, by formula (1), the maximum information coefficient MIC between each feature in the feature set and the category set:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }    (1)
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of samples, x is the number of segments into which the feature values are divided, y is the number of segments into which the sample categories are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the categories under an x*y grid partition.
Preferably, the selection module is specifically configured to obtain, by formula (2), the effective value of each feature from the MIC and the redundancy value:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue    (2)
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote the features f_i and f_j respectively, c is the category set, and (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i) is the redundancy value.
Preferably, the device further comprises a predefinition module;
the predefinition module is configured to define, before the effective value of each feature is obtained from the MIC and the redundancy value, the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, the selection module is further configured to choose features from the feature set one by one in descending order of the MIC and to delete each chosen feature from the feature set; to obtain the effective value of the chosen feature from its maximum information coefficient MIC and redundancy value; and to judge whether the effective value is greater than or equal to the preset threshold, and if so, to add the feature to the optimal subset.
Preferably, the selection module is further configured to screen out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and to obtain the effective value of each screened-out feature according to formula (2); and to judge whether the effective value of each screened-out feature is greater than or equal to the preset threshold, and if not, to delete the screened-out feature from the feature set and choose the next feature from the feature set.
As can be seen from the above technical solutions, the feature selection method for high-dimensional data proposed by the present invention introduces the maximum information coefficient into feature selection and performs feature selection on high-dimensional data on that basis, thereby overcoming the shortcoming of the prior art that only the correlation and redundancy between two features can be considered, and improving the classification accuracy of the selected features.
Brief description of the drawings
The features and advantages of the present invention can be understood more clearly with reference to the accompanying drawings, which are schematic and should not be construed as limiting the present invention in any way. In the drawings:
Fig. 1 is a schematic flow chart of a feature selection method for high-dimensional data proposed by an embodiment of the present invention;
Fig. 2 is a schematic flow chart of a feature selection method for high-dimensional data proposed by another embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a feature selection device for high-dimensional data proposed by an embodiment of the present invention.
Detailed description of the invention
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a feature selection method for high-dimensional data proposed by an embodiment of the present invention. Referring to Fig. 1, the feature selection method for high-dimensional data comprises:
110, obtaining an original data set to be processed, the original data set comprising a feature set, a number of samples and a category set, the category set comprising the category of each sample;
120, calculating the maximum information coefficient MIC between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset;
130, obtaining the effective value of each feature from the MIC and the redundancy value, and selecting the feature subset from the feature set according to the effective value.
The present invention introduces the maximum information coefficient into feature selection and performs feature selection on high-dimensional data on that basis, which solves the problem that feature selection accuracy is low because the redundancy between a feature and the feature subset in a high-dimensional data set is difficult to handle, and improves the classification accuracy of the selected features.
In this embodiment, the process of calculating the MIC in step 120 specifically comprises:
calculating, by formula (1), the maximum information coefficient MIC between each feature in the feature set and the category set:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }    (1)
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of samples, x is the number of segments into which the feature values are divided, y is the number of segments into which the sample categories are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the categories under an x*y grid partition.
In this embodiment, step 130 specifically comprises:
obtaining, by formula (2), the effective value of each feature from the MIC and the redundancy value:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue    (2)
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote the features f_i and f_j respectively, c is the category set, and (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i) is the redundancy value.
In this embodiment, before step 130, the method further comprises:
defining the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, step 130 specifically comprises:
choosing features from the feature set one by one in descending order of the MIC, and deleting each chosen feature from the feature set;
obtaining the effective value of the chosen feature from its maximum information coefficient MIC and redundancy value, and judging whether the effective value is greater than or equal to a preset threshold; if so, adding the feature to the optimal subset;
screening out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and obtaining the effective value of each screened-out feature according to formula (2);
judging whether the effective value of each screened-out feature is greater than or equal to the preset threshold; if not, deleting the screened-out feature from the feature set, and choosing the next feature from the feature set, until the feature set F is empty.
Fig. 2 is a schematic flow chart of a feature selection method for high-dimensional data proposed by another embodiment of the present invention. The principle of the present invention is described in detail below with reference to Fig. 2:
The method comprises an initialization phase and a feature deletion phase.
1) The initialization phase comprises:
S1, a given data set D has m features and n samples; the feature set it comprises is F = {f_1, f_2, ..., f_m}, and the category set c = {c_1, c_2, ..., c_n} contains the category of each sample in the data set. Perform data preprocessing, set the optimal feature subset S to the empty set, and set the parameter θ, where θ is the aforementioned preset threshold;
2) The feature deletion phase comprises the steps of:
S2, calculating the maximum information coefficient between each feature in the feature set and the category set, and sorting the features in descending order of MIC(c; f_i), where f_i is the i-th feature and i is greater than 0 and less than or equal to m;
S3, processing the feature set according to the approximate Markov blanket condition and the effective-value evaluation function mMIC proposed by the present invention, and deleting irrelevant and redundant features to obtain the final feature subset.
Preferably, step S1 specifically comprises:
S11, performing data preprocessing on the data set D to obtain the required file format;
S12, initializing the optimal feature subset S to the empty set, and initializing the parameter θ.
Preferably, step S2 specifically comprises:
S21, for any feature f_i in the feature set F, calculating the maximum information coefficient MIC(c; f_i) between the feature and the category set;
S22, sorting the features in descending order of MIC(c; f_i).
Preferably, the approximate Markov blanket condition in step S3 is defined as follows:
for two features f_i and f_j (i ≠ j, j greater than 0 and less than or equal to m) and the category c, f_i is an approximate Markov blanket of f_j if:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j).
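For illustration, the condition can be expressed as a small predicate. A minimal Python sketch, assuming the three MIC values have already been computed (the function and argument names are illustrative, not from the patent):

```python
def is_approx_markov_blanket(mic_fi_c, mic_fj_c, mic_fi_fj):
    """Return True if f_i is an approximate Markov blanket of f_j.

    mic_fi_c:  MIC(f_i, c), correlation of f_i with the category c
    mic_fj_c:  MIC(f_j, c), correlation of f_j with the category c
    mic_fi_fj: MIC(f_i, f_j), correlation between the two features
    """
    return mic_fi_c > mic_fj_c and mic_fj_c < mic_fi_fj
```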
The maximum information coefficient is computed as follows:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1. In general, B(n) = n^0.6 gives the best results. x and y denote the numbers of segments into which the value ranges of the two variables are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the two variables under an x*y grid partition.
M(D)_{x,y} is computed as follows:
M(D)_{x,y} = MI*(D, x, y) / log min{x, y}
where MI*(D, x, y) denotes the maximum mutual information under an x*y grid partition, computed as follows:
MI*(D, x, y) = max MI(D|G)
where D|G denotes the data set D partitioned by an x*y grid G, over which the mutual information of each grid partition is then computed. The mutual information in the formula is computed as follows:
MI(A, B) = Σ_{a,b} p(a, b) log( p(a, b) / (p(a) p(b)) )
where A = {a_i, i = 1...n} and B = {b_i, i = 1...n}.
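To make the chain MIC(D) -> M(D)_{x,y} -> MI* concrete, the following Python sketch computes a simplified MIC for two 1-D arrays. It is illustrative only: it uses equal-width bins for every x*y grid, whereas the true MIC also optimizes the placement of the grid boundaries; the function name mic and the parameter eps are assumptions, with B(n) = n^0.6 corresponding to eps = 0.4 as suggested above.

```python
import numpy as np

def mic(a, b, eps=0.4):
    """Simplified maximum information coefficient of two 1-D arrays.

    Searches all equal-width x*y grids with x*y <= B(n) = n**(1 - eps)
    and returns the largest normalized mutual information found.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    B = int(n ** (1 - eps))                      # grid-size budget B(n)
    best = 0.0
    for x in range(2, B + 1):
        for y in range(2, B // x + 1):           # keeps x * y <= B(n)
            # joint distribution of (a, b) under an x*y equal-width grid
            p_ab, _, _ = np.histogram2d(a, b, bins=(x, y))
            p_ab /= n
            p_a = p_ab.sum(axis=1, keepdims=True)    # marginal of a
            p_b = p_ab.sum(axis=0, keepdims=True)    # marginal of b
            nz = p_ab > 0                            # avoid log(0)
            mi = np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz]))
            best = max(best, mi / np.log(min(x, y))) # M(D)_{x,y}
    return best
```

For example, mic(np.arange(100), 2 * np.arange(100)) evaluates to 1.0 (up to floating point), the maximum, since the relationship is exactly linear.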
Preferably, the evaluation function mMIC in step S3, which is based on the maximum information coefficient, can measure both the correlation between a feature and the categories and the correlation between a feature and the feature subset, and thereby judge the quality of a feature.
The mMIC evaluation function is computed as follows:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue
where S_main is the currently selected feature subset and S_residue is the remaining feature subset; for simplicity of notation, i and j denote the features f_i and f_j respectively. The formula expresses that the quality of a feature f_j chosen from the remaining feature subset is determined jointly by its correlation with the category set and by its redundancy with the currently selected feature subset.
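In code, the effective value is the relevance term minus the average pairwise MIC with the already-selected subset. A minimal sketch, assuming the needed MIC values are available (the names m_mic, mic_with_class and mic_pair are illustrative):

```python
def m_mic(j, S_main, mic_with_class, mic_pair):
    """Effective value mMIC(j) of a candidate feature j from S_residue.

    S_main:         indices of the currently selected features
    mic_with_class: mapping j -> MIC(f_j, c)
    mic_pair:       function (i, j) -> MIC(f_i, f_j)
    """
    relevance = mic_with_class[j]
    if not S_main:                 # empty subset: no redundancy term yet
        return relevance
    redundancy = sum(mic_pair(i, j) for i in S_main) / len(S_main)
    return relevance - redundancy
```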
Preferably, step S3 comprises the following steps (a sketch of the whole loop is given after this list):
S31, repeating the operations below until F is the empty set:
a. selecting from the feature set F the feature with the largest MIC(c; f_i) value;
b. deleting the feature f_i from the feature set F; if it is in the redundant subset S_re, computing the mMIC value of the feature, and if the mMIC value is less than θ, returning to step a; otherwise adding f_i directly to the optimal subset S and continuing to step c with f_i as the host element;
c. searching the feature set F for all elements for which the host element f_i selected in step a is an approximate Markov blanket, adding each selected feature f_j to S_re and computing the mMIC values of all the selected elements; if the mMIC value of a feature f_j is less than θ, deleting f_j from F;
d. when the above process ends, the output feature subset S is the optimal feature subset.
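Read as an algorithm, steps a to d amount to the loop sketched below. This is one possible reading under stated assumptions, not the patent's reference implementation; mic_fn stands for any routine returning the MIC of two arrays, such as the mic() sketch above.

```python
def select_features(X, c, theta, mic_fn):
    """Sketch of the S31 loop; returns the indices of the selected subset S.

    X:      list of 1-D feature arrays (the feature set F)
    c:      1-D array of category labels
    theta:  threshold for the effective value mMIC
    mic_fn: function (a, b) -> MIC value
    """
    F = set(range(len(X)))
    S, S_re = [], set()                        # optimal and redundant subsets
    mic_c = {i: mic_fn(X[i], c) for i in F}    # MIC(c; f_i) for every feature

    def m_mic(j):                              # effective value of feature j
        if not S:
            return mic_c[j]
        return mic_c[j] - sum(mic_fn(X[i], X[j]) for i in S) / len(S)

    while F:
        i = max(F, key=lambda k: mic_c[k])     # a. largest MIC(c; f_i)
        F.remove(i)                            # b. delete it from F
        if i in S_re and m_mic(i) < theta:
            continue                           #    redundant and weak: back to a
        S.append(i)                            #    keep f_i as host element
        for j in list(F):                      # c. elements blanketed by f_i
            if mic_c[i] > mic_c[j] and mic_c[j] < mic_fn(X[i], X[j]):
                S_re.add(j)
                if m_mic(j) < theta:
                    F.remove(j)                #    weak redundant feature: delete
    return S                                   # d. S is the optimal feature subset
```

Note that in this reading a feature enters S_re only once a stronger host element blankets it, so a feature that is never blanketed is kept on its relevance alone, which matches step b.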
In summary, by incorporating mMIC into the approximate Markov blanket model, the present invention enables the approximate Markov blanket condition to weigh both the strength of the correlation between an individual feature and the category and the redundancy between that feature and the feature subset, and thereby to determine whether a feature is kept or discarded. This guarantees both the efficiency of feature selection under the approximate Markov blanket condition and the accuracy of the selected features.
Fig. 3 is a schematic structural diagram of a feature selection device for high-dimensional data proposed by an embodiment of the present invention. Referring to Fig. 3, the device comprises:
an acquisition module 310, configured to obtain an original data set to be processed, the original data set comprising a feature set, a number of samples and a category set, the category set comprising the category of each sample;
a processing module 320, configured to calculate the maximum information coefficient MIC between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset;
a selection module 330, configured to obtain the effective value of each feature from the MIC and the redundancy value, and to select the feature subset from the feature set according to the effective value.
The present invention introduces the maximum information coefficient into feature selection and performs feature selection on high-dimensional data on that basis, which overcomes the shortcoming of the prior art that only the correlation and redundancy between two features can be considered, and improves the classification accuracy of the selected features.
As the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments.
In a possible embodiment, the processing module 320 is specifically configured to calculate, by formula (1), the maximum information coefficient MIC between each feature in the feature set and the category set:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }    (1)
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of samples, x is the number of segments into which the feature values are divided, y is the number of segments into which the sample categories are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the categories under an x*y grid partition.
As the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments.
In a possible embodiment, the selection module 330 is specifically configured to obtain, by formula (2), the effective value of each feature from the MIC and the redundancy value:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue    (2)
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote the features f_i and f_j respectively, c is the category set, and (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i) is the redundancy value.
As the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments.
In a possible embodiment, the device further comprises a predefinition module 340;
the predefinition module 340 is configured to define, before the effective value of each feature is obtained from the MIC and the redundancy value, the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, the selection module 330 is further configured to choose features from the feature set one by one in descending order of the MIC and to delete each chosen feature from the feature set; to obtain the effective value of the chosen feature from its maximum information coefficient MIC and redundancy value, and to judge whether the effective value is greater than or equal to the preset threshold, and if so, to add the feature to the optimal subset; to screen out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and to obtain the effective value of each screened-out feature according to formula (2); and to judge whether the effective value of each screened-out feature is greater than or equal to the preset threshold, and if not, to delete the screened-out feature from the feature set and choose the next feature from the feature set, until the feature set is empty.
As the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments.
In a possible embodiment, the selection module 330 is further configured to choose the next feature from the feature set when the effective value of the current feature is less than the preset threshold.
As the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments.
It should be noted that the components of the device of the present invention are logically divided according to the functions to be realized; however, the present invention is not limited thereto, and the components may be re-divided or combined as required.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. In the device, a PC may remotely control the equipment or device via the Internet and accurately control each operation step of the equipment or device. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, and the files or documents generated by the program may be statistically analysed to produce data reports, cpk reports and the like, allowing batch testing and statistics of power amplifiers. It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprises" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a claim enumerating several units of a device, several of these units may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words may be interpreted as names.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the present invention, and such modifications and variations all fall within the scope defined by the appended claims.

Claims (10)

1. A feature selection method for high-dimensional data, characterized by comprising:
obtaining an original data set to be processed, the original data set comprising a feature set, a number of samples and a category set, the category set comprising the category of each sample;
calculating the maximum information coefficient MIC between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset;
obtaining the effective value of each feature from the MIC and the redundancy value, and selecting the feature subset from the feature set according to the effective value.
2. The method according to claim 1, characterized in that the step of calculating the maximum information coefficient MIC between each feature in the feature set and the category set specifically comprises:
calculating, by formula (1), the maximum information coefficient MIC between each feature in the feature set and the category set:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }    (1)
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of samples, x is the number of segments into which the feature values are divided, y is the number of segments into which the sample categories are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the categories under an x*y grid partition.
3. The method according to claim 1, characterized in that the step of obtaining the effective value of each feature from the MIC and the redundancy value specifically comprises:
obtaining, by formula (2), the effective value of each feature from the MIC and the redundancy value:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue    (2)
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote the features f_i and f_j respectively, c is the category set, and (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i) is the redundancy value.
4. The method according to claim 3, characterized in that before the step of obtaining the effective value of each feature from the MIC and the redundancy value, the method further comprises:
defining the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
correspondingly, the step of obtaining the effective value of each feature from the MIC and the redundancy value and selecting the feature subset from the feature set according to the effective value specifically comprises:
choosing features from the feature set one by one in descending order of the MIC, and deleting each chosen feature from the feature set;
obtaining the effective value of the chosen feature from its maximum information coefficient MIC and redundancy value, and judging whether the effective value is greater than or equal to a preset threshold; if so, adding the feature to the optimal subset.
5. The method according to claim 4, characterized in that the step of obtaining the effective value of each feature from the MIC and the redundancy value and selecting the feature subset from the feature set according to the effective value further comprises:
screening out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and obtaining the effective value of each screened-out feature according to formula (2);
judging whether the effective value of each screened-out feature is greater than or equal to the preset threshold; if not, deleting the screened-out feature from the feature set, and choosing the next feature from the feature set.
6. A feature selection device for high-dimensional data, characterized by comprising:
an acquisition module, configured to obtain an original data set to be processed, the original data set comprising a feature set, a number of samples and a category set, the category set comprising the category of each sample;
a processing module, configured to calculate the maximum information coefficient MIC between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset;
a selection module, configured to obtain the effective value of each feature from the MIC and the redundancy value, and to select the feature subset from the feature set according to the effective value.
7. The device according to claim 6, characterized in that the processing module is specifically configured to calculate, by formula (1), the maximum information coefficient MIC between each feature in the feature set and the category set:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }    (1)
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of samples, x is the number of segments into which the feature values are divided, y is the number of segments into which the sample categories are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the categories under an x*y grid partition.
8. The device according to claim 6, characterized in that the selection module is specifically configured to obtain, by formula (2), the effective value of each feature from the MIC and the redundancy value:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue    (2)
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote the features f_i and f_j respectively, c is the category set, and (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i) is the redundancy value.
9. The device according to claim 8, characterized in that the device further comprises a predefinition module;
the predefinition module is configured to define, before the effective value of each feature is obtained from the MIC and the redundancy value, the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
correspondingly, the selection module is further configured to choose features from the feature set one by one in descending order of the MIC and to delete each chosen feature from the feature set; to obtain the effective value of the chosen feature from its maximum information coefficient MIC and redundancy value; and to judge whether the effective value is greater than or equal to the preset threshold, and if so, to add the feature to the optimal subset.
10. The device according to claim 9, characterized in that the selection module is further configured to screen out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and to obtain the effective value of each screened-out feature according to formula (2); and to judge whether the effective value of each screened-out feature is greater than or equal to the preset threshold, and if not, to delete the screened-out feature from the feature set and choose the next feature from the feature set.
CN201610298079.XA 2016-05-06 2016-05-06 Feature selection method and device of high-dimension data Pending CN105975589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610298079.XA CN105975589A (en) 2016-05-06 2016-05-06 Feature selection method and device of high-dimension data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610298079.XA CN105975589A (en) 2016-05-06 2016-05-06 Feature selection method and device of high-dimension data

Publications (1)

Publication Number Publication Date
CN105975589A (en) 2016-09-28

Family

ID=56991294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610298079.XA Pending CN105975589A (en) 2016-05-06 2016-05-06 Feature selection method and device of high-dimension data

Country Status (1)

Country Link
CN (1) CN105975589A (en)


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570178B (en) * 2016-11-10 2020-09-29 重庆邮电大学 High-dimensional text data feature selection method based on graph clustering
CN106570178A (en) * 2016-11-10 2017-04-19 重庆邮电大学 High-dimension text data characteristic selection method based on graph clustering
CN107478963A (en) * 2017-09-30 2017-12-15 山东海兴电力科技有限公司 Single-phase ground fault line selecting method of small-electric current grounding system based on power network big data
CN109101626A (en) * 2018-08-13 2018-12-28 武汉科技大学 Based on the high dimensional data critical characteristic extraction method for improving minimum spanning tree
CN109443766A (en) * 2018-09-10 2019-03-08 中国人民解放军火箭军工程大学 A kind of heavy-duty vehicle gearbox gear Safety Analysis Method
CN110334546A (en) * 2019-07-08 2019-10-15 辽宁工业大学 Difference privacy high dimensional data based on principal component analysis optimization issues guard method
CN110334546B (en) * 2019-07-08 2021-11-23 辽宁工业大学 Difference privacy high-dimensional data release protection method based on principal component analysis optimization
CN110647943A (en) * 2019-09-26 2020-01-03 西北工业大学 Cutting tool wear monitoring method based on evolutionary data cluster analysis
CN111016914A (en) * 2019-11-22 2020-04-17 华东交通大学 Dangerous driving scene identification system based on portable terminal information and identification method thereof
CN111016914B (en) * 2019-11-22 2021-04-06 华东交通大学 Dangerous driving scene identification system based on portable terminal information and identification method thereof
CN111312403A (en) * 2020-01-21 2020-06-19 山东师范大学 Disease prediction system, device and medium based on instance and feature sharing cascade
CN112465251A (en) * 2020-12-08 2021-03-09 上海电力大学 Short-term photovoltaic output probability prediction method based on simplest gated neural network
CN115729957A (en) * 2022-11-28 2023-03-03 安徽大学 Unknown stream feature selection method and device based on maximum information coefficient
CN115729957B (en) * 2022-11-28 2024-01-19 安徽大学 Unknown stream feature selection method and device based on maximum information coefficient
CN116305292A (en) * 2023-05-17 2023-06-23 中国电子科技集团公司第十五研究所 Government affair data release method and system based on differential privacy protection
CN116305292B (en) * 2023-05-17 2023-08-08 中国电子科技集团公司第十五研究所 Government affair data release method and system based on differential privacy protection


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20160928)