CN108509996A - Feature selection approach based on Filter and Wrapper selection algorithms - Google Patents


Info

Publication number
CN108509996A
CN108509996A (application CN201810287707.3A)
Authority
CN
China
Prior art keywords
feature
subset
error
current feature
variance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810287707.3A
Other languages
Chinese (zh)
Inventor
廖伟智 (Liao Weizhi)
严伟军 (Yan Weijun)
阴艳超 (Yin Yanchao)
张强 (Zhang Qiang)
曹奕翎 (Cao Yiling)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201810287707.3A priority Critical patent/CN108509996A/en
Publication of CN108509996A publication Critical patent/CN108509996A/en
Current legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a feature selection method based on Filter and Wrapper selection algorithms. The method comprises: importing the full feature set; screening out the non-divergent features with the variance method; screening out redundant features with the Pearson correlation coefficient method; generating new feature subsets with a feature space search method; training a learning model with a neural network; constructing an evaluation criterion for feature subsets; and outputting the selected feature subset. The invention combines the advantages of Filter selection algorithms and Wrapper selection algorithms, exploiting their complementary characteristics to improve algorithmic efficiency while reducing computational cost.

Description

Feature selection approach based on Filter and Wrapper selection algorithms
Technical field
The invention belongs to the field of machine learning, and in particular relates to a feature selection method based on Filter and Wrapper selection algorithms.
Background technology
Feature selection is one of the important research topics in machine learning, pattern recognition, statistics, and related fields. Feature selection refers to choosing the feature set that gives the corresponding model and algorithm the best performance. When building a machine learning system one can usually collect data of very many dimensions, but once the feature dimensionality reaches a certain magnitude, putting all features into the algorithm brings the curse of dimensionality: with limited computing capacity the algorithm becomes extremely difficult to converge within an acceptable time, and computation may even overflow. Faced with this problem, feature selection becomes particularly important.
In general, the data and features determine the upper bound of what machine learning can achieve, and models and algorithms only approach this bound; feature selection therefore occupies a considerable position in machine learning. With the application of computer technology in every field of human society, such as social networks and combinatorial chemistry, more and more datasets have feature spaces of thousands of dimensions. In practice, however, the features that truly characterize the essence of things are only a small fraction of these, and this small fraction is buried among a large number of irrelevant and redundant features, which severely degrades the performance of machine learning algorithms. To date, researchers have combined feature selection with machine learning and applied it widely in fields such as video event monitoring and biomedical diagnosis.
According to how they are combined with the machine learning algorithm, feature selection algorithms can be divided into three classes: Filter methods, Wrapper methods, and Embedded methods. By the output form of the feature selection algorithm, they can be divided into two broad classes: feature ranking (Feature Ranking) and subset selection (Subset Selection). Feature ranking orders features by importance according to some criterion (such as distance or correlation) and then, for example by setting a threshold, passes the top-ranked features to the subsequent learning algorithm as input; the Relief family of feature selection algorithms belongs to this class. Feature subset selection chooses, from the given feature space, the optimal feature subset consistent with a given model.
Wrapper methods first fix a chosen learning algorithm and then compare feature subsets by the algorithm's performance: feature subsets are searched by successive heuristics, each searched subset is used to train the learning model, and the subsets are judged by the quality of the resulting model.
In Embedded methods, feature selection is built into the machine learning algorithm itself, and the feature selection process is completed synchronously with model learning. For example, in decision tree learning, when the algorithm chooses the splitting feature it uses an information gain criterion to select the feature with the strongest discriminative power.
Filter methods are independent of the specific learning algorithm, as shown in Fig. 2. Each feature is scored by its divergence or its correlation, and feature selection is completed by setting a threshold or the number of features to keep. The main methods include the variance method, the correlation coefficient method, and the mutual information method.
The prior art has the following defects:
1. Embedded methods depend on the specific machine learning algorithm, so their applicability is limited.
2. Wrapper selection algorithms are inefficient, because every candidate feature subset must be evaluated by training a model. In this paradigm, since exhaustively enumerating all feature subsets is infeasible, heuristic search is widely used to speed up the search for feature subsets, but such search does not guarantee optimality. For example, with sequential forward selection on a feature set {X1, X2, ..., Xn}: the n single-feature subsets are evaluated and, say, {X2} is found optimal, so {X2} becomes the first-round candidate set; then one feature is added to the previously selected set to form two-feature candidate subsets, and suppose among these n-1 candidates {X2, X3} is found optimal, so {X2, X3} becomes this round's selected set; if in the third round X5 beats X6, the selected set becomes {X2, X3, X5}; in the fourth round, however, {X2, X3, X6, X9} may be better than every {X2, X3, X5, Xi} (a minimal sketch of this greedy procedure is given after this list).
3. Filter selection algorithms are more efficient, but they have difficulty removing redundant features completely, and they easily discard internally dependent features as if they were redundant: some features have low discriminative power when treated individually but strong discriminative power when taken together.
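As an illustration of the greedy forward search described in defect 2, the following is a minimal Python sketch of sequential forward selection; the scoring callable `evaluate` is a hypothetical stand-in for training and scoring a model on a candidate subset, and is not part of the patent.

```python
def forward_selection(features, evaluate, k_max):
    """Greedy sequential forward selection (sketch).

    features: list of feature names.
    evaluate: callable mapping a feature subset (tuple) to a score,
              higher is better; stands in for model training.
    k_max:    maximum subset size to grow to.
    """
    selected = []
    best_score = float("-inf")
    while len(selected) < k_max:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Score each one-feature extension of the current subset.
        scored = [(evaluate(tuple(selected + [f])), f) for f in candidates]
        score, best_f = max(scored)
        if score <= best_score:
            break  # no extension improves on the current subset
        selected.append(best_f)
        best_score = score
    return selected, best_score

# Example call (evaluate is hypothetical):
# subset, score = forward_selection(["X1", "X2", "X3", "X4"], evaluate, 3)
```

Because each round keeps only the single best extension, the procedure can lock in {X2, X3, X5} even when a different fourth-round set such as {X2, X3, X6, X9} would have been better, which is exactly the non-optimality described above.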
Summary of the invention
The object of the invention is as follows: in order to solve the above problems in the prior art, the present invention proposes a feature selection method based on Filter and Wrapper selection algorithms.
The technical scheme of the invention is as follows. A feature selection method based on Filter and Wrapper selection algorithms, comprising the following steps:
A. Import the full feature set and set the initial parameters;
B. Compute the mean and variance of each feature in the dataset with the variance method, and remove the features that do not diverge;
C. Compute the Pearson correlation coefficient between each feature variable remaining after step B and the target variable with the Pearson correlation coefficient method, and remove redundant features;
D. Take all features remaining after step C as the complete feature space and process it with the improved LVW feature selection algorithm;
E. Generate a new feature subset with the feature space search method;
F. Train a learning model with a neural network;
G. Compute, by cross-validation, the error that the new feature subset generated in step E produces on the learning model generated in step F;
H. Compare the error obtained in step G against the error of the current feature subset, which serves as the evaluation criterion;
I. Judge whether the current feature subset is empty; if so, increase the variance threshold by one variance step and the correlation coefficient threshold by one correlation coefficient step, and return to step B; if not, obtain the feature subset that completes the selection.
Further, the mean and variance of each feature in the dataset in the step B are calculated with the variance method as:

$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$

where $\bar{X}$ denotes the mean of the feature, $X_i$ denotes the i-th value of the feature, $S^2$ denotes the variance of the feature, and $n$ denotes the number of feature values.
Further, removing the features that do not diverge in the step B is specifically: judging whether the variance of a feature is less than the set variance threshold; if so, deleting the feature from the dataset; if not, retaining the feature.
Further, the Pearson correlation coefficient between each feature variable remaining after the screening of step B and the target variable in the step C is calculated with the Pearson correlation coefficient method as:

$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$

where $\mathrm{cov}(X, Y)$ denotes the covariance of feature variable X and target variable Y, $\rho_{X,Y}$ denotes the Pearson correlation coefficient of X and Y, and $\sigma_X$ and $\sigma_Y$ denote the standard deviations of X and Y respectively.
Further, removing redundant features in the step C is specifically: judging whether the Pearson correlation coefficient between a feature and the target variable is less than the set correlation coefficient threshold; if so, deleting the feature from the dataset; if not, retaining the feature.
Further, the step E generates a new feature subset with the feature space search method, specifically: using a random search method, judging whether the count parameter exceeds the set stop condition control parameter; if so, proceeding to step I; if not, proceeding to the next step.
Further, the step H, in which the error of the current feature subset serves as the evaluation criterion for comparison against the error obtained in step G, specifically includes the following sub-steps:
H1. Judging whether the error of the new feature subset is less than the error of the current feature subset and the feature count of the new subset is less than the maximum feature count allowed by the learning model; if so, resetting the count parameter to zero, setting the error of the current feature subset to the error of the new subset, setting the feature count of the current subset to the feature count of the new subset, and taking the new subset as the current feature subset; if not, incrementing the count parameter by 1 and returning to step E;
H2. Judging whether the error of the new feature subset equals the error of the current feature subset and the feature count of the new subset is less than the feature count of the current subset; if so, resetting the count parameter to zero, setting the error of the current feature subset to the error of the new subset, setting the feature count of the current subset to the feature count of the new subset, and taking the new subset as the current feature subset; if not, incrementing the count parameter by 1 and returning to step E.
The beneficial effects of the invention are as follows:
(1) The two-stage feature selection method of the invention can provide feature selection support for a variety of learning algorithms;
(2) The invention combines the advantages of Filter selection algorithms and Wrapper selection algorithms and exploits their complementary characteristics, improving algorithmic efficiency while reducing computational cost;
(3) The invention improves the LVW algorithm, avoiding overfitting of the trained model.
Description of the drawings
Fig. 1 is a schematic flowchart of the feature selection method based on Filter and Wrapper selection algorithms of the present invention.
Fig. 2 is a detailed flowchart of the feature selection method based on Filter and Wrapper selection algorithms in an embodiment of the present invention.
Detailed description of the embodiments
In order to make the purpose, technical scheme, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.
Fig. 1 shows a schematic flowchart of the feature selection method based on Filter and Wrapper selection algorithms of the present invention. A feature selection method based on Filter and Wrapper selection algorithms comprises the following steps:
A. Import the full feature set and set the initial parameters;
B. Compute the mean and variance of each feature in the dataset with the variance method, and remove the features that do not diverge;
C. Compute the Pearson correlation coefficient between each feature variable remaining after step B and the target variable with the Pearson correlation coefficient method, and remove redundant features;
D. Take all features remaining after step C as the complete feature space and process it with the improved LVW feature selection algorithm;
E. Generate a new feature subset with the feature space search method;
F. Train a learning model with a neural network;
G. Compute, by cross-validation, the error that the new feature subset generated in step E produces on the learning model generated in step F;
H. Compare the error obtained in step G against the error of the current feature subset, which serves as the evaluation criterion;
I. Judge whether the current feature subset is empty; if so, increase the variance threshold by one variance step and the correlation coefficient threshold by one correlation coefficient step, and return to step B; if not, obtain the feature subset that completes the selection.
Fig. 2 shows a detailed flowchart of the feature selection method based on Filter and Wrapper selection algorithms in an embodiment of the present invention.
In an optional embodiment of the present invention, setting the initial parameters in the above step A includes: the variance threshold A of the variance method and the variance step a; the correlation coefficient threshold B of the Pearson correlation coefficient method and the correlation coefficient step b; the stop condition control parameter T, with the count parameter t initialized to zero; the maximum error Emax allowed for the learning model; the maximum feature count Nmax allowed by the model, set in order to avoid overfitting of the trained model; the intermediate parameter k, initialized to Nmax; and the intermediate parameter E, initialized to Emax. These parameters are sketched as a configuration object below.
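For concreteness, the initial parameters of step A can be gathered into one configuration object, as in the following sketch; the numeric defaults are purely illustrative assumptions, not values prescribed by the patent.

```python
from dataclasses import dataclass

@dataclass
class SelectionParams:
    A: float = 1e-4     # variance threshold (set as small as possible)
    a: float = 1e-4     # variance step
    B: float = 0.05     # correlation coefficient threshold (set small)
    b: float = 0.05     # correlation coefficient step
    T: int = 50         # stop condition control parameter
    E_max: float = 1.0  # maximum error allowed for the learning model
    N_max: int = 20     # maximum feature count allowed (guards overfitting)
```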
The feature selection method of the present invention based on Filter and Wrapper selection algorithms comprises two stages: stage one obtains a feature subset by screening with the variance method and the Pearson correlation coefficient method; stage two processes that subset with the feature space search method, the model learning algorithm, and the feature subset evaluation criterion, finally obtaining the selected feature subset.
In an optional embodiment of the present invention, the above step B calculates the mean and variance of each feature in the dataset with the variance method as:

$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$

where $\bar{X}$ denotes the mean of the feature, $X_i$ denotes the i-th value of the feature, $S^2$ denotes the variance of the feature, and $n$ denotes the number of feature values.
The features that do not diverge are then removed according to the variance threshold A set in step A, specifically: judging whether the variance of a feature is less than the set variance threshold; if so, deleting the feature from the dataset; if not, retaining the feature. Because the second stage further prunes the features screened in the first stage, and in order to avoid weeding out divergent features, the present invention sets the variance threshold as small as possible, removing only the features whose variance is below the threshold, as sketched below.
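The following is a minimal sketch of this stage-one variance screening, assuming the dataset is a NumPy array of shape (samples, features); the function name variance_filter is ours, and the threshold argument follows the parameter A of this embodiment.

```python
import numpy as np

def variance_filter(X, A):
    """Keep only the features whose variance is at least the threshold A.

    X: array of shape (n_samples, n_features)
    A: variance threshold (kept deliberately small, per the embodiment)
    Returns the filtered array and the indices of the kept columns.
    """
    variances = X.var(axis=0)           # per-feature variance S^2
    keep = np.where(variances >= A)[0]  # features that "diverge"
    return X[:, keep], keep
```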
In an optional embodiment of the present invention, the feature variables in the above step C are the input variables of the learning model, and the target variable is the output variable of the learning model. The Pearson correlation coefficient between each feature variable remaining after the screening of step B and the target variable is calculated with the Pearson correlation coefficient method as:

$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$

where $\mathrm{cov}(X, Y)$ denotes the covariance of feature variable X and target variable Y, $\rho_{X,Y}$ denotes the Pearson correlation coefficient of X and Y, and $\sigma_X$ and $\sigma_Y$ denote the standard deviations of X and Y respectively.
Redundant features are then removed according to the correlation coefficient threshold B of the Pearson correlation coefficient method set in step A, specifically: judging whether the Pearson correlation coefficient between a feature and the target variable is less than the set threshold; if so, deleting the feature from the dataset; if not, retaining the feature. Because the second stage further prunes the features screened in the first stage, the correlation coefficient threshold B is likewise set as small as possible, removing only the features whose correlation coefficient is below the threshold; the threshold must be kept small to avoid discarding internally dependent features as redundant. A sketch of this screening is given below.
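A matching sketch of the correlation screening, under the same array assumption; taking the absolute value of the coefficient is our reading of the intent, since strong negative correlation also indicates relevance to the target.

```python
import numpy as np

def pearson_filter(X, y, B):
    """Keep only the features whose |Pearson correlation| with the
    target y is at least the threshold B.

    X: array of shape (n_samples, n_features); y: shape (n_samples,)
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc / len(y)                       # cov(X_j, Y) per feature
    rho = cov / (X.std(axis=0) * y.std() + 1e-12)  # rho = cov / (sigma_X sigma_Y)
    # The patent compares the coefficient to the threshold; |rho| is our reading.
    keep = np.where(np.abs(rho) >= B)[0]
    return X[:, keep], keep
```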
In an optional embodiment of the present invention, the above step D takes all the features remaining after the screening of steps B and C as the complete feature space (denoted M) and inputs it to the improved LVW (Las Vegas Wrapper) feature selection algorithm.
In an optional embodiment of the present invention, the above step E generates a new feature subset with the feature space search method, specifically: a new feature subset, denoted H with feature count m, is generated at random by the random search method; to reduce computational cost, the stop condition control parameter T is set, and the algorithm judges whether the count parameter exceeds T; if so, it proceeds to step I; if not, it proceeds to the next step.
In an optional embodiment of the present invention, the above step F trains the learning model with a neural network; the present invention does not restrict the learning algorithm used for the model, which may include Bayesian networks, genetic algorithms, and the like.
In an optional embodiment of the present invention, the above step G uses cross-validation to compute the error, denoted EH, that the new feature subset generated in step E produces on the learning model generated in step F; the output subset is denoted H*. A sketch of this evaluation is given below.
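As a sketch of step G, the cross-validated error of a candidate subset can be estimated as follows; the choice of scikit-learn's MLPRegressor with mean-squared error is our assumption, since the patent only requires a neural network evaluated by cross-validation.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

def cv_error(X, y, subset, cv=5):
    """Mean cross-validated error E_H of the feature subset on a
    neural-network learning model (here a small MLP regressor)."""
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500)
    # scikit-learn returns negative MSE; flip the sign to get an error.
    scores = cross_val_score(model, X[:, subset], y, cv=cv,
                             scoring="neg_mean_squared_error")
    return -scores.mean()
```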
In an optional embodiment of the present invention, the above step H takes the error of the current feature subset as the evaluation criterion for comparison against the error obtained in step G, and specifically includes the following sub-steps:
H1. Judging whether the error EH of the new feature subset H is less than the error E of the current feature subset and the feature count m of the new subset is less than the maximum feature count Nmax allowed by the learning model; if so, resetting the count parameter t to zero (t = 0), setting the error of the current feature subset to the error of the new subset (E = EH), setting the feature count of the current subset to that of the new subset (k = m), and taking the new subset as the currently selected feature subset (H* = H); if not, incrementing the count parameter (t = t + 1) and returning to step E;
H2. Judging whether the error EH of the new feature subset H equals the error E of the current feature subset and the feature count m of the new subset is less than the feature count k of the current subset; if so, resetting the count parameter to zero (t = 0), setting E = EH and k = m, and taking the new subset as the currently selected feature subset (H* = H); if not, incrementing the count parameter (t = t + 1) and returning to step E.
In an optional embodiment of the present invention, the above step I judges whether the currently selected feature subset H* is empty; if so, the variance threshold A is increased by one variance step a (A = A + a) and the correlation coefficient threshold B by one correlation coefficient step b (B = B + b), and the method returns to step B; if not, the selected feature subset H* is obtained. A sketch of the complete loop, covering steps D to I, is given below.
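Putting steps D to I together, the following sketch shows the control flow of the improved LVW loop of this embodiment; it assumes the variance_filter, pearson_filter, and cv_error sketches above are in scope, and the parameter names (A, a, B, b, T, Nmax, Emax) follow this embodiment. It illustrates the logic under those assumptions rather than reproducing a definitive implementation.

```python
import random

def improved_lvw(X, y, A, a, B, b, T, N_max, E_max, seed=0):
    """Two-stage Filter + Wrapper feature selection (sketch of steps B-I).

    Stage one screens features with the variance and Pearson filters;
    stage two is the improved LVW random search, accepting a subset H
    when it lowers the error (sub-step H1) or matches it with fewer
    features (sub-step H2), capped at N_max features.
    """
    rng = random.Random(seed)
    while True:
        # Steps B-C: Filter-stage screening of the full feature set.
        X1, idx1 = variance_filter(X, A)
        X2, idx2 = pearson_filter(X1, y, B)
        feature_pool = [int(i) for i in idx1[idx2]]  # feature space M
        if not feature_pool:
            raise RuntimeError("all features screened out; thresholds too large")
        # Steps D-H: improved LVW random search over M.
        t, E, k, H_star = 0, E_max, N_max, None
        while t <= T:
            m = rng.randint(1, len(feature_pool))    # step E: random subset H
            H = sorted(rng.sample(feature_pool, m))
            E_H = cv_error(X, y, H)                  # steps F-G
            if (E_H < E and m < N_max) or (E_H == E and m < k):  # H1 or H2
                t, E, k, H_star = 0, E_H, m, H
            else:
                t += 1
        # Step I: if no subset was selected, raise both thresholds by one
        # step (A = A + a, B = B + b) and return to step B; otherwise done.
        if H_star is not None:
            return H_star, E
        A, B = A + a, B + b
```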
The two-stage feature selection method proposed by the present invention combines the advantages of Filter selection algorithms and Wrapper selection algorithms. Because the second-stage Wrapper selection algorithm further prunes the features screened in the first stage, the first-stage Filter selection algorithm is less likely to discard internally dependent features as redundant during selection; at the same time, the first stage removes the non-divergent and redundant features with the variance method and the Pearson correlation coefficient method, which compensates for the low efficiency of Wrapper selection algorithms.
The second stage of the present invention uses a Wrapper selection algorithm in which the subset space is searched with a random search strategy and every searched feature subset must be evaluated by training a model; to reduce computational cost, the present invention performs a preliminary pruning of the features in the first stage with the variance method and the Pearson correlation coefficient method, and provides the stop condition control parameter T in the second-stage algorithm.
The two-stage selection method proposed by the present invention does not restrict the choice of model learning algorithm, which may include support vector machines, ant colony algorithms, and the like.
In the two-stage feature selection method proposed by the present invention, the second stage builds on the LVW (Las Vegas Wrapper) selection algorithm of the Wrapper paradigm and adds a restrictive parameter Nmax that limits the feature count of the output subset, avoiding the overfitting to which Wrapper selection algorithms are prone.
Those of ordinary skill in the art will understand that the embodiments described herein are intended to help the reader understand the principles of the present invention, and it should be understood that the scope of protection of the present invention is not limited to these specific statements and embodiments. Those of ordinary skill in the art can, according to the technical teachings disclosed by the present invention, make various other specific variations and combinations that do not depart from the essence of the present invention, and these variations and combinations remain within the scope of the present invention.

Claims (7)

1. A feature selection method based on Filter and Wrapper selection algorithms, characterized by comprising the following steps:
A. Import the full feature set and set the initial parameters;
B. Compute the mean and variance of each feature in the dataset with the variance method, and remove the features that do not diverge;
C. Compute the Pearson correlation coefficient between each feature variable remaining after step B and the target variable with the Pearson correlation coefficient method, and remove redundant features;
D. Take all features remaining after step C as the complete feature space and process it with the improved LVW feature selection algorithm;
E. Generate a new feature subset with the feature space search method;
F. Train a learning model with a neural network;
G. Compute, by cross-validation, the error that the new feature subset generated in step E produces on the learning model generated in step F;
H. Compare the error obtained in step G against the error of the current feature subset, which serves as the evaluation criterion;
I. Judge whether the current feature subset is empty; if so, increase the variance threshold by one variance step and the correlation coefficient threshold by one correlation coefficient step, and return to step B; if not, obtain the feature subset that completes the selection.
2. The feature selection method based on Filter and Wrapper selection algorithms according to claim 1, characterized in that the mean and variance of each feature in the dataset in the step B are calculated with the variance method as:

$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$

where $\bar{X}$ denotes the mean of the feature, $X_i$ denotes the i-th value of the feature, $S^2$ denotes the variance of the feature, and $n$ denotes the number of feature values.
3. The feature selection method based on Filter and Wrapper selection algorithms according to claim 2, characterized in that removing the features that do not diverge in the step B is specifically: judging whether the variance of a feature is less than the set variance threshold; if so, deleting the feature from the dataset; if not, retaining the feature.
4. The feature selection method based on Filter and Wrapper selection algorithms according to claim 3, characterized in that the Pearson correlation coefficient between each feature variable remaining after the screening of step B and the target variable in the step C is calculated with the Pearson correlation coefficient method as:

$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$

where $\mathrm{cov}(X, Y)$ denotes the covariance of feature variable X and target variable Y, $\rho_{X,Y}$ denotes the Pearson correlation coefficient of X and Y, and $\sigma_X$ and $\sigma_Y$ denote the standard deviations of X and Y respectively.
5. The feature selection method based on Filter and Wrapper selection algorithms according to claim 4, characterized in that removing redundant features in the step C is specifically: judging whether the Pearson correlation coefficient between a feature and the target variable is less than the set correlation coefficient threshold; if so, deleting the feature from the dataset; if not, retaining the feature.
6. The feature selection method based on Filter and Wrapper selection algorithms according to claim 5, characterized in that the step E generates a new feature subset with the feature space search method, specifically: using a random search method, judging whether the count parameter exceeds the set stop condition control parameter; if so, proceeding to step I; if not, proceeding to the next step.
7. The feature selection method based on Filter and Wrapper selection algorithms according to claim 6, characterized in that the step H, in which the error of the current feature subset serves as the evaluation criterion for comparison against the error obtained in step G, specifically includes the following sub-steps:
H1. Judging whether the error of the new feature subset is less than the error of the current feature subset and the feature count of the new subset is less than the maximum feature count allowed by the learning model; if so, resetting the count parameter to zero, setting the error of the current feature subset to the error of the new subset, setting the feature count of the current subset to the feature count of the new subset, and taking the new subset as the current feature subset; if not, incrementing the count parameter by 1 and returning to step E;
H2. Judging whether the error of the new feature subset equals the error of the current feature subset and the feature count of the new subset is less than the feature count of the current subset; if so, resetting the count parameter to zero, setting the error of the current feature subset to the error of the new subset, setting the feature count of the current subset to the feature count of the new subset, and taking the new subset as the current feature subset; if not, incrementing the count parameter by 1 and returning to step E.
CN201810287707.3A 2018-04-03 2018-04-03 Feature selection approach based on Filter and Wrapper selection algorithms Pending CN108509996A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810287707.3A CN108509996A (en) 2018-04-03 2018-04-03 Feature selection approach based on Filter and Wrapper selection algorithms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810287707.3A CN108509996A (en) 2018-04-03 2018-04-03 Feature selection approach based on Filter and Wrapper selection algorithms

Publications (1)

Publication Number Publication Date
CN108509996A true CN108509996A (en) 2018-09-07

Family

ID=63379863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810287707.3A Pending CN108509996A (en) 2018-04-03 2018-04-03 Feature selection approach based on Filter and Wrapper selection algorithms

Country Status (1)

Country Link
CN (1) CN108509996A (en)


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522830A (en) * 2018-11-06 2019-03-26 哈尔滨工程大学 A method of the sonar image feature selecting towards seafloor sediment classification
CN110210559A (en) * 2019-05-31 2019-09-06 北京小米移动软件有限公司 Object screening technique and device, storage medium
CN110210559B (en) * 2019-05-31 2021-10-08 北京小米移动软件有限公司 Object screening method and device and storage medium
CN110378229A (en) * 2019-06-19 2019-10-25 浙江大学 A kind of electronic nose data characteristics selection method based on filter-wrapper frame
CN110378229B (en) * 2019-06-19 2021-07-13 浙江大学 Electronic nose data feature selection method based on filter-wrapper frame
CN110348490A (en) * 2019-06-20 2019-10-18 宜通世纪科技股份有限公司 A kind of soil quality prediction technique and device based on algorithm of support vector machine
CN113743436A (en) * 2020-06-29 2021-12-03 北京沃东天骏信息技术有限公司 Feature selection method and device for generating user portrait
CN111814868A (en) * 2020-07-03 2020-10-23 苏州动影信息科技有限公司 Model based on image omics feature selection, construction method and application
CN115409134A (en) * 2022-11-02 2022-11-29 湖南一二三智能科技有限公司 User electricity utilization safety detection method, system, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180907)