CN106169085A - Feature selection approach based on measure information - Google Patents
Feature selection approach based on measure information
- Publication number: CN106169085A
- Application number: CN201610542270.4A
- Authority
- CN
- China
- Prior art keywords
- formula
- class label
- feature
- features
- make
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention belongs to the technical fields of machine learning and data mining. It proposes a feature selection algorithm based on information measures and verifies experimentally whether there exists a balance coefficient that is generally optimal across several data sets. The technical solution adopted by the invention is a feature selection method based on information measures, with the following steps: using the symmetrical uncertainty SU(f_i; c) between a feature f_i and the class label c, together with the three-way interaction information I(f_i; f_t; c) between two features f_i, f_t and the class label c, an objective function is constructed from these two quantities. In the objective function, f_i is a feature not yet chosen, X is the set of unchosen features, c is the class label, D is the set of selected features f_s for which the maximum of I(f_i; f_s; c) exceeds zero, f_s is the most recently selected feature, f_t is a feature in the subset D, and β is the balance coefficient. The invention is mainly applicable to machine learning and data mining settings.
Description
Technical field
The invention belongs to the technical fields of machine learning and data mining, and relates to a feature selection method based on information measures.
Background art
As an important approach to dimensionality reduction, feature selection chooses, according to some evaluation criterion, a good subset of the original features as the final feature set, thereby reducing the feature dimensionality. According to the relationship between the subset evaluation criterion and the learning algorithm, feature selection algorithms can be divided into filter (Filter), embedded (Embedded) and wrapper (Wrapper) methods. Comparing the three, embedded and wrapper algorithms achieve good selection quality but are time-consuming, while filter algorithms select somewhat less effectively but run quickly, which makes them better suited to high-dimensional data sets. According to the criterion used, filter algorithms can be further divided into algorithms based on information measures, on distance measures, on consistency measures, and on dependency measures. The present invention proposes a feature selection algorithm based on information measures.
For convenience, only feature selection based purely on information measures is analysed here. Research on such methods has mainly proceeded along two lines: one line uses mutual information alone, measuring relevance by the mutual information between a feature and the class label and redundancy by the mutual information between features; the other line combines mutual information with three-way interaction information. Because existing algorithms of the second kind do not combine mutual information with three-way interaction information effectively, their selection quality is unsatisfactory; the present invention therefore studies feature selection of the second kind.
Summary of the invention
To overcome the deficiencies of the prior art, the invention aims to propose a feature selection algorithm based on information measures and to verify experimentally whether there exists a balance coefficient that is generally optimal across several data sets. The technical solution adopted by the invention is a feature selection method based on information measures, whose steps are as follows. Let X, Y, Z be three discrete random variables. The three-way interaction information I(X; Y; Z), the conditional mutual information I(X; Y|Z) and the mutual information I(X; Y) satisfy the relation:
I(X; Y; Z) = I(X; Y|Z) − I(X; Y)   (7)
Symmetrical uncertainty SU (Symmetrical Uncertainty) is used to normalise the mutual information; the SU value of a feature f_i and the class label c is:
SU(f_i; c) = 2 I(f_i; c) / (H(f_i) + H(c))   (8)
where H(f_i) is the entropy of feature f_i, H(c) is the entropy of the class label c, and I(f_i; c) is the mutual information between f_i and c.
Applying formula (7) with X = f_i, Y = f_t, Z = c gives formula (9):
I(f_i; f_t; c) = I(f_i; f_t|c) − I(f_i; f_t)   (9)
where I(f_i; f_t; c) is the three-way interaction information between the two features f_i, f_t and the class label c, I(f_i; f_t|c) is the mutual information between f_i and f_t given the class label c, and I(f_i; f_t) is the mutual information between f_i and f_t.
Using the two quantities SU(f_i; c) and I(f_i; f_t; c), the objective function is built as:
f_s = argmax over f_i ∈ X of [ SU(f_i; c) + β Σ_{f_t ∈ D} I(f_i; f_t; c) ]   (10)
In the formula above, f_i is a feature not yet chosen, X is the set of unchosen features, c is the class label, D is the set of selected features f_s for which the maximum of I(f_i; f_s; c) exceeds zero, f_s is the most recently selected feature, f_t is a feature in the subset D, and β is the balance coefficient.
β takes one of the values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0.
The concrete steps are further refined as follows.
Step 1: Call the WEKA software and discretise the data features with the minimum-description-length discretisation method.
Step 2: Initialise S, D and X: let S and D be empty sets and X be the set of all features of the data set.
Step 3: Set X = f_i in formula (1) and use formula (1) to compute the entropy H(f_i) of every feature in X:
H(X) = −Σ_x p(x) log p(x)   (1)
where p(x) is the probability density function of the variable x.
Step 4: Set X = c in formula (1) and compute the entropy H(c) of the class label c.
Step 5: Set X = f_i, Y = c in formula (3) and compute the mutual information I(f_i; c) between every feature in X and the class label:
I(Y; X) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ]   (3)
where p(x) and p(y) are the probability density functions of X and Y respectively, I(Y; X) is the mutual information of Y and X, and p(x, y) is the joint probability density function of X and Y.
Step 6: Use formula (8) to compute the SU value between every feature in X and the class label, where H(f_i) is the entropy of feature f_i, H(c) is the entropy of the class label c, and I(f_i; c) is the mutual information between f_i and c.
Step 7: Take the feature f_i with the largest SU value with the class label out of X, put it into S, and let f_s = f_i.
Step 8: Set β = 0.1.
Step 9: Set X = f_i, Y = f_t, Z = c in formula (6) and compute the conditional mutual information I(f_i; f_t|c) between every feature in X and f_s given the class label c:
I(X; Y|Z) = Σ_x Σ_y Σ_z p(x, y, z) log [ p(x, y|z) / (p(x|z) p(y|z)) ]   (6)
where p(x, y, z) is the joint probability density function of X, Y and Z, p(x, y|z) is the joint probability density function of X and Y given Z = z, p(x|z) is the probability density function of X given Z = z, and p(y|z) is the probability density function of Y given Z = z.
Step 10: Set X = f_i, Y = f_t in formula (3) and compute the mutual information I(f_i; f_t) between every feature in X and f_s.
Step 11: Use formula (9) to compute the three-way interaction information between every feature in X, the class label c, and f_s:
I(f_i; f_t; c) = I(f_i; f_t|c) − I(f_i; f_t)   (9)
where I(f_i; f_t; c) is the three-way interaction information between the two features f_i, f_t and the class label c, I(f_i; f_t|c) is the mutual information between f_i and f_t given the class label c, and I(f_i; f_t) is the mutual information between f_i and f_t.
Step 12: If the maximum of the three-way interaction information is greater than zero, carry out steps 13 to 19; otherwise carry out steps 20 to 24.
Step 13: Put f_s into D.
Step 14: Set X = f_i in formula (1) and compute the entropy H(f_i) of every feature in X.
Step 15: Set X = c in formula (1) and compute the entropy H(c) of the class label.
Step 16: Set X = f_i, Y = c in formula (3) and compute the mutual information I(f_i; c) between every feature in X and the class label.
Step 17: Use formula (8) to compute the SU value between every feature in X and the class label.
Step 18: Using the results of formulas (8) and (9), evaluate formula (10).
Step 19: Take the feature f_i that maximises formula (10) out of X, put it into S, and let f_s = f_i.
Step 20: Set X = f_i in formula (1) and compute the entropy H(f_i) of every feature in X.
Step 21: Set X = c in formula (1) and compute the entropy H(c) of the class label.
Step 22: Set X = f_i, Y = c in formula (3) and compute the mutual information I(f_i; c) between every feature in X and the class label.
Step 23: Use formula (8) to compute the SU value between every feature in X and the class label.
Step 24: Take the feature f_i with the largest SU value with the class label out of X, put it into S, and let f_s = f_i.
Step 25: Repeat steps 9 to 12 until |S| features have been selected; S is the feature subset chosen by the algorithm and |S| is its size. N is set to 30: when a data set has more than 30 features, |S| is 30; for other data sets, |S| equals the number of features in the data set. The order in which features are put into S is the selection order of the algorithm.
Step 26: Set β to 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 in turn, and carry out steps 9 to 12 and 25 for each value.
The method further includes step 27: using the WEKA software to test the performance of the selected features. The concrete steps are:
Step 27.1: Using WEKA, choose the first 1, first 2, …, first |S| features of S.
Step 27.2: Test the chosen features with the C4.5 classifier and ten-fold cross-validation.
Step 27.3: Run each group of experiments 10 times and take the mean as the final result; the feature count corresponding to the highest accuracy within a group is the number of finally selected features.
Step 27.4: Replace the C4.5 classifier in step 27.2 with the one-nearest-neighbour instance-based classifier (IB1), the partial-decision-tree rule learner (PART) and the naive Bayes (Bayesian) classifier, and carry out steps 27.1, 27.2 and 27.3 with each.
The features and benefits of the present invention are:
1) The invention proposes a feature selection algorithm based on information measures; compared with several existing algorithms of the same type, the algorithm improves feature selection quality to a certain extent.
2) The invention balances the two quantities — the SU value between a feature and the class label, and the three-way interaction information between two features and the class label — and uses the experimental results to verify whether there exists a balance coefficient that is generally optimal across several data sets.
3) The proposed feature selection algorithm has practical value; it can be applied in fields such as digital image processing and computer vision.
Description of the drawings:
Fig. 1 gives the block diagram of the proposed feature selection algorithm based on information measures.
Fig. 2 gives the feature selection results of four classifiers on the Vehicle data set with β = 0.2, where (a) is the result for the C4.5 classifier, (b) for the IB1 classifier, (c) for the PART classifier, and (d) for the Bayesian classifier.
Fig. 3 gives the feature selection results of four classifiers on the Movement_libras data set with β = 0.2, where (a) is the result for the C4.5 classifier, (b) for the IB1 classifier, (c) for the PART classifier, and (d) for the Bayesian classifier.
Detailed description of the invention
The present invention first normalises the mutual information between the class label and each feature with symmetrical uncertainty (Symmetrical Uncertainty, SU); it then balances the normalised SU value against the three-way interaction information between two features and the class label. On this basis it proposes a feature selection algorithm based on information measures and verifies experimentally whether there exists a balance coefficient that is generally optimal across several data sets.
Using the two quantities — the SU value between a feature and the class label, and the three-way interaction information between two features and the class label — the invention proposes a feature selection algorithm based on information measures. The concrete technical scheme is detailed below.
1.1 Background knowledge of information measures
For convenience of exposition, only discrete random variables are treated. Suppose X is a discrete random variable with probability density function p(x). Information entropy is commonly used to express the amount of information obtained; the entropy H(X) can be expressed as:
H(X) = −Σ_x p(x) log p(x)   (1)
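As a numerical illustration of formula (1), the sketch below estimates H(X) from a sample by using empirical frequencies in place of the true probability density function; the function name `entropy` is illustrative, not part of the patent.

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical estimate of formula (1): H(X) = -sum_x p(x) * log p(x),
    with p(x) taken as the relative frequency of x in the sample."""
    n = len(xs)
    return -sum((k / n) * log2(k / n) for k in Counter(xs).values())

print(entropy([0, 1, 0, 1]))  # → 1.0: a fair binary variable carries one bit
print(entropy([1, 1, 1, 1]))  # a constant variable carries zero bits
```

A skewed sample such as `[0, 0, 0, 1]` falls between these extremes, at about 0.81 bits.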
For variables x and y obeying the joint distribution p(x, y), the joint entropy H(X, Y) can be expressed as:
H(X, Y) = −Σ_x Σ_y p(x, y) log p(x, y)   (2)
Mutual information is commonly used to quantify the information shared by two variables. The mutual information I(X; Y) of X and Y can be expressed as:
I(X; Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ]   (3)
I(X; Y) = I(Y; X)   (4)
where p(x) and p(y) are the probability density functions of X and Y respectively, and I(Y; X) is the mutual information of Y and X.
The mutual information I(X; Y) relates to the entropy H(X) of X, the entropy H(Y) of Y and the joint entropy H(X, Y) as follows:
I(X; Y) = H(X) + H(Y) − H(X, Y)   (5)
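Formula (5) gives a convenient way to compute mutual information from entropy estimates alone. A minimal sketch, assuming the empirical `entropy` estimator shown earlier (both function names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical entropy estimate per formula (1)."""
    n = len(xs)
    return -sum((k / n) * log2(k / n) for k in Counter(xs).values())

def mutual_information(xs, ys):
    """Formula (5): I(X;Y) = H(X) + H(Y) - H(X,Y); the joint entropy
    is computed over the paired sample."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

x = [0, 0, 1, 1]
print(mutual_information(x, x))             # → 1.0: a variable shares all of H(X) with itself
print(mutual_information(x, [0, 1, 0, 1]))  # → 0.0: empirically independent variables share nothing
```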
Conditional mutual information is used to quantify the information shared by two variables when a third variable is known. I(X; Y|Z) can be expressed as:
I(X; Y|Z) = Σ_x Σ_y Σ_z p(x, y, z) log [ p(x, y|z) / (p(x|z) p(y|z)) ]   (6)
where p(x, y, z) is the joint probability density function of X, Y and Z, p(x, y|z) is the joint probability density function of X and Y given Z = z, p(x|z) is the probability density function of X given Z = z, and p(y|z) is the probability density function of Y given Z = z.
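Formula (6) can equivalently be computed from joint entropies, since I(X; Y|Z) = H(X,Z) + H(Y,Z) − H(Z) − H(X,Y,Z). A sketch using that identity (the empirical `entropy` estimator and all names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical entropy estimate per formula (1)."""
    n = len(xs)
    return -sum((k / n) * log2(k / n) for k in Counter(xs).values())

def conditional_mi(xs, ys, zs):
    """I(X;Y|Z) via H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z),
    algebraically equal to the direct sum in formula (6)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(zs) - entropy(list(zip(xs, ys, zs))))

# XOR: X and Y are pairwise independent, yet given Z = X xor Y they share one bit.
x, y = [0, 0, 1, 1], [0, 1, 0, 1]
z = [a ^ b for a, b in zip(x, y)]
print(conditional_mi(x, y, z))  # → 1.0
```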
Three-way interaction information is an extension of mutual information. For three discrete random variables X, Y, Z, the three-way interaction information I(X; Y; Z) relates to the conditional mutual information I(X; Y|Z) and the mutual information I(X; Y) as follows:
I(X; Y; Z) = I(X; Y|Z) − I(X; Y)   (7)
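Formula (7) then follows by a single subtraction, and its sign separates synergistic pairs (positive) from redundant ones (negative) with respect to Z. A self-contained sketch with illustrative names and the empirical estimators above:

```python
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum((k / n) * log2(k / n) for k in Counter(xs).values())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def conditional_mi(xs, ys, zs):
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(zs) - entropy(list(zip(xs, ys, zs))))

def interaction_information(xs, ys, zs):
    """Formula (7): I(X;Y;Z) = I(X;Y|Z) - I(X;Y)."""
    return conditional_mi(xs, ys, zs) - mutual_information(xs, ys)

x, y = [0, 0, 1, 1], [0, 1, 0, 1]
print(interaction_information(x, y, [a ^ b for a, b in zip(x, y)]))  # → 1.0, synergy
print(interaction_information(x, x, x))                              # → -1.0, redundancy
```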
1.2 A feature selection algorithm based on information measures
The mutual information I(f_i; c) between a feature and the class label expresses how strongly the feature f_i correlates with the class label c: the larger the mutual information, the more relevant the feature is to the class label.
The three-way interaction information I(f_i; f_s; c) between two features and the class label is the difference between the information about the class label c obtained from the two features f_i and f_s jointly and the sum of the information about the class label obtained from f_i and f_s separately. When I(f_i; f_s; c) > 0, the two features f_i and f_s act synergistically: the information about the class label obtained from f_i and f_s jointly exceeds the sum of the information obtained from each alone. When I(f_i; f_s; c) < 0, the two features are redundant: the information about the class label obtained jointly is less than the sum of the information obtained separately.
From the above it can be seen that when the mutual information I(f_i; c) is large and the three-way interaction information I(f_i; f_s; c) is greater than zero, more information can be obtained from the selected feature; when I(f_i; c) is small and I(f_i; f_s; c) is less than zero, less information is obtained from the selected feature; the other two combinations yield an intermediate amount of information.
If mutual information alone is used, the selection process prefers features with large mutual information values, yet such features are not necessarily salient ones; symmetrical uncertainty (SU) is therefore adopted to normalise the mutual information. The SU value of a feature f_i and the class label c is:
SU(f_i; c) = 2 I(f_i; c) / (H(f_i) + H(c))   (8)
where H(f_i) is the entropy of feature f_i, H(c) is the entropy of the class label c, and I(f_i; c) is the mutual information between f_i and c.
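Formula (8) keeps the SU value in [0, 1]. A sketch reusing the empirical estimators above; `symmetrical_uncertainty` is an illustrative name, and the zero-denominator guard for constant variables is an added assumption:

```python
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum((k / n) * log2(k / n) for k in Counter(xs).values())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def symmetrical_uncertainty(f, c):
    """Formula (8): SU(f;c) = 2*I(f;c) / (H(f) + H(c))."""
    denom = entropy(f) + entropy(c)
    return 2.0 * mutual_information(f, c) / denom if denom > 0 else 0.0

c = [0, 0, 1, 1]
print(symmetrical_uncertainty([0, 0, 1, 1], c))  # → 1.0: feature determines the label
print(symmetrical_uncertainty([0, 1, 0, 1], c))  # → 0.0: feature is uninformative here
```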
Applying formula (7) with X = f_i, Y = f_t, Z = c gives formula (9):
I(f_i; f_t; c) = I(f_i; f_t|c) − I(f_i; f_t)   (9)
where I(f_i; f_t; c) is the three-way interaction information between the two features f_i, f_t and the class label c, I(f_i; f_t|c) is the mutual information between f_i and f_t given the class label c, and I(f_i; f_t) is the mutual information between f_i and f_t.
Therefore, using the two quantities SU(f_i; c) and I(f_i; f_t; c), a feature selection algorithm based on information measures is proposed. The objective function of the algorithm is:
f_s = argmax over f_i ∈ X of [ SU(f_i; c) + β Σ_{f_t ∈ D} I(f_i; f_t; c) ]   (10)
In the formula above, f_i is a feature not yet chosen, X is the set of unchosen features, c is the class label, D is the set of selected features f_s for which the maximum of I(f_i; f_s; c) exceeds zero, f_s is the most recently selected feature, f_t is a feature in the subset D, and β is the balance coefficient. Generally speaking, the SU value between a feature and the class label matters more than the three-way interaction information between two features and the class label. For simplicity, β takes the ten values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0.
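The scoring side of formula (10) can be sketched as a single function. The exact combination in the original formula image is not reproduced in this text, so the sketch below assumes an unweighted sum of interaction terms over D scaled by β — an assumption consistent with the symbol definitions, not a confirmed reading; all function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum((k / n) * log2(k / n) for k in Counter(xs).values())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def conditional_mi(xs, ys, zs):
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(zs) - entropy(list(zip(xs, ys, zs))))

def interaction_information(xs, ys, zs):
    return conditional_mi(xs, ys, zs) - mutual_information(xs, ys)

def symmetrical_uncertainty(f, c):
    denom = entropy(f) + entropy(c)
    return 2.0 * mutual_information(f, c) / denom if denom > 0 else 0.0

def objective(f_i, c, D, beta):
    """Hypothetical reading of formula (10): relevance (SU) plus a
    beta-weighted sum of three-way interactions with each feature in D."""
    return (symmetrical_uncertainty(f_i, c)
            + beta * sum(interaction_information(f_i, f_t, c) for f_t in D))

print(objective([0, 0, 1, 1], [0, 0, 1, 1], [], 0.1))  # → 1.0: empty D reduces to pure SU
```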
The flow of the algorithm is as follows:
Input: N, the number of features to select;
Output: S, the ordered selected features.
The invention is described in detail below with reference to the algorithm block diagram and flow chart.
As shown in Fig. 1, the invention provides a feature selection algorithm based on information measures, run on the MATLAB platform. It comprises the following steps.
Step 1: Call the WEKA software and discretise the data features with the minimum-description-length discretisation method.
Step 2: Initialise S, D and X: let S and D be empty sets and X be the set of all features of the data set.
Step 3: Set X = f_i in formula (1) and compute the entropy H(f_i) of every feature in X.
Step 4: Set X = c in formula (1) and compute the entropy H(c) of the class label.
Step 5: Set X = f_i, Y = c in formula (3) and compute the mutual information I(f_i; c) between every feature in X and the class label.
Step 6: Use formula (8) to compute the SU value between every feature in X and the class label.
Step 7: Take the feature f_i with the largest SU value with the class label out of X, put it into S, and let f_s = f_i.
Step 8: Set β = 0.1.
Step 9: Set X = f_i, Y = f_t, Z = c in formula (6) and compute the conditional mutual information I(f_i; f_t|c) between every feature in X and f_s given the class label c.
Step 10: Set X = f_i, Y = f_t in formula (3) and compute the mutual information I(f_i; f_t) between every feature in X and f_s.
Step 11: Use formula (9) to compute the three-way interaction information between every feature in X, the class label c, and f_s.
Step 12: If the maximum of the three-way interaction information is greater than zero, carry out steps 13 to 19; otherwise carry out steps 20 to 24.
Step 13: Put f_s into D.
Step 14: Set X = f_i in formula (1) and compute the entropy H(f_i) of every feature in X.
Step 15: Set X = c in formula (1) and compute the entropy H(c) of the class label.
Step 16: Set X = f_i, Y = c in formula (3) and compute the mutual information I(f_i; c) between every feature in X and the class label.
Step 17: Use formula (8) to compute the SU value between every feature in X and the class label.
Step 18: Using the results of formulas (8) and (9), evaluate formula (10).
Step 19: Take the feature f_i that maximises formula (10) out of X, put it into S, and let f_s = f_i.
Step 20: Set X = f_i in formula (1) and compute the entropy H(f_i) of every feature in X.
Step 21: Set X = c in formula (1) and compute the entropy H(c) of the class label.
Step 22: Set X = f_i, Y = c in formula (3) and compute the mutual information I(f_i; c) between every feature in X and the class label.
Step 23: Use formula (8) to compute the SU value between every feature in X and the class label.
Step 24: Take the feature f_i with the largest SU value with the class label out of X, put it into S, and let f_s = f_i.
Step 25: Repeat steps 9 to 12 until |S| features have been selected; S is the feature subset chosen by the algorithm and |S| is its size. N is set to 30: when a data set has more than 30 features, |S| is 30; for other data sets, |S| equals the number of features in the data set. The order in which features are put into S is the selection order of the algorithm.
Step 26: Set β to 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 in turn, and carry out steps 9 to 12 and 25 for each value.
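Steps 2 to 25 can be summarised as the greedy loop below. This is a sketch under assumptions, not the patented implementation: features are passed as already-discretised value lists (step 1 is assumed done), the objective's combination rule is assumed as in the sketch of formula (10), and all names and the toy data are illustrative.

```python
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum((k / n) * log2(k / n) for k in Counter(xs).values())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def conditional_mi(xs, ys, zs):
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(zs) - entropy(list(zip(xs, ys, zs))))

def interaction_information(xs, ys, zs):
    return conditional_mi(xs, ys, zs) - mutual_information(xs, ys)

def symmetrical_uncertainty(f, c):
    denom = entropy(f) + entropy(c)
    return 2.0 * mutual_information(f, c) / denom if denom > 0 else 0.0

def select_features(features, c, beta=0.1, n=30):
    """Greedy selection mirroring steps 2-25.
    features: dict mapping name -> list of discrete values; c: class labels."""
    X = dict(features)                 # step 2: X holds the unchosen features
    S, D = [], []                      # S: selection order; D: "synergy" set
    f_s = max(X, key=lambda f: symmetrical_uncertainty(X[f], c))  # steps 3-7
    S.append(f_s)
    fs_vals = X.pop(f_s)
    while X and len(S) < n:
        # steps 9-11: three-way interaction of every unchosen feature with f_s
        inter = {f: interaction_information(X[f], fs_vals, c) for f in X}
        if max(inter.values()) > 0:    # step 12
            D.append(fs_vals)          # step 13
            # steps 14-19: maximise SU plus beta-weighted interaction with D
            f_s = max(X, key=lambda f: symmetrical_uncertainty(X[f], c)
                      + beta * sum(interaction_information(X[f], d, c) for d in D))
        else:
            # steps 20-24: fall back to SU alone
            f_s = max(X, key=lambda f: symmetrical_uncertainty(X[f], c))
        S.append(f_s)
        fs_vals = X.pop(f_s)
    return S

c = [0, 0, 1, 1]
toy = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1], "n": [0, 0, 0, 1]}
print(select_features(toy, c))  # "a" (SU = 1) is taken first
```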
Step 27: Use the WEKA software to test the performance of the selected features.
Step 27.1: Using WEKA, choose the first 1, first 2, …, first |S| features of S.
Step 27.2: Test the chosen features with the C4.5 classifier and ten-fold cross-validation.
Step 27.3: Run each group of experiments 10 times and take the mean as the final result; the feature count corresponding to the highest accuracy within a group is the number of finally selected features.
Step 27.4: Replace the C4.5 classifier in step 27.2 with the one-nearest-neighbour instance-based classifier (Instance Base 1, IB1), the partial-decision-tree rule learner (PART) and the naive Bayes (Bayesian) classifier, and carry out steps 27.1, 27.2 and 27.3 with each.
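Step 27 relies on WEKA. As a library-free illustration of the same protocol, the sketch below scores a candidate feature subset with a one-nearest-neighbour classifier (a stand-in for IB1) under k-fold cross-validation; the fold assignment, toy data and all names are illustrative assumptions rather than the evaluation actually used.

```python
def one_nn_predict(train_rows, train_labels, row):
    """Predict the label of `row` from its nearest training row (Hamming distance)."""
    return min(
        (sum(a != b for a, b in zip(t, row)), lab)
        for t, lab in zip(train_rows, train_labels)
    )[1]

def cross_val_accuracy(rows, labels, k=10):
    """Mean accuracy over k folds (the ten-fold protocol of step 27.2);
    sample i is assigned to fold i mod k."""
    n = len(rows)
    correct = 0
    for fold in range(k):
        train = [(r, l) for i, (r, l) in enumerate(zip(rows, labels)) if i % k != fold]
        tr_rows = [r for r, _ in train]
        tr_labels = [l for _, l in train]
        for i in range(fold, n, k):
            correct += one_nn_predict(tr_rows, tr_labels, rows[i]) == labels[i]
    return correct / n

# Toy data: the single kept feature equals the class label, so 1-NN is perfect.
rows = [[v] for v in [0, 1] * 5]
labels = [0, 1] * 5
print(cross_val_accuracy(rows, labels))  # → 1.0
```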
Claims (4)
1. A feature selection method based on information measures, characterised in that the steps are as follows: let X, Y, Z be three discrete random variables; the three-way interaction information I(X; Y; Z), the conditional mutual information I(X; Y|Z) and the mutual information I(X; Y) satisfy:
I(X; Y; Z) = I(X; Y|Z) − I(X; Y)   (7)
symmetrical uncertainty SU (Symmetrical Uncertainty) is used to normalise the mutual information; the SU value of a feature f_i and the class label c is:
SU(f_i; c) = 2 I(f_i; c) / (H(f_i) + H(c))   (8)
where H(f_i) is the entropy of feature f_i, H(c) is the entropy of the class label c, and I(f_i; c) is the mutual information between f_i and c;
applying formula (7) with X = f_i, Y = f_t, Z = c gives formula (9):
I(f_i; f_t; c) = I(f_i; f_t|c) − I(f_i; f_t)   (9)
where I(f_i; f_t; c) is the three-way interaction information between the two features f_i, f_t and the class label c, I(f_i; f_t|c) is the mutual information between f_i and f_t given the class label c, and I(f_i; f_t) is the mutual information between f_i and f_t;
using the two quantities SU(f_i; c) and I(f_i; f_t; c), the objective function is built as:
f_s = argmax over f_i ∈ X of [ SU(f_i; c) + β Σ_{f_t ∈ D} I(f_i; f_t; c) ]   (10)
in the formula above, f_i is a feature not yet chosen, X is the set of unchosen features, c is the class label, D is the set of selected features f_s for which the maximum of I(f_i; f_s; c) exceeds zero, f_s is the most recently selected feature, f_t is a feature in the subset D, and β is the balance coefficient.
2. The feature selection method based on information measures of claim 1, characterised in that β takes one of the values 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0.
3. The feature selection method based on information measures of claim 1, characterised in that the concrete steps are further refined as:
Step 1: call the WEKA software and discretise the data features with the minimum-description-length discretisation method;
Step 2: initialise S, D and X, letting S and D be empty sets and X be the set of all features of the data set;
Step 3: set X = f_i in formula (1) and use formula (1) to compute the entropy H(f_i) of every feature in X:
H(X) = −Σ_x p(x) log p(x)   (1)
where p(x) is the probability density function of the variable x;
Step 4: set X = c in formula (1) and compute the entropy H(c) of the class label c;
Step 5: set X = f_i, Y = c in formula (3) and compute the mutual information I(f_i; c) between every feature in X and the class label:
I(Y; X) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ]   (3)
where p(x) and p(y) are the probability density functions of X and Y respectively, I(Y; X) is the mutual information of Y and X, and p(x, y) is the joint probability density function of X and Y;
Step 6: use formula (8) to compute the SU value between every feature in X and the class label, where H(f_i) is the entropy of feature f_i, H(c) is the entropy of the class label c, and I(f_i; c) is the mutual information between f_i and c;
Step 7: take the feature f_i with the largest SU value with the class label out of X, put it into S, and let f_s = f_i;
Step 8: set β = 0.1;
Step 9: set X = f_i, Y = f_t, Z = c in formula (6) and compute the conditional mutual information I(f_i; f_t|c) between every feature in X and f_s given the class label c:
I(X; Y|Z) = Σ_x Σ_y Σ_z p(x, y, z) log [ p(x, y|z) / (p(x|z) p(y|z)) ]   (6)
where p(x, y, z) is the joint probability density function of X, Y and Z, p(x, y|z) is the joint probability density function of X and Y given Z = z, p(x|z) is the probability density function of X given Z = z, and p(y|z) is the probability density function of Y given Z = z;
Step 10: set X = f_i, Y = f_t in formula (3) and compute the mutual information I(f_i; f_t) between every feature in X and f_s;
Step 11: use formula (9) to compute the three-way interaction information between every feature in X, the class label c, and f_s:
I(f_i; f_t; c) = I(f_i; f_t|c) − I(f_i; f_t)   (9)
where I(f_i; f_t; c) is the three-way interaction information between the two features f_i, f_t and the class label c, I(f_i; f_t|c) is the mutual information between f_i and f_t given the class label c, and I(f_i; f_t) is the mutual information between f_i and f_t;
Step 12: if the maximum of the three-way interaction information is greater than zero, carry out steps 13 to 19; otherwise carry out steps 20 to 24;
Step 13: put f_s into D;
Step 14: set X = f_i in formula (1) and compute the entropy H(f_i) of every feature in X;
Step 15: set X = c in formula (1) and compute the entropy H(c) of the class label;
Step 16: set X = f_i, Y = c in formula (3) and compute the mutual information I(f_i; c) between every feature in X and the class label;
Step 17: use formula (8) to compute the SU value between every feature in X and the class label;
Step 18: using the results of formulas (8) and (9), evaluate formula (10);
Step 19: take the feature f_i that maximises formula (10) out of X, put it into S, and let f_s = f_i;
Step 20: set X = f_i in formula (1) and compute the entropy H(f_i) of every feature in X;
Step 21: set X = c in formula (1) and compute the entropy H(c) of the class label;
Step 22: set X = f_i, Y = c in formula (3) and compute the mutual information I(f_i; c) between every feature in X and the class label;
Step 23: use formula (8) to compute the SU value between every feature in X and the class label;
Step 24: take the feature f_i with the largest SU value with the class label out of X, put it into S, and let f_s = f_i;
Step 25: repeat steps 9 to 12 until |S| features have been selected, where S is the feature subset chosen by the algorithm and |S| is its size; N is set to 30: when a data set has more than 30 features, |S| is 30, and for other data sets |S| equals the number of features in the data set; the order in which features are put into S is the selection order of the algorithm;
Step 26: set β to 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 in turn, and carry out steps 9 to 12 and 25 for each value.
4. The feature selection method based on information measures of claim 3, characterised by further comprising step 27: using the WEKA software to test the performance of the selected features, with the concrete steps:
Step 27.1: using WEKA, choose the first 1, first 2, …, first |S| features of S;
Step 27.2: test the chosen features with the C4.5 classifier and ten-fold cross-validation;
Step 27.3: run each group of experiments 10 times and take the mean as the final result; the feature count corresponding to the highest accuracy within a group is the number of finally selected features;
Step 27.4: replace the C4.5 classifier in step 27.2 with the one-nearest-neighbour instance-based classifier (IB1), the partial-decision-tree rule learner (PART) and the naive Bayes (Bayesian) classifier, and carry out steps 27.1, 27.2 and 27.3 with each.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610542270.4A CN106169085A (en) | 2016-07-11 | 2016-07-11 | Feature selection approach based on measure information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106169085A true CN106169085A (en) | 2016-11-30 |
Family
ID=58064881
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610542270.4A Pending CN106169085A (en) | 2016-07-11 | 2016-07-11 | Feature selection approach based on measure information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106169085A (en) |
- 2016-07-11: CN application CN201610542270.4A filed (published as CN106169085A, status pending)
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110135508A (en) * | 2019-05-21 | 2019-08-16 | Tencent Technology (Shenzhen) Co., Ltd. | Model training method, device, electronic equipment and computer readable storage medium |
CN110135508B (en) * | 2019-05-21 | 2022-11-29 | Tencent Technology (Shenzhen) Co., Ltd. | Model training method and device, electronic equipment and computer readable storage medium |
CN110298398A (en) * | 2019-06-25 | 2019-10-01 | Dalian University | Wireless protocol frame feature selection method based on improved mutual information |
CN110298398B (en) * | 2019-06-25 | 2021-08-03 | Dalian University | Wireless protocol frame characteristic selection method based on improved mutual information |
CN110942149A (en) * | 2019-10-31 | 2020-03-31 | Hohai University | Feature variable selection method based on information change rate and conditional mutual information |
CN110942149B (en) * | 2019-10-31 | 2020-09-22 | Hohai University | Feature variable selection method based on information change rate and conditional mutual information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sun et al. | Feature selection using rough entropy-based uncertainty measures in incomplete decision systems | |
CN103257921B (en) | Improved random forest algorithm based system and method for software fault prediction | |
CN102298579A (en) | Scientific and technical literature-oriented model and method for sequencing papers, authors and periodicals | |
Duncan et al. | Stochastic integration for fractional Brownian motion in a Hilbert space | |
CN105183796A (en) | Distributed link prediction method based on clustering | |
CN106169085A (en) | Feature selection approach based on measure information | |
Huang et al. | An extended nonmonotone line search technique for large-scale unconstrained optimization | |
CN102147727A (en) | Method for predicting software workload of newly-added software project | |
CN106327340A (en) | Method and device for detecting abnormal node set in financial network | |
CN105825430A (en) | Heterogeneous social network-based detection method | |
Bohmann et al. | Constructing equivariant spectra via categorical Mackey functors | |
CN108509388A (en) | Feature selection approach based on maximal correlation minimal redundancy and sequence | |
CN108345567A (en) | Feature selection approach based on conditional mutual information | |
Vallecillo et al. | Adding Random Operations to OCL. | |
CN105447241B (en) | An ESOP minimization method for logic functions of digital logic circuits |
CN105138527A (en) | Data classification regression method and data classification regression device | |
CN109801073A (en) | Risk subscribers recognition methods, device, computer equipment and storage medium | |
CN105824937A (en) | Attribute selection method based on binary system firefly algorithm | |
CN103279549B (en) | Method and device for acquiring target data of a target object |
CN104899283A (en) | Frequent sub-graph mining and optimizing method for single uncertain graph | |
CN104657473A (en) | Large-scale data mining method capable of guaranteeing quality monotonicity |
CN105022798A (en) | Categorical data mining method of discrete Bayesian network on the basis of prediction relationship | |
Qu et al. | Meta‐modeling of fractional constitutive relationships for rocks based on physics‐induced machine learning | |
CN104866588A (en) | Frequent sub-graph mining method aiming at individual uncertain graph | |
Atalay et al. | Circular success and failure runs in a sequence of exchangeable binary trials |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161130 |