CN105975589A - Feature selection method and device of high-dimension data - Google Patents
Feature selection method and device of high-dimension data
- Publication number
- CN105975589A (application CN201610298079.XA)
- Authority
- CN
- China
- Prior art keywords
- feature
- value
- mic
- maximum information
- information coefficient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Abstract
The invention discloses a feature selection method and device for high-dimensional data. The method comprises the following steps: obtaining a raw data set to be processed, wherein the raw data set comprises a feature set, a number of samples and a class set, and the class set comprises the class of each sample; calculating the maximum information coefficient (MIC) between each feature in the feature set and the class set, and the redundancy value between each feature and the selected feature subset; and obtaining the effective value of each feature according to the MIC and the redundancy value, and selecting a feature subset from the feature set according to the effective value. The MIC is introduced into feature selection, and each feature is evaluated on the basis of the MIC so that features are selected according to the effective value produced by the evaluation. Compared with the prior art, the feature selection method effectively improves the accuracy of feature selection on high-dimensional data.
Description
Technical field
The present invention relates to the field of data mining technology, and in particular to a feature selection method and device for high-dimensional data.
Background art
The rapidly developing information society produces massive amounts of data every day, and quickly mining valuable information from these data has become an urgent problem. Researchers have addressed this problem from the perspective of machine learning models and have achieved remarkable breakthroughs. However, highly complex models and high-dimensional feature spaces are increasingly unable to meet the pressing demands of big-data applications, and feature spaces often contain a large amount of useless information. Only by using an appropriate feature selection method can effective features be obtained from massive data, thereby improving the efficiency and accuracy with which machine learning models process data; at the same time, feature selection can also prevent model over-fitting and remove noise. Accordingly, as an important preprocessing step of machine learning and data mining, feature selection has always been a research hotspot in the field of machine learning.
The choice of the evaluation criterion and the search algorithm is crucial for feature selection. Commonly used evaluation criteria include distance-based, information-theoretic and consistency-based criteria. Distance-based criteria and metrics such as the Pearson coefficient can only measure the linear relationship between variables, whereas metrics such as information gain and mutual information can also measure non-linear relationships. When generating a feature subset, a corresponding search algorithm is usually required; many search strategies based on the approximate Markov blanket condition perform very well in terms of both computational complexity and the classification accuracy of the selected features. However, they also have an obvious shortcoming: they cannot take the redundancy between a feature and the feature subset into account.
Summary of the invention
In view of the defects in the prior art, the present invention provides a feature selection method and device for high-dimensional data. Whereas the metrics used in current techniques can only measure linear or simple non-linear relationships between variables, the maximum information coefficient (MIC) is introduced into feature selection: MIC can measure a much wider range of linear and non-linear relationships between variables, and can even measure non-functional relationships that cannot be expressed by a single function. Although MIC is very effective as a measure between variables, it can only measure the dependency and redundancy between two single variables. A new measure, mMIC (the effective value), is therefore proposed herein and applied together with the approximate Markov blanket condition, so as to solve the problem in the prior art that feature selection accuracy is low because the redundancy between a feature and the feature subset in a high-dimensional data set is difficult to handle.
The present invention proposes a feature selection method for high-dimensional data, including:
obtaining a raw data set to be processed, the raw data set including a feature set, a number of samples and a class set, the class set including the class of each sample;
calculating the maximum information coefficient MIC between each feature in the feature set and the class set, and the redundancy value between each feature and the selected feature subset; and
obtaining the effective value of each feature according to the maximum information coefficient MIC and the redundancy value, and selecting a feature subset from the feature set according to the effective value.
Preferably, the step of calculating the maximum information coefficient MIC between each feature in the feature set and the class set specifically includes:
calculating, by formula (I), the maximum information coefficient MIC between each feature in the feature set and the class set;
wherein B(n) is the number of grid cells, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of features, x is the number of segments into which the n feature values are divided, y is the number of segments into which the n samples are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the samples under an x×y grid partition.
Preferably, the step of obtaining the effective value of each feature according to the maximum information coefficient MIC and the redundancy value specifically includes:
obtaining, by formula (II), the effective value of each feature according to the maximum information coefficient MIC and the redundancy value;
wherein S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote features f_i and f_j respectively, c is the class set, and the remaining term is the redundancy value.
Preferably, before the step of obtaining the effective value of each feature according to the maximum information coefficient MIC and the redundancy value, the method further includes:
defining the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, the step of obtaining the effective value of each feature according to the maximum information coefficient MIC and the redundancy value, and selecting a feature subset from the feature set according to the effective value, specifically includes:
selecting features from the feature set one by one according to the maximum information coefficient MIC, and deleting each chosen feature from the feature set;
obtaining the effective value of the chosen feature according to its maximum information coefficient MIC and redundancy value, and judging whether the effective value is greater than or equal to a preset threshold; if so, adding the feature to the optimal subset.
Preferably, the step of obtaining the effective value of each feature according to the maximum information coefficient MIC and the redundancy value, and selecting a feature subset from the feature set according to the effective value, further includes:
filtering out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and obtaining the effective value of each filtered feature according to formula (II);
judging whether the effective value of each filtered feature is greater than or equal to the preset threshold; if not, deleting the filtered feature from the feature set, and choosing the next feature from the feature set.
The present invention also proposes a feature selection device for high-dimensional data, characterized by including:
an acquisition module, configured to obtain a raw data set to be processed, the raw data set including a feature set, a number of samples and a class set, the class set including the class of each sample;
a processing module, configured to calculate the maximum information coefficient MIC between each feature in the feature set and the class set, and the redundancy value between each feature and the selected feature subset; and
a selection module, configured to obtain the effective value of each feature according to the maximum information coefficient MIC and the redundancy value, and select a feature subset from the feature set according to the effective value.
Preferably, the processing module is specifically configured to calculate, by formula (I), the maximum information coefficient MIC between each feature in the feature set and the class set;
wherein B(n) is the number of grid cells, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of features, x is the number of segments into which the n feature values are divided, y is the number of segments into which the n samples are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the samples under an x×y grid partition.
Preferably, the selection module is specifically configured to obtain, by formula (II), the effective value of each feature according to the maximum information coefficient MIC and the redundancy value;
wherein S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote features f_i and f_j respectively, c is the class set, and the remaining term is the redundancy value.
Preferably, the device further includes a predefinition module;
the predefinition module is configured to define, before the step of obtaining the effective value of each feature according to the maximum information coefficient MIC and the redundancy value, the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, the selection module is further configured to select features from the feature set one by one according to the maximum information coefficient MIC, and delete each chosen feature from the feature set; obtain the effective value of the chosen feature according to its maximum information coefficient MIC and redundancy value, and judge whether the effective value is greater than or equal to a preset threshold; if so, add the feature to the optimal subset.
Preferably, the selection module is further configured to filter out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and obtain the effective value of each filtered feature according to formula (II); judge whether the effective value of each filtered feature is greater than or equal to the preset threshold; if not, delete the filtered feature from the feature set, and choose the next feature from the feature set.
As can be seen from the above technical solutions, the feature selection method for high-dimensional data proposed by the present invention introduces the maximum information coefficient into feature selection and performs feature selection on high-dimensional data on the basis of the maximum information coefficient, thereby overcoming the shortcoming of the prior art that only the dependency and redundancy between two features can be considered, and improving the classification accuracy of the selected features.
Brief description of the drawings
The features and advantages of the present invention will be understood more clearly with reference to the accompanying drawings, which are schematic and should not be construed as limiting the present invention in any way. In the drawings:
Fig. 1 is a schematic flowchart of a feature selection method for high-dimensional data proposed by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a feature selection method for high-dimensional data proposed by another embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a feature selection device for high-dimensional data proposed by an embodiment of the present invention.
Detailed description of the invention
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a schematic flowchart of a feature selection method for high-dimensional data proposed by an embodiment of the present invention. Referring to Fig. 1, the feature selection method for high-dimensional data includes:
110: obtaining a raw data set to be processed, the raw data set including a feature set, a number of samples and a class set, the class set including the class of each sample;
120: calculating the maximum information coefficient MIC between each feature in the feature set and the class set, and the redundancy value between each feature and the selected feature subset;
130: obtaining the effective value of each feature according to the maximum information coefficient MIC and the redundancy value, and selecting a feature subset from the feature set according to the effective value.
The present invention introduces the maximum information coefficient into feature selection and performs feature selection on high-dimensional data on the basis of the maximum information coefficient, thereby solving the problem that feature selection accuracy is low because the redundancy between a feature and the feature subset in a high-dimensional data set is difficult to handle, and improving the classification accuracy of the selected features.
In the present embodiment, the process of calculating the MIC in step 120 specifically includes:
calculating, by formula (I), the maximum information coefficient MIC between each feature in the feature set and the class set;
wherein B(n) is the number of grid cells, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of features, x is the number of segments into which the n feature values are divided, y is the number of segments into which the n samples are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the samples under an x×y grid partition.
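The pairwise MIC values needed in step 120 can be obtained, for example, with the open-source minepy package, which implements the MIC estimator of Reshef et al.; the sketch below assumes minepy and NumPy are installed, and its alpha=0.6 setting corresponds to the grid bound B(n) = n^0.6 mentioned later in the description.

```python
import numpy as np
from minepy import MINE  # assumes the minepy package is installed

def mic(u, v, alpha=0.6, c=15):
    """MIC between two 1-D numeric arrays; alpha=0.6 matches B(n) = n**0.6."""
    mine = MINE(alpha=alpha, c=c)
    mine.compute_score(u, v)
    return mine.mic()

def mic_to_class(X, y):
    """MIC between each feature column of X and the numerically coded class vector y."""
    return np.array([mic(X[:, j], y) for j in range(X.shape[1])])
```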
In the present embodiment, step 130 specifically includes:
obtaining, by formula (II), the effective value of each feature according to the maximum information coefficient MIC and the redundancy value;
wherein S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote features f_i and f_j respectively, c is the class set, and the remaining term is the redundancy value.
In the present embodiment, before step 130, the method further includes:
defining the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, step 130 specifically includes:
selecting features from the feature set one by one according to the maximum information coefficient MIC, and deleting each chosen feature from the feature set;
obtaining the effective value of the chosen feature according to its maximum information coefficient MIC and redundancy value, and judging whether the effective value is greater than or equal to a preset threshold; if so, adding the feature to the optimal subset;
filtering out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and obtaining the effective value of each filtered feature according to formula (II);
judging whether the effective value of each filtered feature is greater than or equal to the preset threshold; if not, deleting the filtered feature from the feature set, and choosing the next feature from the feature set, until the feature set F is empty.
Fig. 2 is a schematic flowchart of a feature selection method for high-dimensional data proposed by another embodiment of the present invention. The principle of the present invention is described in detail below with reference to Fig. 2.
The method includes an initialization phase and a feature deletion phase.
1. The initialization phase includes:
S1: a given data set D has m features and n samples; the feature set it contains is F = {f_1, f_2, ..., f_m}, and the class set c = {c_1, c_2, ..., c_n} includes the class of each sample in the data set. Data preprocessing is performed, the optimal feature subset S is set to be empty, and a parameter θ is set, the parameter θ here being the above-mentioned preset threshold.
2. The feature deletion phase includes the following steps:
S2: the maximum information coefficient between each feature in the feature set and the class set is calculated, and the features are sorted in descending order of the value MIC(c; f_i) between the feature and the class set, where f_i is the i-th feature and i is greater than 0 and less than or equal to m;
S3: the feature set is processed according to the approximate Markov blanket condition and the effective-value evaluation function mMIC proposed by the present invention, irrelevant and redundant features are deleted, and the final feature subset is obtained.
Preferably, step S1 specifically includes:
S11: performing data preprocessing on the data set D to obtain the required file format;
S12: initializing the optimal feature subset S as an empty set, and initializing the parameter θ.
Preferably, step S2 specifically includes:
S21: for any feature f_i in the feature set F, calculating the maximum information coefficient MIC(c; f_i) between this feature and the class set;
S22: sorting the features in descending order of MIC(c; f_i).
Preferably, the approximate Markov blanket condition described in step S3 is defined as follows:
for two features f_i and f_j (i ≠ j, j greater than 0 and less than or equal to m) and class c, the condition for f_i to be an approximate Markov blanket of f_j is:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j).
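Expressed directly in code, the condition only needs the three pairwise MIC values; the small helper below is a sketch, with mic() standing for any MIC estimator such as the minepy-based one sketched above.

```python
def is_approx_markov_blanket(mic_fi_c, mic_fj_c, mic_fi_fj):
    """True when f_i is an approximate Markov blanket of f_j: f_i is more relevant
    to the class than f_j, and f_j depends more strongly on f_i than on the class."""
    return mic_fi_c > mic_fj_c and mic_fj_c < mic_fi_fj
```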
The computing formula of the maximum information coefficient (formula (I)) is as follows:
wherein B(n) is the number of grid cells, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1; in general, the effect is best when B(n) = n^0.6; x and y denote the numbers of segments into which the value ranges of the two variables are divided; and M(D)_{x,y} in the formula denotes the normalized value of the maximum mutual information of the two variables under an x×y grid partition.
The computing formula of M(D)_{x,y} is as follows:
wherein MI*(D, x, y) denotes the maximum mutual information under an x×y grid partition.
The computing formula of MI*(D, x, y) is as follows:
MI*(D, x, y) = max MI(D|G)
wherein D|G denotes that the data set D is partitioned using a grid G (an x×y grid), after which the mutual information under each grid partition is computed. The computing formula of the mutual information in the above expression is as follows:
wherein A = {a_i, i = 1...n} and B = {b_i, i = 1...n}.
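The formulas referred to above appear only as images in the original publication and are not reproduced in this text. A reconstruction that is consistent with the surrounding descriptions and with the standard definition of MIC (Reshef et al., 2011) is given below; the log min{x, y} normalization in the second line is the standard choice and is stated here as an assumption.

```
\mathrm{MIC}(D) = \max_{x \cdot y < B(n)} M(D)_{x,y}

M(D)_{x,y} = \frac{\mathrm{MI}^{*}(D, x, y)}{\log \min\{x, y\}}

\mathrm{MI}^{*}(D, x, y) = \max_{G} \mathrm{MI}(D|_{G})

\mathrm{MI}(A; B) = \sum_{a \in A} \sum_{b \in B} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}
```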
Preferably, the evaluation function mMIC based on the maximum information coefficient in step S3 can measure both the dependency between a feature and the class and the dependency between a feature and the feature subset, and thereby judge the quality of a feature.
The computing formula of the mMIC evaluation function (formula (II)) is as follows:
wherein S_main is the currently selected feature subset and S_residue is the remaining feature subset; for brevity, i and j are used to denote features f_i and f_j respectively. The formula indicates that the quality of a feature f_j selected from the remaining feature subset is determined by the dependency between this feature and the class set and the redundancy between this feature and the currently selected feature subset.
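Formula (II) itself is also an image in the original publication and is not reproduced here. Purely as an assumption consistent with the description above (relevance to the class set penalized by redundancy with the currently selected subset), one plausible mRMR-style form would be:

```
\mathrm{mMIC}(f_j) = \mathrm{MIC}(f_j; c) - \frac{1}{|S_{main}|} \sum_{f_i \in S_{main}} \mathrm{MIC}(f_i; f_j), \qquad f_j \in S_{residue}
```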
Preferably, step S3 includes the following steps:
S31: repeating the operations below until F is an empty set:
a. selecting from the feature set F the feature with the largest value of MIC(c; f_i);
b. deleting the feature f_i from the feature set F; if it is in the redundancy subset S_re, calculating the mMIC value of this feature; if the mMIC value is less than θ, returning to step a; otherwise, adding f_i directly to the optimal subset S and continuing to step c with f_i as the host element;
c. searching the feature set F for all elements for which the host element f_i selected in step a is an approximate Markov blanket, adding each selected feature f_j to S_re and calculating the mMIC values of all selected elements; if the mMIC value of a feature f_j is less than θ, deleting the feature f_j from F;
d. after the above process ends, the output feature subset S is the optimal feature subset.
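Steps S2 and S31 a-d can be assembled into a single routine as follows. This is a sketch rather than the patent's reference implementation: mic() stands for any pairwise MIC estimator (for example the minepy-based helper sketched earlier), and the mMIC computation uses the assumed relevance-minus-mean-redundancy form discussed under formula (II).

```python
import numpy as np

def mmic(j, S_main, X, y, mic):
    """Assumed effective value of feature j: MIC to the class minus the mean MIC
    to the features already selected in S_main (formula (II) is an image in the
    original publication, so this exact form is an assumption)."""
    relevance = mic(X[:, j], y)
    if not S_main:
        return relevance
    redundancy = float(np.mean([mic(X[:, i], X[:, j]) for i in S_main]))
    return relevance - redundancy

def feature_selection(X, y, theta, mic):
    """Steps S2 and S31 a-d: rank features by MIC to the class, then keep or
    discard them using the threshold theta and the approximate Markov blanket
    condition."""
    F = list(range(X.shape[1]))                 # remaining feature set
    S, S_re = [], set()                         # optimal subset, redundancy subset
    mic_c = {j: mic(X[:, j], y) for j in F}     # step S2: MIC(c; f_j) per feature
    while F:
        # a. pick the remaining feature with the largest MIC to the class
        fi = max(F, key=lambda j: mic_c[j])
        # b. remove it from F; if it was marked redundant and its mMIC is below
        #    theta, skip it, otherwise add it to the optimal subset S
        F.remove(fi)
        if fi in S_re and mmic(fi, S, X, y, mic) < theta:
            continue
        S.append(fi)
        # c. with fi as host element, screen the remaining features for which fi
        #    is an approximate Markov blanket; drop those whose mMIC is below theta
        for fj in list(F):
            if mic_c[fi] > mic_c[fj] and mic_c[fj] < mic(X[:, fi], X[:, fj]):
                S_re.add(fj)
                if mmic(fj, S, X, y, mic) < theta:
                    F.remove(fj)
    # d. S is the output (optimal) feature subset
    return S
```

With the minepy-based mic() helper, a call such as feature_selection(X, y, theta=0.2, mic=mic) would return the indices of the retained columns; the threshold theta plays the role of the parameter θ set during initialization.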
In summary, by incorporating mMIC into the approximate Markov blanket model, the present invention enables the approximate Markov blanket condition to weigh both the dependency between a single feature and the class and the redundancy between this feature and the feature subset when deciding whether a feature is kept or discarded. This guarantees both the efficiency of feature selection under the approximate Markov blanket condition and the accuracy of the selected features.
Fig. 3 is a schematic structural diagram of a feature selection device for high-dimensional data proposed by an embodiment of the present invention. Referring to Fig. 3, the device includes:
an acquisition module 310, configured to obtain a raw data set to be processed, the raw data set including a feature set, a number of samples and a class set, the class set including the class of each sample;
a processing module 320, configured to calculate the maximum information coefficient MIC between each feature in the feature set and the class set, and the redundancy value between each feature and the selected feature subset; and
a selection module 330, configured to obtain the effective value of each feature according to the maximum information coefficient MIC and the redundancy value, and select a feature subset from the feature set according to the effective value.
The present invention introduces the maximum information coefficient into feature selection and performs feature selection on high-dimensional data on the basis of the maximum information coefficient, thereby overcoming the shortcoming of the prior art that only the dependency and redundancy between two features can be considered, and improving the classification accuracy of the selected features.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple; for relevant parts, reference may be made to the description of the method embodiments.
In a possible embodiment, the processing module 320 is specifically configured to calculate, by formula (I), the maximum information coefficient MIC between each feature in the feature set and the class set;
wherein B(n) is the number of grid cells, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of features, x is the number of segments into which the n feature values are divided, y is the number of segments into which the n samples are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the samples under an x×y grid partition.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple; for relevant parts, reference may be made to the description of the method embodiments.
In a possible embodiment, the selection module 330 is specifically configured to obtain, by formula (II), the effective value of each feature according to the maximum information coefficient MIC and the redundancy value;
wherein S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote features f_i and f_j respectively, c is the class set, and the remaining term is the redundancy value.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple; for relevant parts, reference may be made to the description of the method embodiments.
In a possible embodiment, the device further includes a predefinition module 340;
the predefinition module 340 is configured to define, before the effective value of each feature is obtained according to the maximum information coefficient MIC and the redundancy value, the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, the selection module 330 is further configured to select features from the feature set one by one according to the maximum information coefficient MIC, and delete each chosen feature from the feature set; obtain the effective value of the chosen feature according to its maximum information coefficient MIC and redundancy value, and judge whether the effective value is greater than or equal to a preset threshold; if so, add the feature to the optimal subset; filter out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and obtain the effective value of each filtered feature according to formula (II); judge whether the effective value of each filtered feature is greater than or equal to the preset threshold; if not, delete the filtered feature from the feature set, and choose the next feature from the feature set, until the feature set is empty.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple; for relevant parts, reference may be made to the description of the method embodiments.
In a possible embodiment, the selection module 330 is further configured to choose the next feature from the feature set when the effective value of a feature is less than the preset threshold.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple; for relevant parts, reference may be made to the description of the method embodiments.
It should be noted that the components of the device of the present invention are logically divided according to the functions to be implemented; however, the present invention is not limited thereto, and the components may be re-divided or combined as required.
The component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. In this device, a PC remotely controls the equipment or device via the Internet and precisely controls each operating step of the equipment or device. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described here. Such a program implementing the present invention may be stored on a computer-readable medium, and the files or documents produced by the program can statistically generate data reports and CPK reports, and power amplifiers can be batch-tested and counted. It should be noted that the above embodiments describe rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claims. The word "comprises" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words may be interpreted as names.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the present invention, and such modifications and variations all fall within the scope defined by the appended claims.
Claims (10)
1. the feature selection approach of a high dimensional data, it is characterised in that including:
Obtaining pending raw data set, described raw data set includes feature set, some samples and classification collection, described class
Ji not include the classification of each sample;
Calculate the maximum information coefficient MIC between each feature and classification collection in the described feature set of acquisition, and each is special
The redundancy value levied and select character subset;
According to described maximum information coefficient MIC and described redundancy value, obtain the virtual value of each feature, and according to described effectively
Value selects character subset from feature set.
2. The method according to claim 1, characterized in that the step of calculating the maximum information coefficient MIC between each feature in the feature set and the class set specifically comprises:
calculating, by formula (I), the maximum information coefficient MIC between each feature in the feature set and the class set;
wherein B(n) is the number of grid cells, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of features, x is the number of segments into which the n feature values are divided, y is the number of segments into which the n samples are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the samples under an x×y grid partition.
3. The method according to claim 1, characterized in that the step of obtaining the effective value of each feature according to the maximum information coefficient MIC and the redundancy value specifically comprises:
obtaining, by formula (II), the effective value of each feature according to the maximum information coefficient MIC and the redundancy value;
wherein S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote features f_i and f_j respectively, c is the class set, and the remaining term is the redundancy value.
4. The method according to claim 3, characterized in that before the step of obtaining the effective value of each feature according to the maximum information coefficient MIC and the redundancy value, the method further comprises:
defining the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
correspondingly, the step of obtaining the effective value of each feature according to the maximum information coefficient MIC and the redundancy value, and selecting a feature subset from the feature set according to the effective value, specifically comprises:
selecting features from the feature set one by one according to the maximum information coefficient MIC, and deleting each chosen feature from the feature set;
obtaining the effective value of the chosen feature according to its maximum information coefficient MIC and redundancy value, and judging whether the effective value is greater than or equal to a preset threshold; if so, adding the feature to the optimal subset.
5. The method according to claim 4, characterized in that the step of obtaining the effective value of each feature according to the maximum information coefficient MIC and the redundancy value, and selecting a feature subset from the feature set according to the effective value, further comprises:
filtering out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and obtaining the effective value of each filtered feature according to formula (II);
judging whether the effective value of each filtered feature is greater than or equal to the preset threshold; if not, deleting the filtered feature from the feature set, and choosing the next feature from the feature set.
6. the feature selection device of a high dimensional data, it is characterised in that including:
Acquisition module, for obtaining pending raw data set, described raw data set include feature set, some samples and
Classification collection, described classification collection includes the classification of each sample;
Processing module, for calculating the maximum information coefficient MIC obtaining in described feature set between each feature and classification collection,
And each feature and the redundancy value having selected character subset;
Select module, for according to described maximum information coefficient MIC and described redundancy value, obtain the virtual value of each feature,
And from feature set, select character subset according to described virtual value.
7. The device according to claim 6, characterized in that the processing module is specifically configured to calculate, by formula (I), the maximum information coefficient MIC between each feature in the feature set and the class set;
wherein B(n) is the number of grid cells, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of features, x is the number of segments into which the n feature values are divided, y is the number of segments into which the n samples are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the samples under an x×y grid partition.
8. The device according to claim 6, characterized in that the selection module is specifically configured to obtain, by formula (II), the effective value of each feature according to the maximum information coefficient MIC and the redundancy value;
wherein S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote features f_i and f_j respectively, c is the class set, and the remaining term is the redundancy value.
9. The device according to claim 8, characterized in that the device further comprises a predefinition module;
the predefinition module is configured to define, before the step of obtaining the effective value of each feature according to the maximum information coefficient MIC and the redundancy value, the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
correspondingly, the selection module is further configured to select features from the feature set one by one according to the maximum information coefficient MIC, and delete each chosen feature from the feature set; obtain the effective value of the chosen feature according to its maximum information coefficient MIC and redundancy value, and judge whether the effective value is greater than or equal to a preset threshold; if so, add the feature to the optimal subset.
10. The device according to claim 9, characterized in that the selection module is further configured to filter out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and obtain the effective value of each filtered feature according to formula (II); judge whether the effective value of each filtered feature is greater than or equal to the preset threshold; if not, delete the filtered feature from the feature set, and choose the next feature from the feature set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610298079.XA CN105975589A (en) | 2016-05-06 | 2016-05-06 | Feature selection method and device of high-dimension data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610298079.XA CN105975589A (en) | 2016-05-06 | 2016-05-06 | Feature selection method and device of high-dimension data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105975589A true CN105975589A (en) | 2016-09-28 |
Family
ID=56991294
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610298079.XA Pending CN105975589A (en) | 2016-05-06 | 2016-05-06 | Feature selection method and device of high-dimension data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105975589A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106570178A (en) * | 2016-11-10 | 2017-04-19 | 重庆邮电大学 | High-dimension text data characteristic selection method based on graph clustering |
CN107478963A (en) * | 2017-09-30 | 2017-12-15 | 山东海兴电力科技有限公司 | Single-phase ground fault line selecting method of small-electric current grounding system based on power network big data |
CN109101626A (en) * | 2018-08-13 | 2018-12-28 | 武汉科技大学 | Based on the high dimensional data critical characteristic extraction method for improving minimum spanning tree |
CN109443766A (en) * | 2018-09-10 | 2019-03-08 | 中国人民解放军火箭军工程大学 | A kind of heavy-duty vehicle gearbox gear Safety Analysis Method |
CN110334546A (en) * | 2019-07-08 | 2019-10-15 | 辽宁工业大学 | Difference privacy high dimensional data based on principal component analysis optimization issues guard method |
CN110647943A (en) * | 2019-09-26 | 2020-01-03 | 西北工业大学 | Cutting tool wear monitoring method based on evolutionary data cluster analysis |
CN111016914A (en) * | 2019-11-22 | 2020-04-17 | 华东交通大学 | Dangerous driving scene identification system based on portable terminal information and identification method thereof |
CN111312403A (en) * | 2020-01-21 | 2020-06-19 | 山东师范大学 | Disease prediction system, device and medium based on instance and feature sharing cascade |
CN112465251A (en) * | 2020-12-08 | 2021-03-09 | 上海电力大学 | Short-term photovoltaic output probability prediction method based on simplest gated neural network |
CN115729957A (en) * | 2022-11-28 | 2023-03-03 | 安徽大学 | Unknown stream feature selection method and device based on maximum information coefficient |
CN116305292A (en) * | 2023-05-17 | 2023-06-23 | 中国电子科技集团公司第十五研究所 | Government affair data release method and system based on differential privacy protection |
- 2016-05-06 CN CN201610298079.XA patent/CN105975589A/en active Pending
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106570178B (en) * | 2016-11-10 | 2020-09-29 | 重庆邮电大学 | High-dimensional text data feature selection method based on graph clustering |
CN106570178A (en) * | 2016-11-10 | 2017-04-19 | 重庆邮电大学 | High-dimension text data characteristic selection method based on graph clustering |
CN107478963A (en) * | 2017-09-30 | 2017-12-15 | 山东海兴电力科技有限公司 | Single-phase ground fault line selecting method of small-electric current grounding system based on power network big data |
CN109101626A (en) * | 2018-08-13 | 2018-12-28 | 武汉科技大学 | Based on the high dimensional data critical characteristic extraction method for improving minimum spanning tree |
CN109443766A (en) * | 2018-09-10 | 2019-03-08 | 中国人民解放军火箭军工程大学 | A kind of heavy-duty vehicle gearbox gear Safety Analysis Method |
CN110334546A (en) * | 2019-07-08 | 2019-10-15 | 辽宁工业大学 | Difference privacy high dimensional data based on principal component analysis optimization issues guard method |
CN110334546B (en) * | 2019-07-08 | 2021-11-23 | 辽宁工业大学 | Difference privacy high-dimensional data release protection method based on principal component analysis optimization |
CN110647943A (en) * | 2019-09-26 | 2020-01-03 | 西北工业大学 | Cutting tool wear monitoring method based on evolutionary data cluster analysis |
CN111016914A (en) * | 2019-11-22 | 2020-04-17 | 华东交通大学 | Dangerous driving scene identification system based on portable terminal information and identification method thereof |
CN111016914B (en) * | 2019-11-22 | 2021-04-06 | 华东交通大学 | Dangerous driving scene identification system based on portable terminal information and identification method thereof |
CN111312403A (en) * | 2020-01-21 | 2020-06-19 | 山东师范大学 | Disease prediction system, device and medium based on instance and feature sharing cascade |
CN112465251A (en) * | 2020-12-08 | 2021-03-09 | 上海电力大学 | Short-term photovoltaic output probability prediction method based on simplest gated neural network |
CN115729957A (en) * | 2022-11-28 | 2023-03-03 | 安徽大学 | Unknown stream feature selection method and device based on maximum information coefficient |
CN115729957B (en) * | 2022-11-28 | 2024-01-19 | 安徽大学 | Unknown stream feature selection method and device based on maximum information coefficient |
CN116305292A (en) * | 2023-05-17 | 2023-06-23 | 中国电子科技集团公司第十五研究所 | Government affair data release method and system based on differential privacy protection |
CN116305292B (en) * | 2023-05-17 | 2023-08-08 | 中国电子科技集团公司第十五研究所 | Government affair data release method and system based on differential privacy protection |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105975589A (en) | Feature selection method and device of high-dimension data | |
CN110879351B (en) | Fault diagnosis method for non-linear analog circuit based on RCCA-SVM | |
CN112612664B (en) | Electronic equipment testing method and device, electronic equipment and storage medium | |
AU2297902A (en) | Multivariate responses using classification and regression trees systems and methods | |
Rakitzis et al. | CUSUM control charts for the monitoring of zero‐inflated binomial processes | |
CN111046043A (en) | Method for quickly and accurately checking database table | |
CN113465734B (en) | Real-time estimation method for structural vibration | |
CN107577738A (en) | A kind of FMECA method by SVM text mining processing datas | |
CN115373976A (en) | Insurance testing method and device, computer equipment and storage medium | |
DE102019214759A1 (en) | Provision of compensation parameters for integrated sensor circuits | |
Movaffagh et al. | Monotonic change point estimation in the mean vector of a multivariate normal process | |
US11496379B1 (en) | Network traffic analysis method and device based on multi-source network traffic data | |
CN109940462A (en) | The detection of milling cutter cutting vibration variation characteristic and Gaussian process model building method | |
Xia et al. | A study on the significance of software metrics in defect prediction | |
CN114398228A (en) | Method and device for predicting equipment resource use condition and electronic equipment | |
CN108062325A (en) | Comparative approach and comparison system | |
CN107145627A (en) | A kind of method for building belt restraining least square maximum entropy tantile function model | |
CN111400644B (en) | Calculation processing method for laboratory analysis sample | |
CN114385436A (en) | Server grouping method and device, electronic equipment and storage medium | |
CN107016073A (en) | A kind of text classification feature selection approach | |
CN109357957B (en) | Fatigue monitoring counting method based on extreme value window | |
CN113515560A (en) | Vehicle fault analysis method and device, electronic equipment and storage medium | |
CN112329108A (en) | Optimized anti-floating checking calculation method and system for subway station | |
Suprayitno | Searching the correct and appropriate deterrence function general formula for calculating gravity trip distribution model | |
TWI530809B (en) | Quality management system and method thereof |
Legal Events
Date | Code | Title | Description
---|---|---|---
| C06 | Publication |
| PB01 | Publication |
| C10 | Entry into substantive examination |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20160928