CN105975589A - Feature selection method and device of high-dimension data - Google Patents

Feature selection method and device of high-dimension data

Info

Publication number
CN105975589A
CN105975589A (application CN201610298079.XA)
Authority
CN
China
Prior art keywords
feature
value
mic
maximum information
information coefficient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610298079.XA
Other languages
Chinese (zh)
Inventor
孙广路
宋智超
陈腾
何勇军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN201610298079.XA
Publication of CN105975589A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465: Query processing support for facilitating data mining operations in structured databases
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/283: Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G06F16/284: Relational databases
    • G06F16/285: Clustering or classification

Abstract

The invention discloses a feature selection method and device for high-dimensional data. The method comprises the following steps: obtaining an original data set to be processed, wherein the original data set comprises a feature set, a plurality of samples and a category set, and the category set comprises the category of each sample; calculating the maximum information coefficient (MIC) between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset; and obtaining the effective value of each feature from the MIC and the redundancy value, and selecting the feature subset from the feature set according to the effective value. The MIC is introduced into feature selection, and each feature is evaluated on the basis of the MIC so that features are selected according to the effective value produced by the evaluation. Compared with the prior art, the method can effectively improve the accuracy of feature selection for high-dimensional data.

Description

Feature selection method and device for high-dimensional data
Technical field
The present invention relates to the field of data mining technology, and in particular to a feature selection method and device for high-dimensional data.
Background art
The rapidly developing information society produces massive amounts of data every day, and quickly mining useful information from these data has become an urgent problem. Researchers have addressed this problem from the perspective of machine learning models and have achieved remarkable breakthroughs. However, highly complex models and high-dimensional feature spaces are increasingly unable to meet the pressing demands of big data applications, and feature spaces often contain a large amount of useless information. Only with a suitable feature selection method can effective features be obtained from massive data, thereby improving the efficiency and accuracy of machine learning models in processing the data; feature selection can also prevent model over-fitting and remove noise. Accordingly, as an important preprocessing step in machine learning and data mining, feature selection has always been a research hotspot in the field of machine learning.
In feature selection, the choice of the evaluation measure and of the search algorithm is crucial. Commonly used measures are based on distance, information theory and consistency. Distance-based measures such as the Pearson coefficient can only capture linear relationships between variables, whereas measures such as information gain and mutual information can also capture non-linear relationships. Generating a feature subset usually requires a corresponding search algorithm, and many search strategies based on the approximate Markov blanket condition perform quite well in terms of computational complexity and the classification accuracy of the selected features. However, they also have an obvious shortcoming: they cannot take into account the redundancy between a feature and the feature subset.
Summary of the invention
In view of the defects in the prior art, the present invention provides a feature selection method and device for high-dimensional data. Whereas the measures in current techniques can only quantify limited linear and non-linear relationships between variables, the invention introduces the MIC into feature selection: the MIC can measure a much wider range of linear and non-linear relationships between variables, and can even measure non-functional relationships that cannot be expressed by a single function. Although the MIC is very effective for measuring variables, it can only measure the correlation and redundancy between individual variables. A new measure, mMIC (the effective value), is therefore proposed herein and applied to the Markov blanket condition, in order to solve the problem in the prior art that feature selection accuracy is low because the redundancy between a feature and the feature subset in a high-dimensional data set is difficult to handle.
The present invention proposes a feature selection method for high-dimensional data, comprising:
obtaining an original data set to be processed, the original data set comprising a feature set, a number of samples and a category set, the category set comprising the category of each sample;
calculating the maximum information coefficient MIC between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset;
obtaining the effective value of each feature from the MIC and the redundancy value, and selecting the feature subset from the feature set according to the effective value.
Preferably, the step of calculating the maximum information coefficient MIC between each feature in the feature set and the category set specifically comprises:
calculating, by formula (1), the maximum information coefficient MIC between each feature in the feature set and the category set:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }    (1)
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of samples, x is the number of segments into which the feature values are divided, y is the number of segments into which the sample categories are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the categories under an x*y grid partition.
Preferably, the step of obtaining the effective value of each feature from the MIC and the redundancy value specifically comprises:
obtaining, by formula (2), the effective value of each feature from the MIC and the redundancy value:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue    (2)
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote the features f_i and f_j respectively, c is the category set, and (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i) is the redundancy value.
Preferably, before the step of obtaining the effective value of each feature from the MIC and the redundancy value, the method further comprises:
defining the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, the step of obtaining the effective value of each feature from the MIC and the redundancy value and selecting the feature subset from the feature set according to the effective value specifically comprises:
choosing features from the feature set one by one in descending order of the MIC, and deleting each chosen feature from the feature set;
obtaining the effective value of the chosen feature from its maximum information coefficient MIC and redundancy value, and judging whether the effective value is greater than or equal to a preset threshold; if so, adding the feature to the optimal subset.
Preferably, the step of obtaining the effective value of each feature from the MIC and the redundancy value and selecting the feature subset from the feature set according to the effective value further comprises:
screening out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and obtaining the effective value of each screened-out feature according to formula (2);
judging whether the effective value of each screened-out feature is greater than or equal to the preset threshold; if not, deleting the screened-out feature from the feature set, and choosing the next feature from the feature set.
The present invention also proposes a feature selection device for high-dimensional data, characterized by comprising:
an acquisition module, configured to obtain an original data set to be processed, the original data set comprising a feature set, a number of samples and a category set, the category set comprising the category of each sample;
a processing module, configured to calculate the maximum information coefficient MIC between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset;
a selection module, configured to obtain the effective value of each feature from the MIC and the redundancy value, and to select the feature subset from the feature set according to the effective value.
Preferably, the processing module is specifically configured to calculate, by formula (1), the maximum information coefficient MIC between each feature in the feature set and the category set:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }    (1)
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of samples, x is the number of segments into which the feature values are divided, y is the number of segments into which the sample categories are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the categories under an x*y grid partition.
Preferably, the selection module is specifically configured to obtain, by formula (2), the effective value of each feature from the MIC and the redundancy value:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue    (2)
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote the features f_i and f_j respectively, c is the category set, and (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i) is the redundancy value.
Preferably, the device further comprises a predefinition module;
the predefinition module is configured to define, before the effective value of each feature is obtained from the MIC and the redundancy value, the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, the selection module is further configured to choose features from the feature set one by one in descending order of the MIC and to delete each chosen feature from the feature set; to obtain the effective value of the chosen feature from its maximum information coefficient MIC and redundancy value; and to judge whether the effective value is greater than or equal to the preset threshold, and if so, to add the feature to the optimal subset.
Preferably, the selection module is further configured to screen out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and to obtain the effective value of each screened-out feature according to formula (2); and to judge whether the effective value of each screened-out feature is greater than or equal to the preset threshold, and if not, to delete the screened-out feature from the feature set and choose the next feature from the feature set.
As can be seen from the above technical solutions, the feature selection method for high-dimensional data proposed by the present invention introduces the maximum information coefficient into feature selection and performs feature selection on high-dimensional data on that basis, thereby overcoming the shortcoming of the prior art that only the correlation and redundancy between two features can be considered, and improving the classification accuracy of the selected features.
Brief description of the drawings
The features and advantages of the present invention can be understood more clearly with reference to the accompanying drawings, which are schematic and should not be construed as limiting the present invention in any way. In the drawings:
Fig. 1 is a schematic flow chart of a feature selection method for high-dimensional data proposed by an embodiment of the present invention;
Fig. 2 is a schematic flow chart of a feature selection method for high-dimensional data proposed by another embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a feature selection device for high-dimensional data proposed by an embodiment of the present invention.
Detailed description of the invention
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a feature selection method for high-dimensional data proposed by an embodiment of the present invention. Referring to Fig. 1, the feature selection method for high-dimensional data comprises:
110, obtaining an original data set to be processed, the original data set comprising a feature set, a number of samples and a category set, the category set comprising the category of each sample;
120, calculating the maximum information coefficient MIC between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset;
130, obtaining the effective value of each feature from the MIC and the redundancy value, and selecting the feature subset from the feature set according to the effective value.
The present invention introduces the maximum information coefficient into feature selection and performs feature selection on high-dimensional data on that basis, which solves the problem that feature selection accuracy is low because the redundancy between a feature and the feature subset in a high-dimensional data set is difficult to handle, and improves the classification accuracy of the selected features.
In this embodiment, the process of calculating the MIC in step 120 specifically comprises:
calculating, by formula (1), the maximum information coefficient MIC between each feature in the feature set and the category set:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }    (1)
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of samples, x is the number of segments into which the feature values are divided, y is the number of segments into which the sample categories are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the categories under an x*y grid partition.
In this embodiment, step 130 specifically comprises:
obtaining, by formula (2), the effective value of each feature from the MIC and the redundancy value:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue    (2)
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote the features f_i and f_j respectively, c is the category set, and (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i) is the redundancy value.
In this embodiment, before step 130, the method further comprises:
defining the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, step 130 specifically comprises:
choosing features from the feature set one by one in descending order of the MIC, and deleting each chosen feature from the feature set;
obtaining the effective value of the chosen feature from its maximum information coefficient MIC and redundancy value, and judging whether the effective value is greater than or equal to a preset threshold; if so, adding the feature to the optimal subset;
screening out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and obtaining the effective value of each screened-out feature according to formula (2);
judging whether the effective value of each screened-out feature is greater than or equal to the preset threshold; if not, deleting the screened-out feature from the feature set, and choosing the next feature from the feature set, until the feature set F is empty.
Fig. 2 is a schematic flow chart of a feature selection method for high-dimensional data proposed by another embodiment of the present invention. The principle of the present invention is described in detail below with reference to Fig. 2:
The method comprises an initialization phase and a feature deletion phase.
1) The initialization phase comprises:
S1, a given data set D has m features and n samples; the feature set it comprises is F = {f_1, f_2, ..., f_m}, and the category set c = {c_1, c_2, ..., c_n} contains the category of each sample in the data set. Perform data preprocessing, set the optimal feature subset S to the empty set, and set the parameter θ, where θ is the aforementioned preset threshold;
2) The feature deletion phase comprises the steps of:
S2, calculating the maximum information coefficient between each feature in the feature set and the category set, and sorting the features in descending order of MIC(c; f_i), where f_i is the i-th feature and i is greater than 0 and less than or equal to m;
S3, processing the feature set according to the approximate Markov blanket condition and the effective-value evaluation function mMIC proposed by the present invention, and deleting irrelevant and redundant features to obtain the final feature subset.
Preferably, step S1 specifically comprises:
S11, performing data preprocessing on the data set D to obtain the required file format;
S12, initializing the optimal feature subset S to the empty set, and initializing the parameter θ.
Preferably, step S2 specifically comprises:
S21, for any feature f_i in the feature set F, calculating the maximum information coefficient MIC(c; f_i) between the feature and the category set;
S22, sorting the features in descending order of MIC(c; f_i).
Preferably, the approximate Markov blanket condition in step S3 is defined as follows:
for two features f_i and f_j (i ≠ j, j greater than 0 and less than or equal to m) and the category c, f_i is an approximate Markov blanket of f_j if:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j).
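For illustration, the condition can be expressed as a small predicate. A minimal Python sketch, assuming the three MIC values have already been computed (the function and argument names are illustrative, not from the patent):

```python
def is_approx_markov_blanket(mic_fi_c, mic_fj_c, mic_fi_fj):
    """Return True if f_i is an approximate Markov blanket of f_j.

    mic_fi_c:  MIC(f_i, c), correlation of f_i with the category c
    mic_fj_c:  MIC(f_j, c), correlation of f_j with the category c
    mic_fi_fj: MIC(f_i, f_j), correlation between the two features
    """
    return mic_fi_c > mic_fj_c and mic_fj_c < mic_fi_fj
```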
The maximum information coefficient is computed as follows:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1. In general, B(n) = n^0.6 gives the best results. x and y denote the numbers of segments into which the value ranges of the two variables are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the two variables under an x*y grid partition.
M(D)_{x,y} is computed as follows:
M(D)_{x,y} = MI*(D, x, y) / log min{x, y}
where MI*(D, x, y) denotes the maximum mutual information under an x*y grid partition, computed as follows:
MI*(D, x, y) = max MI(D|G)
where D|G denotes the data set D partitioned by an x*y grid G, over which the mutual information of each grid partition is then computed. The mutual information in the formula is computed as follows:
MI(A, B) = Σ_{a,b} p(a, b) log( p(a, b) / (p(a) p(b)) )
where A = {a_i, i = 1...n} and B = {b_i, i = 1...n}.
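To make the chain MIC(D) -> M(D)_{x,y} -> MI* concrete, the following Python sketch computes a simplified MIC for two 1-D arrays. It is illustrative only: it uses equal-width bins for every x*y grid, whereas the true MIC also optimizes the placement of the grid boundaries; the function name mic and the parameter eps are assumptions, with B(n) = n^0.6 corresponding to eps = 0.4 as suggested above.

```python
import numpy as np

def mic(a, b, eps=0.4):
    """Simplified maximum information coefficient of two 1-D arrays.

    Searches all equal-width x*y grids with x*y <= B(n) = n**(1 - eps)
    and returns the largest normalized mutual information found.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    B = int(n ** (1 - eps))                      # grid-size budget B(n)
    best = 0.0
    for x in range(2, B + 1):
        for y in range(2, B // x + 1):           # keeps x * y <= B(n)
            # joint distribution of (a, b) under an x*y equal-width grid
            p_ab, _, _ = np.histogram2d(a, b, bins=(x, y))
            p_ab /= n
            p_a = p_ab.sum(axis=1, keepdims=True)    # marginal of a
            p_b = p_ab.sum(axis=0, keepdims=True)    # marginal of b
            nz = p_ab > 0                            # avoid log(0)
            mi = np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz]))
            best = max(best, mi / np.log(min(x, y))) # M(D)_{x,y}
    return best
```

For example, mic(np.arange(100), 2 * np.arange(100)) evaluates to 1.0 (up to floating point), the maximum, since the relationship is exactly linear.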
Preferably, the evaluation function mMIC in step S3, which is based on the maximum information coefficient, can measure both the correlation between a feature and the categories and the correlation between a feature and the feature subset, and thereby judge the quality of a feature.
The mMIC evaluation function is computed as follows:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue
where S_main is the currently selected feature subset and S_residue is the remaining feature subset; for simplicity of notation, i and j denote the features f_i and f_j respectively. The formula expresses that the quality of a feature f_j chosen from the remaining feature subset is determined jointly by its correlation with the category set and by its redundancy with the currently selected feature subset.
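In code, the effective value is the relevance term minus the average pairwise MIC with the already-selected subset. A minimal sketch, assuming the needed MIC values are available (the names m_mic, mic_with_class and mic_pair are illustrative):

```python
def m_mic(j, S_main, mic_with_class, mic_pair):
    """Effective value mMIC(j) of a candidate feature j from S_residue.

    S_main:         indices of the currently selected features
    mic_with_class: mapping j -> MIC(f_j, c)
    mic_pair:       function (i, j) -> MIC(f_i, f_j)
    """
    relevance = mic_with_class[j]
    if not S_main:                 # empty subset: no redundancy term yet
        return relevance
    redundancy = sum(mic_pair(i, j) for i in S_main) / len(S_main)
    return relevance - redundancy
```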
Preferably, step S3 comprises the following steps (a sketch of the whole loop is given after this list):
S31, repeating the operations below until F is the empty set:
a. selecting from the feature set F the feature with the largest MIC(c; f_i) value;
b. deleting the feature f_i from the feature set F; if it is in the redundant subset S_re, computing the mMIC value of the feature, and if the mMIC value is less than θ, returning to step a; otherwise adding f_i directly to the optimal subset S and continuing to step c with f_i as the host element;
c. searching the feature set F for all elements for which the host element f_i selected in step a is an approximate Markov blanket, adding each selected feature f_j to S_re and computing the mMIC values of all the selected elements; if the mMIC value of a feature f_j is less than θ, deleting f_j from F;
d. when the above process ends, the output feature subset S is the optimal feature subset.
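Read as an algorithm, steps a to d amount to the loop sketched below. This is one possible reading under stated assumptions, not the patent's reference implementation; mic_fn stands for any routine returning the MIC of two arrays, such as the mic() sketch above.

```python
def select_features(X, c, theta, mic_fn):
    """Sketch of the S31 loop; returns the indices of the selected subset S.

    X:      list of 1-D feature arrays (the feature set F)
    c:      1-D array of category labels
    theta:  threshold for the effective value mMIC
    mic_fn: function (a, b) -> MIC value
    """
    F = set(range(len(X)))
    S, S_re = [], set()                        # optimal and redundant subsets
    mic_c = {i: mic_fn(X[i], c) for i in F}    # MIC(c; f_i) for every feature

    def m_mic(j):                              # effective value of feature j
        if not S:
            return mic_c[j]
        return mic_c[j] - sum(mic_fn(X[i], X[j]) for i in S) / len(S)

    while F:
        i = max(F, key=lambda k: mic_c[k])     # a. largest MIC(c; f_i)
        F.remove(i)                            # b. delete it from F
        if i in S_re and m_mic(i) < theta:
            continue                           #    redundant and weak: back to a
        S.append(i)                            #    keep f_i as host element
        for j in list(F):                      # c. elements blanketed by f_i
            if mic_c[i] > mic_c[j] and mic_c[j] < mic_fn(X[i], X[j]):
                S_re.add(j)
                if m_mic(j) < theta:
                    F.remove(j)                #    weak redundant feature: delete
    return S                                   # d. S is the optimal feature subset
```

Note that in this reading a feature enters S_re only once a stronger host element blankets it, so a feature that is never blanketed is kept on its relevance alone, which matches step b.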
In summary, by incorporating mMIC into the approximate Markov blanket model, the present invention enables the approximate Markov blanket condition to weigh both the strength of the correlation between an individual feature and the category and the redundancy between that feature and the feature subset, and thereby to determine whether a feature is kept or discarded. This guarantees both the efficiency of feature selection under the approximate Markov blanket condition and the accuracy of the selected features.
Fig. 3 is a schematic structural diagram of a feature selection device for high-dimensional data proposed by an embodiment of the present invention. Referring to Fig. 3, the device comprises:
an acquisition module 310, configured to obtain an original data set to be processed, the original data set comprising a feature set, a number of samples and a category set, the category set comprising the category of each sample;
a processing module 320, configured to calculate the maximum information coefficient MIC between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset;
a selection module 330, configured to obtain the effective value of each feature from the MIC and the redundancy value, and to select the feature subset from the feature set according to the effective value.
The present invention introduces the maximum information coefficient into feature selection and performs feature selection on high-dimensional data on that basis, which overcomes the shortcoming of the prior art that only the correlation and redundancy between two features can be considered, and improves the classification accuracy of the selected features.
As the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments.
In a possible embodiment, the processing module 320 is specifically configured to calculate, by formula (1), the maximum information coefficient MIC between each feature in the feature set and the category set:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }    (1)
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of samples, x is the number of segments into which the feature values are divided, y is the number of segments into which the sample categories are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the categories under an x*y grid partition.
As the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments.
In a possible embodiment, the selection module 330 is specifically configured to obtain, by formula (2), the effective value of each feature from the MIC and the redundancy value:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue    (2)
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote the features f_i and f_j respectively, c is the category set, and (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i) is the redundancy value.
As the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments.
In a possible embodiment, the device further comprises a predefinition module 340;
the predefinition module 340 is configured to define, before the effective value of each feature is obtained from the MIC and the redundancy value, the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
Correspondingly, the selection module 330 is further configured to choose features from the feature set one by one in descending order of the MIC and to delete each chosen feature from the feature set; to obtain the effective value of the chosen feature from its maximum information coefficient MIC and redundancy value, and to judge whether the effective value is greater than or equal to the preset threshold, and if so, to add the feature to the optimal subset; to screen out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and to obtain the effective value of each screened-out feature according to formula (2); and to judge whether the effective value of each screened-out feature is greater than or equal to the preset threshold, and if not, to delete the screened-out feature from the feature set and choose the next feature from the feature set, until the feature set is empty.
As the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments.
In a possible embodiment, the selection module 330 is further configured to choose the next feature from the feature set when the effective value of the current feature is less than the preset threshold.
As the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments.
It should be noted that the components of the device of the present invention are logically divided according to the functions to be realized; however, the present invention is not limited thereto, and the components may be re-divided or combined as required.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. In the device, a PC may remotely control the equipment or device via the Internet and accurately control each operation step of the equipment or device. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, and the files or documents generated by the program may be statistically analysed to produce data reports, cpk reports and the like, allowing batch testing and statistics of power amplifiers. It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprises" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a claim enumerating several units of a device, several of these units may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words may be interpreted as names.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the present invention, and such modifications and variations all fall within the scope defined by the appended claims.

Claims (10)

1. A feature selection method for high-dimensional data, characterized by comprising:
obtaining an original data set to be processed, the original data set comprising a feature set, a number of samples and a category set, the category set comprising the category of each sample;
calculating the maximum information coefficient MIC between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset;
obtaining the effective value of each feature from the MIC and the redundancy value, and selecting the feature subset from the feature set according to the effective value.
2. The method according to claim 1, characterized in that the step of calculating the maximum information coefficient MIC between each feature in the feature set and the category set specifically comprises:
calculating, by formula (1), the maximum information coefficient MIC between each feature in the feature set and the category set:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }    (1)
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of samples, x is the number of segments into which the feature values are divided, y is the number of segments into which the sample categories are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the categories under an x*y grid partition.
3. The method according to claim 1, characterized in that the step of obtaining the effective value of each feature from the MIC and the redundancy value specifically comprises:
obtaining, by formula (2), the effective value of each feature from the MIC and the redundancy value:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue    (2)
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote the features f_i and f_j respectively, c is the category set, and (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i) is the redundancy value.
4. The method according to claim 3, characterized in that before the step of obtaining the effective value of each feature from the MIC and the redundancy value, the method further comprises:
defining the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
correspondingly, the step of obtaining the effective value of each feature from the MIC and the redundancy value and selecting the feature subset from the feature set according to the effective value specifically comprises:
choosing features from the feature set one by one in descending order of the MIC, and deleting each chosen feature from the feature set;
obtaining the effective value of the chosen feature from its maximum information coefficient MIC and redundancy value, and judging whether the effective value is greater than or equal to a preset threshold; if so, adding the feature to the optimal subset.
5. The method according to claim 4, characterized in that the step of obtaining the effective value of each feature from the MIC and the redundancy value and selecting the feature subset from the feature set according to the effective value further comprises:
screening out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and obtaining the effective value of each screened-out feature according to formula (2);
judging whether the effective value of each screened-out feature is greater than or equal to the preset threshold; if not, deleting the screened-out feature from the feature set, and choosing the next feature from the feature set.
6. A feature selection device for high-dimensional data, characterized by comprising:
an acquisition module, configured to obtain an original data set to be processed, the original data set comprising a feature set, a number of samples and a category set, the category set comprising the category of each sample;
a processing module, configured to calculate the maximum information coefficient MIC between each feature in the feature set and the category set, and the redundancy value between each feature and the already-selected feature subset;
a selection module, configured to obtain the effective value of each feature from the MIC and the redundancy value, and to select the feature subset from the feature set according to the effective value.
7. The device according to claim 6, characterized in that the processing module is specifically configured to calculate, by formula (1), the maximum information coefficient MIC between each feature in the feature set and the category set:
MIC(D) = max_{x*y < B(n)} { M(D)_{x,y} }    (1)
where B(n) is the number of grid cells drawn, ω(1) ≤ B(n) ≤ O(n^(1-ε)), 0 < ε < 1, n is the number of samples, x is the number of segments into which the feature values are divided, y is the number of segments into which the sample categories are divided, and M(D)_{x,y} denotes the normalized value of the maximum mutual information between the feature and the categories under an x*y grid partition.
8. The device according to claim 6, characterized in that the selection module is specifically configured to obtain, by formula (2), the effective value of each feature from the MIC and the redundancy value:
mMIC(j) = MIC(j, c) - (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i),  j ∈ S_residue    (2)
where S_main is the currently selected feature subset, S_residue is the remaining feature subset, i and j denote the features f_i and f_j respectively, c is the category set, and (1/|S_main|) Σ_{i ∈ S_main} MIC(j, i) is the redundancy value.
9. The device according to claim 8, characterized in that the device further comprises a predefinition module;
the predefinition module is configured to define, before the effective value of each feature is obtained from the MIC and the redundancy value, the approximate Markov blanket condition between two features:
MIC(f_i, c) > MIC(f_j, c) and MIC(f_j, c) < MIC(f_i, f_j)
correspondingly, the selection module is further configured to choose features from the feature set one by one in descending order of the MIC and to delete each chosen feature from the feature set; to obtain the effective value of the chosen feature from its maximum information coefficient MIC and redundancy value; and to judge whether the effective value is greater than or equal to the preset threshold, and if so, to add the feature to the optimal subset.
10. The device according to claim 9, characterized in that the selection module is further configured to screen out from the feature set, according to the approximate Markov blanket condition, all features that satisfy the approximate Markov blanket condition with the chosen feature, and to obtain the effective value of each screened-out feature according to formula (2); and to judge whether the effective value of each screened-out feature is greater than or equal to the preset threshold, and if not, to delete the screened-out feature from the feature set and choose the next feature from the feature set.
CN201610298079.XA 2016-05-06 2016-05-06 Feature selection method and device of high-dimension data Pending CN105975589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610298079.XA CN105975589A (en) 2016-05-06 2016-05-06 Feature selection method and device of high-dimension data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610298079.XA CN105975589A (en) 2016-05-06 2016-05-06 Feature selection method and device of high-dimension data

Publications (1)

Publication Number Publication Date
CN105975589A (en) 2016-09-28

Family

ID=56991294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610298079.XA Pending CN105975589A (en) 2016-05-06 2016-05-06 Feature selection method and device of high-dimension data

Country Status (1)

Country Link
CN (1) CN105975589A (en)


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570178B (en) * 2016-11-10 2020-09-29 重庆邮电大学 High-dimensional text data feature selection method based on graph clustering
CN106570178A (en) * 2016-11-10 2017-04-19 重庆邮电大学 High-dimension text data characteristic selection method based on graph clustering
CN107478963A (en) * 2017-09-30 2017-12-15 山东海兴电力科技有限公司 Single-phase ground fault line selecting method of small-electric current grounding system based on power network big data
CN109101626A (en) * 2018-08-13 2018-12-28 武汉科技大学 Based on the high dimensional data critical characteristic extraction method for improving minimum spanning tree
CN109443766A (en) * 2018-09-10 2019-03-08 中国人民解放军火箭军工程大学 A kind of heavy-duty vehicle gearbox gear Safety Analysis Method
CN110334546A (en) * 2019-07-08 2019-10-15 辽宁工业大学 Difference privacy high dimensional data based on principal component analysis optimization issues guard method
CN110334546B (en) * 2019-07-08 2021-11-23 辽宁工业大学 Difference privacy high-dimensional data release protection method based on principal component analysis optimization
CN110647943A (en) * 2019-09-26 2020-01-03 西北工业大学 Cutting tool wear monitoring method based on evolutionary data cluster analysis
CN111016914A (en) * 2019-11-22 2020-04-17 华东交通大学 Dangerous driving scene identification system based on portable terminal information and identification method thereof
CN111016914B (en) * 2019-11-22 2021-04-06 华东交通大学 Dangerous driving scene identification system based on portable terminal information and identification method thereof
CN111312403A (en) * 2020-01-21 2020-06-19 山东师范大学 Disease prediction system, device and medium based on instance and feature sharing cascade
CN112465251A (en) * 2020-12-08 2021-03-09 上海电力大学 Short-term photovoltaic output probability prediction method based on simplest gated neural network
CN115729957A (en) * 2022-11-28 2023-03-03 安徽大学 Unknown stream feature selection method and device based on maximum information coefficient
CN115729957B (en) * 2022-11-28 2024-01-19 安徽大学 Unknown stream feature selection method and device based on maximum information coefficient
CN116305292A (en) * 2023-05-17 2023-06-23 中国电子科技集团公司第十五研究所 Government affair data release method and system based on differential privacy protection
CN116305292B (en) * 2023-05-17 2023-08-08 中国电子科技集团公司第十五研究所 Government affair data release method and system based on differential privacy protection


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20160928)