CN107391433A - Feature selection method based on KDE conditional entropy of mixed features - Google Patents

Feature selection method based on KDE conditional entropy of mixed features

Info

Publication number
CN107391433A
CN107391433A CN201710526050.7A
Authority
CN
China
Prior art keywords
kde
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710526050.7A
Other languages
Chinese (zh)
Other versions
CN107391433B (en)
Inventor
代建华
徐思琪
高帅超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201710526050.7A priority Critical patent/CN107391433B/en
Publication of CN107391433A publication Critical patent/CN107391433A/en
Application granted granted Critical
Publication of CN107391433B publication Critical patent/CN107391433B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Character Discrimination (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a new feature selection method based on mixed-feature KDE conditional entropy. The method introduces mixed-feature KDE probability and mixed-feature KDE entropy, which unify discrete and continuous features within KDE entropy without discretizing the continuous data, extends information theory accordingly, and proposes a greedy feature selection method based on mixed-feature KDE conditional entropy.

Description

A feature selection method based on mixed-feature KDE conditional entropy
Technical field
The present invention relates to feature selection methods, and in particular to a feature selection method based on mixed-feature KDE conditional entropy.
Background technology
With growing storage and computing capacity, data sets keep increasing in both size and dimensionality, putting ever more pressure on data mining and machine learning tasks. Feature selection, an important preprocessing step for data mining, pattern recognition, and machine learning, eliminates redundant and irrelevant attributes from a large attribute set, reduces the data dimensionality, and improves the efficiency of subsequent algorithms.
Concepts from information theory such as entropy and mutual information occupy an important position in feature selection, offering advantages such as detecting nonlinear relations without prior knowledge and robustness to noise. However, feature selection methods based on information theory are mainly designed for discrete attributes; continuous features are usually discretized to fit the traditional methods. Kernel density estimation (KDE) is a non-parametric method for estimating the probability density function of a random variable. By combining KDE with information-theoretic entropy, existing feature selection methods based on KDE entropy achieve good results, but they handle only continuous features. Addressing this problem, the present invention extends information theory so that KDE-based entropy applies to mixed features.
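As a concrete reference point, a one-dimensional Gaussian-kernel density estimate can be written in a few lines. This sketch is illustrative only; the function names are ours, not the patent's:

```python
import numpy as np

def gaussian_kernel(u, h):
    """Gaussian kernel phi(u, h) with bandwidth h."""
    return np.exp(-(u / h) ** 2 / 2.0) / (h * np.sqrt(2.0 * np.pi))

def kde_pdf(x, samples, h):
    """KDE estimate of the density at x: (1/n) * sum_i phi(x - x_i, h)."""
    return gaussian_kernel(x - np.asarray(samples), h).mean()
```

No distributional assumption on the samples is needed, which is what makes KDE attractive for mixed-feature entropy estimation.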
Summary of the invention
The purpose of the invention is to handle feature selection over mixed features and to propose a new feature selection method based on mixed-feature KDE conditional entropy. The method introduces mixed-feature KDE entropy, extends information theory, and proposes a greedy feature selection method based on mixed-feature KDE conditional entropy.
The technical solution adopted by the invention is as follows:
A feature selection method based on mixed-feature KDE conditional entropy comprises the following steps (a code sketch of the resulting greedy loop follows step 8):
Step 1: input a data set U containing a decision feature D, where U has n samples, the decision feature takes values D = {1, 2, ..., N}, the discrete feature vector is A = {A1, A2, ..., Am}, the continuous feature vector is X = {X1, X2, ..., Xt}, h is the bandwidth parameter, and T is the stop threshold;
Step 2: let B be the set of selected features and E the set of unselected features, initialized as B = ∅ and E = A ∪ X; denote by ΔH the drop in conditional entropy before and after selecting one feature;
Step 3: form a temporary feature set B′ from each attribute S in E together with all attributes already in B;
Step 4: for each value x of the continuous feature set X′ in B′, each value d of the decision set D, and each value a of the discrete feature set A′ in B′, compute the KDE probabilities p̂(a, x, d), p̂(d|x, a), p̂(d, a|x), and p̂(x, a|d);
Step 5: from the KDE probabilities obtained in step 4, compute the mixed-feature KDE conditional entropies and the mixed-feature KDE joint entropy over the value domains of the categorical attribute set A′, the continuous attribute set X′, and the decision set D;
Step 6: select the attribute S* with the smallest conditional entropy, add it to the selected set, B = B ∪ {S*}, and delete it from the unselected set, E = E − {S*};
Step 7: from the update B = B ∪ {S*} in step 6, obtain the drop ΔH in conditional entropy before and after adding the new attribute;
Step 8: check whether the conditional-entropy drop ΔH of step 7 is greater than the threshold T and the number of features in B is smaller than the total number of features in the data; if both conditions hold, return to step 3; otherwise output the feature set B.
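Read as pseudocode, steps 1 to 8 describe a greedy forward search. The sketch below is an illustrative rendering under our own naming; the callable `cond_entropy`, mapping a feature subset to the estimate of H(D|·), is assumed to be supplied and is not the patent's reference implementation:

```python
def greedy_select(features, cond_entropy, T):
    """Greedy forward selection driven by a conditional-entropy criterion.

    features     -- candidate feature identifiers (discrete and continuous alike)
    cond_entropy -- callable mapping a list of features B to the estimate of H(D | B)
    T            -- stop threshold on the conditional-entropy drop
    """
    B, E = [], list(features)                 # step 2: B = empty set, E = A ∪ X
    prev_h = cond_entropy(B)                  # entropy of D before any selection
    while E:                                  # step 8 (part): stop when all features are selected
        # steps 3-5: score each unselected S by the entropy of the temporary set B ∪ {S}
        scores = {S: cond_entropy(B + [S]) for S in E}
        S_star = min(scores, key=scores.get)  # step 6: attribute with minimal conditional entropy
        B.append(S_star)
        E.remove(S_star)
        delta = prev_h - scores[S_star]       # step 7: drop in conditional entropy
        prev_h = scores[S_star]
        if delta <= T:                        # step 8: stop once the drop is no longer above T
            break
    return B
```

Note that, following the order of steps 6 to 8, the last attribute is committed before the stopping test, so it remains in B even when its entropy drop falls below T.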
The KDE probability p̂(a, x, d) in step 4 is generated by formula (1):

$$\hat{p}(a,x,d)=\hat{p}(d,a)\,\hat{p}(x\mid d,a)=\frac{n_{d,a}}{n}\cdot\frac{1}{n_{d,a}}\sum_{i\in I_{d,a}}\phi(x-x_i,h)=\frac{1}{n}\sum_{i\in I_{d,a}}\phi(x-x_i,h)\qquad(1)$$

The KDE probability p̂(d|x, a) in step 4 is generated by formula (2):

$$\hat{p}(d\mid x,a)=\frac{\hat{p}(d,x,a)}{\hat{p}(x,a)}=\frac{\sum_{i\in I_{d,a}}\phi(x-x_i,h)}{\sum_{i\in I_{a}}\phi(x-x_i,h)}\qquad(2)$$

The KDE probability p̂(d, a|x) in step 4 is generated by formula (3):

$$\hat{p}(d,a\mid x)=\frac{\hat{p}(d,x,a)}{\hat{p}(x)}=\frac{\sum_{i\in I_{d,a}}\phi(x-x_i,h)}{\sum_{i=1}^{n}\phi(x-x_i,h)}\qquad(3)$$

The KDE probability p̂(x, a|d) in step 4 is generated by formula (4):

$$\hat{p}(x,a\mid d)=\frac{\hat{p}(d,x,a)}{p(d)}=\frac{1}{n_{d}}\sum_{i\in I_{d,a}}\phi(x-x_i,h)\qquad(4)$$

Here n_{d,a} and n_d count the samples taking the values (d, a) and d respectively, I_{d,a} and I_a are the corresponding index sets, and φ(·, h) is the kernel function with bandwidth h; the formulas appear with their intermediate steps in claims 2 to 5. A code sketch of the four probabilities follows.
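As an illustration of formulas (1) to (4), the sketch below evaluates the four KDE probabilities at a single query point with a Gaussian kernel, for one discrete feature A and one continuous feature X. The array layout and function names are our assumptions; the patent's feature sets A′ and X′ may contain several features each:

```python
import numpy as np

def kde_probabilities(x, a, d, X, A, D, h):
    """Formulas (1)-(4): mixed-feature KDE probabilities at a query (x, a, d).

    X -- (n,) continuous feature values of the n samples
    A -- (n,) discrete feature values;  D -- (n,) decision values
    """
    phi = np.exp(-((x - X) / h) ** 2 / 2.0) / (h * np.sqrt(2.0 * np.pi))
    n = len(X)
    in_da = (D == d) & (A == a)                          # index set I_{d,a}
    in_a = (A == a)                                      # index set I_a
    p_axd = phi[in_da].sum() / n                         # formula (1)
    p_d_given_xa = phi[in_da].sum() / phi[in_a].sum()    # formula (2)
    p_da_given_x = phi[in_da].sum() / phi.sum()          # formula (3)
    p_xa_given_d = phi[in_da].sum() / (D == d).sum()     # formula (4): (1/n_d) * sum over I_{d,a}
    return p_axd, p_d_given_xa, p_da_given_x, p_xa_given_d
```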
The conditional entropies of step 5 are generated by formulas (5), (6) and (7), in which, because the sample probability of a continuous random variable cannot be obtained by counting, the probabilities are the KDE estimates of formulas (1) to (4); an illustrative estimator is sketched below.
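The images of formulas (5) to (7) do not survive in this text, so the following sketch should be read as an assumption rather than the patent's exact definition: it uses the common resubstitution estimator Ĥ(D|X′, A′) = −(1/n) Σ_j log₂ p̂(d_j|x_j, a_j), plugging formula (2) into the classical conditional-entropy definition:

```python
import numpy as np

def kde_conditional_entropy(X, A, D, h):
    """Assumed resubstitution estimate of H(D | X', A') built on formula (2)."""
    n = len(X)
    total = 0.0
    for j in range(n):
        phi = np.exp(-((X[j] - X) / h) ** 2 / 2.0) / (h * np.sqrt(2.0 * np.pi))
        in_da = (D == D[j]) & (A == A[j])    # samples matching the j-th (d, a) pair
        in_a = (A == A[j])
        total -= np.log2(phi[in_da].sum() / phi[in_a].sum())  # -log2 p(d_j | x_j, a_j)
    return total / n
```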
The beneficial effects of the invention are:
1. The invention proposes mixed-feature KDE probability and mixed-feature KDE entropy, which unify discrete and continuous features within KDE entropy without discretizing the continuous data.
2. The invention uses the mixed-feature KDE conditional entropy as the criterion for evaluating features and performs feature selection with a greedy algorithm.
3. Experimental comparison with methods that discretize continuous data shows that the algorithm achieves better results in various classification experiments.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the present invention;
Fig. 2 is the implementation flow chart of the present invention.
Embodiment
The present invention is further described below with reference to a specific embodiment.
The method flow of the present invention is shown in Fig. 1. Based on the definitions of mixed-feature KDE entropy above, the feature selection method based on mixed-feature KDE conditional entropy is described in detail as follows:
Step 1 (101): input a data set U containing the decision feature D, where U has n samples, D = {1, 2, ..., N}, the discrete feature vector is A = {A1, A2, ..., Am}, the continuous feature vector is X = {X1, X2, ..., Xt}, h is the bandwidth parameter, and T is the stop threshold;
Step 2 (102): let B be the selected feature set and E the unselected feature set, initialized as B = ∅ and E = A ∪ X; denote by ΔH the drop in conditional entropy before and after selecting one feature;
Step 3 (103): form a temporary feature set B′ from each attribute S in E together with all attributes already in B, and perform the following steps;
Step 4 (104): for each value x of the continuous feature set X′ in B′, each value d of the decision set D, and each value a of the discrete feature set A′ in B′, compute the KDE probabilities p̂(a, x, d), p̂(d|x, a), p̂(d, a|x), and p̂(x, a|d);
The KDE probabilities p̂(a, x, d), p̂(d|x, a), p̂(d, a|x) and p̂(x, a|d) are generated by formulas (1), (2), (3) and (4) above, respectively.
Step 5 (105): from the KDE probabilities obtained in step 4, compute the mixed-feature KDE conditional entropies and the mixed-feature KDE joint entropy over the value domains of the categorical attribute set A′, the continuous attribute set X′, and the decision set D;
The conditional entropies of step 5 are generated by formulas (5), (6) and (7) above, in which the sample probability of a continuous random variable is replaced by its KDE estimate.
Step 6 (106): select the attribute S* with the smallest conditional entropy, add it to the selected set, B = B ∪ {S*}, and delete it from the unselected set, E = E − {S*};
Step 7 (107): from the update B = B ∪ {S*} in step 6, obtain the drop ΔH in conditional entropy before and after adding the new attribute;
Step 8 (108): check whether the conditional-entropy drop of step 7 is greater than the threshold T and the number of features in B is smaller than the total number of features in the data; if both conditions hold, return to step 3; otherwise output the feature set B.
The implementation flow is shown in Fig. 2 and proceeds as follows (a toy end-to-end run is sketched after this list):
(1) Input the data set U, the bandwidth h, and the stop threshold T;
(2) Obtain the feature subset by the proposed feature selection method based on mixed-feature KDE conditional entropy;
(3) Output the result.
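Tying the pieces together, a toy end-to-end run of flow (1) to (3) on synthetic data might look as follows. The product-kernel handling of multi-feature subsets is our assumption, and the data are random stand-ins, not the UCI set used in the experiment:

```python
import numpy as np

def phi(u, h):
    return np.exp(-(u / h) ** 2 / 2.0) / (h * np.sqrt(2.0 * np.pi))

def cond_entropy(cols, kinds, D, h):
    """Assumed H(D | B) for a subset B given as columns with kinds 'disc'/'cont'."""
    n = len(D)
    if not cols:
        # no features selected yet: plain entropy of D from class frequencies
        _, counts = np.unique(D, return_counts=True)
        p = counts / n
        return -(p * np.log2(p)).sum()
    total = 0.0
    for j in range(n):
        w = np.ones(n)
        for c, kind in zip(cols, kinds):
            # Gaussian kernel on continuous columns, exact match on discrete columns
            w = w * (phi(c[j] - c, h) if kind == "cont" else (c == c[j]).astype(float))
        total -= np.log2(w[D == D[j]].sum() / w.sum())
    return total / n

# (1) input: data set U (here a toy stand-in), bandwidth h, stop threshold T
rng = np.random.default_rng(0)
n = 200
D = rng.integers(0, 2, n)                            # decision feature
A1 = D ^ (rng.random(n) < 0.1).astype(int)           # informative discrete feature
X1 = D + rng.normal(0.0, 0.5, n)                     # informative continuous feature
X2 = rng.normal(0.0, 1.0, n)                         # irrelevant continuous feature
features = {"A1": (A1, "disc"), "X1": (X1, "cont"), "X2": (X2, "cont")}
T, h = 0.01, 1.0 / np.log2(n)                        # values used in Experimental example 1 (k = 1)

# (2) greedy selection by mixed-feature KDE conditional entropy (steps 2-8)
B, E = [], list(features)
prev_h = cond_entropy([], [], D, h)
while E:
    scores = {S: cond_entropy([features[f][0] for f in B + [S]],
                              [features[f][1] for f in B + [S]], D, h) for S in E}
    S_star = min(scores, key=scores.get)
    B.append(S_star)
    E.remove(S_star)
    delta, prev_h = prev_h - scores[S_star], scores[S_star]
    if delta <= T:
        break

# (3) output the result
print("selected feature subset:", B)
```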
Experimental example 1:
The proposed method (abbreviated GS_KDE) was run on the real data set hepatitis and compared against two discretization-based variants to demonstrate its effectiveness: equal-width discretization (GS_eqW, with the number of intervals set to 2, 4, 6) and equal-frequency discretization (GS_eqF, with the number of intervals set to 2, 4, 6). For each method the best parameters were selected. The results are shown in Table 1. The data set comes from the public UCI repository (http://archive.ics.uci.edu/ml); the stop threshold is T = 0.01 and the bandwidth is h = k/log2(n) (k = 1, 2, 3), where n is the number of samples. Classification accuracy is the average of five-fold cross-validation, using the classifiers KNN (k = 3), C4.5, and PART.
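For orientation, the stated protocol (five-fold cross-validation, KNN with k = 3) could be reproduced along the following lines with scikit-learn; the data matrix and the selected columns below are random placeholders, not the actual hepatitis data or the GS_KDE output, and C4.5 and PART (Weka classifiers) are omitted:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# placeholders standing in for the UCI hepatitis data and a selected feature subset
X_full = np.random.default_rng(0).normal(size=(155, 19))
y = np.random.default_rng(1).integers(0, 2, 155)
selected = [0, 3, 7]                                  # hypothetical GS_KDE output

scores = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                         X_full[:, selected], y, cv=5)  # five-fold cross-validation
print("mean accuracy:", scores.mean())
```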
The experimental results show that the proposed feature selection method based on mixed-feature KDE conditional entropy (GS_KDE) outperforms GS_eqW and GS_eqF in classification accuracy, and also outperforms the full feature set.
Table 1: Number of features and classification accuracy
The above embodiment does not limit the present invention; any embodiment that satisfies the claims of the present application falls within the protection scope of the present invention.

Claims (8)

1. A feature selection method based on mixed-feature KDE conditional entropy, characterized by comprising the following steps:
Step 1: input a data set U containing a decision feature D, where U has n samples, the decision feature takes values D = {1, 2, ..., N}, the discrete feature vector is A = {A1, A2, ..., Am}, the continuous feature vector is X = {X1, X2, ..., Xt}, h is the bandwidth parameter, and T is the stop threshold;
Step 2: let B be the set of selected features and E the set of unselected features, initialized as B = ∅ and E = A ∪ X; denote by ΔH the drop in conditional entropy before and after selecting one feature;
Step 3: form a temporary feature set B′ from each attribute S in E together with all attributes already in B, and perform the following steps;
Step 4: for each value x of the continuous feature set X′ in B′, each value d of the decision set D, and each value a of the discrete feature set A′ in B′, compute the KDE probabilities p̂(a, x, d), p̂(d|x, a), p̂(d, a|x), and p̂(x, a|d);
Step 5: from the KDE probabilities obtained in step 4, compute the mixed-feature KDE conditional entropies and the mixed-feature KDE joint entropy over the value domains of the categorical attribute set A′, the continuous attribute set X′, and the decision set D;
Step 6: select the attribute S* with the smallest conditional entropy, add it to the selected set, B = B ∪ {S*}, and delete it from the unselected set, E = E − {S*};
Step 7: from the update B = B ∪ {S*} in step 6, obtain the drop ΔH in conditional entropy before and after adding the new attribute;
Step 8: check whether the conditional-entropy drop ΔH of step 7 is greater than the threshold T and the number of features in B is smaller than the total number of features in the data; if both conditions hold, return to step 3; otherwise output the feature set B.
2. The feature selection method based on mixed-feature KDE conditional entropy according to claim 1, characterized in that the KDE probability p̂(a, x, d) in step 4 is generated by formula (1):
$$\hat{p}(a,x,d)=\hat{p}(d,a)\,\hat{p}(x\mid d,a)=\frac{n_{d,a}}{n}\cdot\frac{1}{n_{d,a}}\sum_{i\in I_{d,a}}\phi(x-x_i,h)=\frac{1}{n}\sum_{i\in I_{d,a}}\phi(x-x_i,h)\qquad(1)$$
3. The feature selection method based on mixed-feature KDE conditional entropy according to claim 1, characterized in that the KDE probability p̂(d|x, a) in step 4 is generated by formula (2):
$$\hat{p}(d\mid x,a)=\frac{\hat{p}(d,x,a)}{\hat{p}(x,a)}=\frac{\sum_{i\in I_{d,a}}\phi(x-x_i,h)}{\sum_{i\in I_{a}}\phi(x-x_i,h)}\qquad(2)$$
4. The feature selection method based on mixed-feature KDE conditional entropy according to claim 1, characterized in that the KDE probability p̂(d, a|x) in step 4 is generated by formula (3):
$$\hat{p}(d,a\mid x)=\frac{\hat{p}(d,x,a)}{\hat{p}(x)}=\frac{\frac{1}{n}\sum_{i\in I_{d,a}}\phi(x-x_i,h)}{\frac{1}{n}\sum_{i=1}^{n}\phi(x-x_i,h)}=\frac{\sum_{i\in I_{d,a}}\phi(x-x_i,h)}{\sum_{i=1}^{n}\phi(x-x_i,h)}\qquad(3)$$
5. The feature selection method based on mixed-feature KDE conditional entropy according to claim 1, characterized in that the KDE probability p̂(x, a|d) in step 4 is generated by formula (4):
$$\hat{p}(x,a\mid d)=\frac{\hat{p}(d,x,a)}{p(d)}=\frac{\frac{1}{n}\sum_{i\in I_{d,a}}\phi(x-x_i,h)}{\frac{n_{d}}{n}}=\frac{1}{n_{d}}\sum_{i\in I_{d,a}}\phi(x-x_i,h)\qquad(4)$$
6. The feature selection method based on mixed-feature KDE conditional entropy according to claim 1, characterized in that the first conditional entropy in step 5 is generated by formula (5):
7. The feature selection method based on mixed-feature KDE conditional entropy according to claim 1, characterized in that the second conditional entropy in step 5 is generated by formula (6):
8. The feature selection method based on mixed-feature KDE conditional entropy according to claim 1, characterized in that the third conditional entropy in step 5 is generated by formula (7):
CN201710526050.7A 2017-06-30 2017-06-30 Feature selection method based on KDE conditional entropy of mixed features Active CN107391433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710526050.7A CN107391433B (en) 2017-06-30 2017-06-30 Feature selection method based on KDE conditional entropy of mixed features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710526050.7A CN107391433B (en) 2017-06-30 2017-06-30 Feature selection method based on KDE conditional entropy of mixed features

Publications (2)

Publication Number Publication Date
CN107391433A true CN107391433A (en) 2017-11-24
CN107391433B CN107391433B (en) 2021-04-13

Family

ID=60334870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710526050.7A Active CN107391433B (en) 2017-06-30 2017-06-30 Feature selection method based on KDE conditional entropy of mixed features

Country Status (1)

Country Link
CN (1) CN107391433B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115238148A (en) * 2022-09-21 2022-10-25 杭州衡泰技术股份有限公司 Characteristic combination screening method for multi-party enterprise joint credit rating and application

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6317517B1 (en) * 1998-11-30 2001-11-13 Regents Of The University Of California Statistical pattern recognition
WO2009067655A2 (en) * 2007-11-21 2009-05-28 University Of Florida Research Foundation, Inc. Methods of feature selection through local learning; breast and prostate cancer prognostic markers
CN106570887A (en) * 2016-11-04 2017-04-19 天津大学 Adaptive Mean Shift target tracking method based on LBP features

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6317517B1 (en) * 1998-11-30 2001-11-13 Regents Of The University Of California Statistical pattern recognition
WO2009067655A2 (en) * 2007-11-21 2009-05-28 University Of Florida Research Foundation, Inc. Methods of feature selection through local learning; breast and prostate cancer prognostic markers
CN106570887A (en) * 2016-11-04 2017-04-19 天津大学 Adaptive Mean Shift target tracking method based on LBP features

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
MIN HAN, ET AL: "Sparse kernel density estimation and its application in variable selection", Neurocomputing *
NOJUN KWAK, ET AL: "Input Feature Selection by Mutual Information Based on Parzen Window", IEEE Transactions on Pattern Analysis and Machine Intelligence *
PABLO A. ESTÉVEZ, ET AL: "Normalized Mutual Information Feature Selection", IEEE Transactions on Neural Networks *
ZHIHONG ZHANG, ET AL: "Kernel Entropy-Based Unsupervised Spectral Feature Selection", International Journal of Pattern Recognition *
LIU ZHENBO, ET AL: "Dominance rough set model of expected ordered information systems", Digital Design (数码设计) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115238148A (en) * 2022-09-21 2022-10-25 杭州衡泰技术股份有限公司 Characteristic combination screening method for multi-party enterprise joint credit rating and application

Also Published As

Publication number Publication date
CN107391433B (en) 2021-04-13

Similar Documents

Publication Publication Date Title
Negash et al. Artificial neural network based production forecasting for a hydrocarbon reservoir under water injection
Zheng et al. A rolling bearing fault diagnosis method based on multi-scale fuzzy entropy and variable predictive model-based class discrimination
Si et al. A novel approach for coal seam terrain prediction through information fusion of improved D–S evidence theory and neural network
Esmael et al. Multivariate time series classification by combining trend-based and value-based approximations
Boufoussi et al. Functional differential equations in Hilbert spaces driven by a fractional Brownian motion
CN103995237A (en) Satellite power supply system online fault diagnosis method
CN104933444A (en) Design method of multi-dimension attribute data oriented multi-layered clustering fusion mechanism
CN113705099B (en) Social platform rumor detection model construction method and detection method based on contrast learning
CN104915679A (en) Large-scale high-dimensional data classification method based on random forest weighted distance
Singh et al. An ensemble approach for feature selection of Cyber Attack Dataset
CN106549675A (en) A kind of average dependent quadrature matching pursuit algorithm based on compressed sensing
Dorj et al. Anomaly detection approach using hidden Markov model
CN107391433A (en) A kind of feature selection approach based on composite character KDE conditional entropies
Bogdanov et al. Sktr: Trace recovery from stochastically known logs
CN108985462A (en) Unsupervised feature selection approach based on mutual information and fractal dimension
CN105224954A (en) A kind of topic discover method removing the impact of little topic based on Single-pass
CN106919650A (en) A kind of textural anomaly detection method of increment parallel type Dynamic Graph
Mahoney et al. Trajectory boundary modeling of time series for anomaly detection
CN104657473A (en) Large-scale data mining method capable of guaranteeing quality monotony
Chiuso et al. Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
CN111159961A (en) Abnormity detection method and system based on curve data
Cho et al. Development of locally specified soil stratification method with CPT data based on machine learning techniques
Liu et al. Experimental assessment of gradual deformation method
Wei et al. A symbolic tree model for oil and gas production prediction using time-series production data
KR101093521B1 (en) Pattern analysis method for continuously generating data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant