CN108073553A

CN108073553A - Unsupervised discretization method for continuous attribute data based on information entropy

Info

Publication number: CN108073553A
Application number: CN201711450629.6A
Authority: CN
Inventors: 马生俊; 陈旺虎; 郭宏乐; 乔保民; 李新田
Original assignee: Northwest Normal University
Current assignee: Northwest Normal University
Priority date: 2017-12-27
Filing date: 2017-12-27
Publication date: 2018-05-25
Anticipated expiration: 2037-12-27
Also published as: CN108073553B

Abstract

The present invention relates to the technical field of continuous attribute discretization for big data, and more particularly to an unsupervised discretization method for continuous attribute data based on information entropy. The steps are as follows. Step 1: traverse all value records of any continuous attribute, count the discrete granularity $|n_j|$ of the attribute and the probability $q_{ji}$ of each distinct value, and record the maximum $n_j^{\max}$ and the minimum $n_j^{\min}$. Step 2: derive, from the formula for information entropy, the formula for the degree of disorder of the values of any continuous attribute $n_j$, and compute the degree of disorder of the attribute with it. Step 3: round the degree of disorder down to obtain the number of breakpoints. Step 4: compute the width of each interval with the equal-width method and determine the position of each breakpoint. Step 5: discretize the continuous attribute $n_j$. This new way of determining the number of breakpoints adapts to the original data; the discretization of each attribute is independent of the others and does not rely on other attributes, and the computational efficiency is higher.

Description

Unsupervised discretization method for continuous attribute data based on information entropy

Technical field

The present invention relates to the technical field of continuous attribute discretization for big data, and more particularly to an unsupervised discretization method for continuous attribute data based on information entropy.

Background art

Continuous attribute discretization divides the value range of a continuous attribute into several intervals, each corresponding to a unique discrete value, so that the original values are converted into discrete values. Researchers in China and abroad have proposed a great many methods for discretizing continuous (numerical) attributes. These can be classified from different angles: top-down versus bottom-up, supervised versus unsupervised, global versus local, static versus dynamic, and single-attribute versus multi-attribute. The essence of continuous attribute discretization is determining the number of discrete values (the number of intervals) and the breakpoint positions. From the standpoint of how the number of discrete values is determined, there are mainly the following classes of methods.

First, methods in which the user subjectively specifies the number of discrete values. Typical examples are equal-width discretization (EWD), equal-frequency discretization (EFD), clustering-based discretization, and the representative CADD (Class-Attribute Dependent Discretizer) method. All of these require the user to specify the number of discrete values K in advance. Although they are effective to some extent, they offer no theoretical basis for choosing K, which is therefore hard to set accurately. The main drawback of this class is that the user must specify K.

Second, methods that assume a relation between the number of discrete values K and either the number of records f in an interval or the interval width d. Typical examples are the PD (Proportional Discretizer) method and the FIMUS method (Rahman and Islam, 2014). PD assumes that K equals the number of records f in each interval and that K*f = D, where D is the total number of records; in essence it is an equal-frequency method. FIMUS assumes that K equals the width d of each interval and that K*d equals the extent of the value range [min, max]; in essence it is an equal-width method. Both methods avoid user-supplied parameters, but they require assumptions to be made in advance, namely that the product of the per-interval record count (or the interval width) and the number of intervals equals the total number of records (or the value range), and these assumptions lack a theoretical basis.

Third, methods that determine K from the relation with class attributes or categorical attributes. There are many methods based on this idea; representative ones include CAIM (Class-Attribute Interdependence Maximization), LFD (Low Frequency Discretizer), and MDLP (Minimum Description Length Principle). CAIM is an improvement on CADD. It uses the caim value as the discretization criterion, aims to maximize the interdependence between class and attribute, and uses a heuristic rule based on the class-attribute relation to generate as few intervals as possible. LFD improves on methods such as CAIM, which discretize by considering only the relation between the continuous attribute and the class attribute, and on the majority of methods, which take high-frequency values as candidate breakpoints. When discretizing a continuous attribute, LFD considers the relation with all attributes, including class attributes, categorical attributes, and attributes that have already been discretized, and it takes low-frequency values as candidate breakpoints. Methods like CAIM and LFD therefore depend on class attributes or other attributes during discretization. MDLP is a classical method based on information entropy and minimum description length; it selects breakpoints recursively, tries to minimize the information content of the model, and uses the MDL principle to determine a suitable number of discrete values. Likewise, many other methods discretize using the idea of information entropy, but most of them use some form of entropy to decide whether to merge or split intervals.

Among the above methods, those in which the user subjectively specifies the number of discrete values lack adaptability to the original data; those based on assumed conditions lack a theoretical basis; heuristic methods make the discretization process depend on other attributes; and the entropy-based methods do not determine the number of discrete values from the information entropy of the continuous attribute itself, and their computational cost is relatively high.

Summary of the invention

To solve the above problems, the present invention provides an unsupervised discretization method for continuous attribute data based on information entropy. Each distinct value of a continuous attribute is regarded as a discrete event, the degree of disorder of the attribute's values is obtained by computing the information entropy, and this degree of disorder is used as the basis for determining the number of breakpoints; that is, the number of breakpoints is derived from the information entropy of the continuous attribute's values.

The specific technical solution adopted by the present invention is as follows:

An unsupervised discretization method for continuous attribute data based on information entropy, comprising the following steps:

Step 1, analysis of attribute values: traverse all value records of any continuous attribute $n_j$ in the data set, count the discrete granularity $|n_j|$ of the attribute, compute the probability of each distinct value, and record the maximum $n_j^{\max}$ and the minimum $n_j^{\min}$;

Step 2, computing the degree of disorder of the attribute: from the formula for information entropy, the formula for the degree of disorder of the values of any continuous attribute $n_j$ in the data set is derived, i.e.,

$$T_j = \sum_{i=1}^{|n_j|} -q_{ji} \log_2 q_{ji} \tag{1}$$

where $T_j$ denotes the degree of disorder of the j-th continuous attribute, $|n_j|$ denotes the discrete granularity of the j-th continuous attribute, and $q_{ji}$ denotes the probability of the i-th value of the attribute; the degree of disorder of the attribute's values is computed according to formula (1);

Step 3, computing the number of breakpoints: the number of breakpoints is obtained by rounding the degree of disorder down, i.e.,

$$NumP_j = \lfloor T_j \rfloor \tag{2}$$

where $NumP_j$ denotes the number of breakpoints of the j-th attribute;

Step 4, determining the breakpoint positions: the width of each interval is computed with the equal-width method, i.e.,

$$w_j = \frac{n_j^{\max} - n_j^{\min}}{NumP_j + 1} \tag{3}$$

where $w_j$ denotes the width of each interval and $n_j^{\max} - n_j^{\min}$ denotes the extent of the value range of the j-th attribute, i.e., of $[n_j^{\min}, n_j^{\max}]$; the number of intervals is $NumP_j + 1$. The position of each breakpoint is determined according to formula (4),

$$DoC_j = \left\{ P^{th}_{n_j^{\min} + (th \cdot w_j)} \;\middle|\; 1 \le th \le NumP_j,\ th \in N^{*} \right\} \tag{4}$$

where $DoC_j$ is the set of breakpoints of the attribute, $P^{th}_{n_j^{\min} + (th \cdot w_j)}$ denotes each breakpoint, th denotes the ordinal of the breakpoint, and $n_j^{\min} + (th \cdot w_j)$ denotes the position of the th-th breakpoint. The range of each interval is determined by the breakpoint positions, as follows: $[n_j^{\min},\ n_j^{\min} + w_j)$, $[n_j^{\min} + w_j,\ n_j^{\min} + 2w_j)$, ..., $[n_j^{\min} + (NumP_j \cdot w_j),\ n_j^{\max}]$;

Step 5, discretizing the continuous attribute $n_j$: traverse all value records of $n_j$ and replace each value with the discrete value assigned to the interval in which it lies.

Further, the discrete granularity $|n_j|$ in step 1 is the number of distinct values.

Further, in step 1, the probability $q_{ji}$ of each distinct value is computed as

$q_{ji} = n_{ji} / n_{j,\mathrm{total}}$, where $i = 1, 2, \ldots, |n_j|$, $n_{ji}$ is the number of occurrences of the i-th distinct value, and $n_{j,\mathrm{total}}$ is the total number of values; both $n_{ji}$ and $n_{j,\mathrm{total}}$ are counted while traversing the data.
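The counting described for step 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `value_probabilities` and the sample column are hypothetical and do not appear in the patent.

```python
from collections import Counter

def value_probabilities(values):
    """Return (granularity, probabilities, minimum, maximum) for one attribute column."""
    counts = Counter(values)   # occurrences of each distinct value (n_ji)
    total = len(values)        # total number of recorded values (n_j,total)
    probs = {v: c / total for v, c in counts.items()}  # q_ji = n_ji / n_j,total
    return len(counts), probs, min(values), max(values)

# Hypothetical sample column with three distinct values.
granularity, probs, lo, hi = value_probabilities([1.0, 1.0, 2.0, 3.0])
```

A single pass over the column yields everything step 1 asks for: the discrete granularity, the per-value probabilities, and the minimum and maximum.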

Further, in step 5 each interval is assigned a unique discrete value, and the discrete values of different intervals differ from one another.

Following information entropy theory, the present invention regards each distinct value of a continuous attribute as a discrete event, computes the degree of disorder of the values, and uses it as the basis for determining the number of breakpoints. This new way of determining the number of breakpoints adapts better to the original data; the discretization of each attribute is independent of the others and does not rely on other attributes, and the computational efficiency is higher.

Brief description of the drawings

Fig. 1 is a flow diagram of the method of the present invention.

Detailed description

The present invention is described in detail below with reference to the accompanying drawing.

First, the terms used in the present invention and the discretization process are briefly described to make the method easier to understand.

Discrete granularity: for a given data set, the number of distinct values of any attribute is called the discrete granularity of that attribute. The discrete granularity of a categorical attribute is denoted |c|, and that of a continuous attribute is denoted |n|.

Information entropy is a measure of uncertainty and is defined over the occurrence probabilities of discrete random events. Similarly, each distinct value of a continuous attribute can be regarded as a discrete random event, so the discrete granularity of the attribute equals the number of discrete random events. Computing the information entropy over the distinct values of a continuous attribute characterizes the uncertainty of the attribute, that is, the degree of disorder of its values, and the number of breakpoints for discretizing the attribute can be determined from this degree of disorder.

The formula for information entropy is as follows:

$$H = -\sum_{i=1}^{m} p_i \log_2 p_i$$

where m is the number of distinct values of the source symbol and $p_i$ denotes the probability that the i-th value occurs.
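As a minimal sketch of this definition, the following Python function evaluates the base-2 Shannon entropy of a list of probabilities. The function name `entropy_bits` is hypothetical, and skipping zero-probability terms is a common convention assumed here rather than something the patent states.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

fair_coin = entropy_bits([0.5, 0.5])   # two equally likely values
certain = entropy_bits([1.0])          # a single value that always occurs
```

Two equally likely values give one bit of entropy, while a single certain value gives zero, which matches the intuition that entropy measures how disordered the values are.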

For any continuous attribute of a data set, discretization is the problem of dividing its value range $[n^{\min}, n^{\max}]$ (where $n^{\min}$ denotes the minimum of attribute n and $n^{\max}$ its maximum) into k intervals; the position of each division is called a breakpoint. The value range of the continuous attribute is divided into k intervals, each represented by one of the discrete values 1, 2, ..., k; P denotes the breakpoints, of which there are k-1 in total. Hence the problem of continuous attribute discretization is to determine the intervals and the breakpoints, which can be expressed as DoC = {K, P}, where K is the set of intervals, K = {1, 2, ..., k}, and P is the set of breakpoints, $P = \{p_1, p_2, \ldots, p_{k-1}\}$.

The total number of intervals equals the number of breakpoints plus 1, i.e., |K| = |P| + 1, so the essence of continuous attribute discretization can be understood as determining the number of breakpoints and the position of each breakpoint, i.e., $DoC = \{P_s^{th}\}$, where th denotes the ordinal of the breakpoint and s denotes its position.

Next, the steps of the method of the present invention are described in detail.

Step 1, analysis of attribute values: traverse all value records of any continuous attribute $n_j$ in the data set; count the total number of values $n_{j,\mathrm{total}}$, the discrete granularity $|n_j|$ (the number of distinct values), and the number of occurrences $n_{ji}$ of each distinct value; compute the probability of each value, $q_{ji} = n_{ji} / n_{j,\mathrm{total}}$, $i = 1, 2, \ldots, |n_j|$; and record the maximum $n_j^{\max}$ and the minimum $n_j^{\min}$.

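Step 2, computing the degree of disorder of the attribute: from the formula for information entropy, the formula for the degree of disorder of the values of any continuous attribute $n_j$ in the data set is derived, i.e.,

$$T_j = \sum_{i=1}^{|n_j|} -q_{ji} \log_2 q_{ji} \tag{1}$$

where $T_j$ denotes the degree of disorder of the j-th continuous attribute, $|n_j|$ denotes the discrete granularity of the j-th continuous attribute, i.e., the number of distinct values, and $q_{ji}$ denotes the probability of the i-th value of the attribute. The degree of disorder of the attribute's values is computed according to formula (1).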

For example, when the discrete granularity of a continuous attribute is 1, the information entropy is at its minimum, with value 0, which means that 0 breakpoints are inserted, i.e., all values lie in the same interval. When the discrete granularity of a continuous attribute is D (all D values are distinct and therefore equally probable), the information entropy is at its maximum, with value $\log_2 D$, which means that $\lfloor \log_2 D \rfloor$ breakpoints are inserted, i.e., all values are divided into $\lfloor \log_2 D \rfloor + 1$ intervals.
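As a worked instance of the maximum case, assume (purely for illustration; the number is not from the patent) an attribute with D = 8 records that all differ, so that every value has probability 1/8. Substituting into the entropy sum and rounding down gives

$$T_j = \sum_{i=1}^{8} -\tfrac{1}{8} \log_2 \tfrac{1}{8} = 8 \cdot \tfrac{1}{8} \cdot 3 = 3, \qquad NumP_j = \lfloor 3 \rfloor = 3,$$

so 3 breakpoints are inserted and the values are divided into 4 intervals.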

Step 3, computing the number of breakpoints: in most cases the result of formula (1) is not an integer, but the number of breakpoints must be an integer. To keep the number of intervals, i.e., the number of discrete values, small, the number of breakpoints is obtained by rounding the degree of disorder down, i.e.,

$$NumP_j = \lfloor T_j \rfloor \tag{2}$$

where $NumP_j$ denotes the number of breakpoints of the j-th attribute.
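Steps 2 and 3 together can be sketched in a few lines of Python. The function name `breakpoint_count` and the sample column are hypothetical; the sketch simply computes the base-2 entropy of the value distribution and rounds it down.

```python
import math
from collections import Counter

def breakpoint_count(values):
    """Floor of the entropy (in bits) of the distribution of distinct values."""
    total = len(values)
    disorder = sum(-(c / total) * math.log2(c / total)
                   for c in Counter(values).values())
    return math.floor(disorder)

# Probabilities 0.5, 0.25, 0.25 give a degree of disorder of 1.5.
num_breakpoints = breakpoint_count([1.0, 1.0, 2.0, 3.0])
```

Because the result is floored, an entropy that lands extremely close to an integer may round either way in floating-point arithmetic; a tolerance could be added if that matters for a given data set.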

Step 4, determining the breakpoint positions: for simplicity, ease of understanding, and low computational cost, the equal-width method is chosen to compute the width of each interval, i.e.,

$$w_j = \frac{n_j^{\max} - n_j^{\min}}{NumP_j + 1} \tag{3}$$

where $w_j$ denotes the width of each interval and $n_j^{\max} - n_j^{\min}$ denotes the extent of the value range of the j-th attribute; the total number of intervals is $NumP_j + 1$. The position of each breakpoint is determined according to formula (4),

$$DoC_j = \left\{ P^{th}_{n_j^{\min} + (th \cdot w_j)} \;\middle|\; 1 \le th \le NumP_j,\ th \in N^{*} \right\} \tag{4}$$

where $DoC_j$ is the set of breakpoints of the attribute, $P^{th}_{n_j^{\min} + (th \cdot w_j)}$ denotes each breakpoint, th denotes the ordinal of the breakpoint, and $n_j^{\min} + (th \cdot w_j)$ denotes the position of the th-th breakpoint.

The range of each interval is determined by the breakpoint positions, as follows: $[n_j^{\min},\ n_j^{\min} + w_j)$, $[n_j^{\min} + w_j,\ n_j^{\min} + 2w_j)$, ..., $[n_j^{\min} + (NumP_j \cdot w_j),\ n_j^{\max}]$.
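The equal-width placement of breakpoints in step 4 can be illustrated with the following Python sketch. The function name `equal_width_breakpoints` and the numeric inputs are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def equal_width_breakpoints(lo, hi, num_breakpoints):
    """Return (width, breakpoints) for num_breakpoints + 1 equal-width intervals on [lo, hi]."""
    width = (hi - lo) / (num_breakpoints + 1)
    # The th-th breakpoint sits at lo + th * width, for th = 1 .. num_breakpoints.
    points = [lo + th * width for th in range(1, num_breakpoints + 1)]
    return width, points

# Hypothetical attribute ranging from 0 to 10 with 3 breakpoints.
width, points = equal_width_breakpoints(0.0, 10.0, 3)
```

With 3 breakpoints on the range from 0 to 10, the width is 2.5 and the breakpoints fall at 2.5, 5.0, and 7.5, giving four intervals of which only the last is closed on the right.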

Step 5, discretizing the continuous attribute $n_j$: according to the number of intervals, a different discrete value is assigned to each interval. Traverse all value records of $n_j$ and replace each value with the discrete value assigned to the interval in which it lies.

The above is a preferred embodiment of the present invention and is not to be used to limit the scope of protection of the present invention. Similar variations made by those skilled in the art according to the embodiment all fall within the scope of protection of the present invention.
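Putting the five steps together, one possible end-to-end sketch in Python is shown below. It is an illustration, not the patented implementation: the function name `discretize` is hypothetical, the interval labels 1, 2, ..., k are an assumed choice of discrete values, and the handling of the zero-breakpoint case is an assumption made to avoid dividing by a zero width.

```python
import math
from collections import Counter

def discretize(values):
    """Map each value of one continuous attribute to an interval label 1..NumP+1."""
    total = len(values)
    # Steps 1-3: value probabilities, entropy in bits, floor to get the breakpoint count.
    disorder = sum(-(c / total) * math.log2(c / total)
                   for c in Counter(values).values())
    num_breakpoints = math.floor(disorder)
    if num_breakpoints == 0:
        return [1] * total            # assumed handling: a single interval
    # Step 4: equal-width intervals over [min, max].
    lo, hi = min(values), max(values)
    width = (hi - lo) / (num_breakpoints + 1)
    # Step 5: intervals are half-open except the last, which includes the maximum.
    return [min(int((v - lo) // width) + 1, num_breakpoints + 1) for v in values]

labels = discretize([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
```

Each attribute is processed on its own, without reference to a class label or to other attributes. Values that lie exactly on a breakpoint may be affected by floating-point rounding, so a production version would need to decide how to treat such ties.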

Claims

1. An unsupervised discretization method for continuous attribute data based on information entropy, characterized by comprising the following steps:

Step 1, analysis of attribute values: traverse all value records of any continuous attribute $n_j$ in the data set, count the discrete granularity $|n_j|$ of the attribute, compute the probability of each distinct value, and record the maximum $n_j^{\max}$ and the minimum $n_j^{\min}$;

Step 2, computing the degree of disorder of the attribute: from the formula for information entropy, the formula for the degree of disorder of the values of any continuous attribute $n_j$ in the data set is derived, i.e.,

$$T_j = \sum_{i=1}^{|n_j|} -q_{ji} \log_2 q_{ji} \tag{1}$$

where $T_j$ denotes the degree of disorder of the j-th continuous attribute, $|n_j|$ denotes the discrete granularity of the j-th continuous attribute, and $q_{ji}$ denotes the probability of the i-th value of the attribute; the degree of disorder of the attribute's values is computed according to formula (1);

Step 3, computing the number of breakpoints: the number of breakpoints is obtained by rounding the degree of disorder down, i.e.,

$$NumP_j = \lfloor T_j \rfloor \tag{2}$$

where $NumP_j$ denotes the number of breakpoints of the j-th attribute;

Step 4, determining the breakpoint positions: the width of each interval is computed with the equal-width method, i.e.,

$$w_j = \frac{n_j^{\max} - n_j^{\min}}{NumP_j + 1} \tag{3}$$

where $w_j$ denotes the width of each interval and $n_j^{\max} - n_j^{\min}$ denotes the extent of the value range of the j-th attribute, i.e., of $[n_j^{\min}, n_j^{\max}]$; the number of intervals is $NumP_j + 1$; the position of each breakpoint is determined according to formula (4),

$$DoC_j = \left\{ P^{th}_{n_j^{\min} + (th \cdot w_j)} \;\middle|\; 1 \le th \le NumP_j,\ th \in N^{*} \right\} \tag{4}$$

where $DoC_j$ is the set of breakpoints of the attribute, $P^{th}_{n_j^{\min} + (th \cdot w_j)}$ denotes each breakpoint, th denotes the ordinal of the breakpoint, and $n_j^{\min} + (th \cdot w_j)$ denotes the position of the th-th breakpoint; the range of each interval is determined by the breakpoint positions, as follows: $[n_j^{\min},\ n_j^{\min} + w_j)$, $[n_j^{\min} + w_j,\ n_j^{\min} + 2w_j)$, ..., $[n_j^{\min} + (NumP_j \cdot w_j),\ n_j^{\max}]$;

Step 5, discretizing the continuous attribute $n_j$: traverse all value records of $n_j$ and replace each value with the discrete value assigned to the interval in which it lies.

2. The unsupervised discretization method for continuous attribute data based on information entropy according to claim 1, characterized in that the discrete granularity $|n_j|$ in step 1 is the number of distinct values.

3. The unsupervised discretization method for continuous attribute data based on information entropy according to claim 1, characterized in that, in step 1, the probability $q_{ji}$ of each distinct value is computed as $q_{ji} = n_{ji} / n_{j,\mathrm{total}}$, where $i = 1, 2, \ldots, |n_j|$, $n_{ji}$ is the number of occurrences of the i-th distinct value, and $n_{j,\mathrm{total}}$ is the total number of values; both $n_{ji}$ and $n_{j,\mathrm{total}}$ are counted while traversing the data.

4. The unsupervised discretization method for continuous attribute data based on information entropy according to claim 1, characterized in that, in step 5, each interval is assigned a unique discrete value, and the discrete values of different intervals differ from one another.