CN108073553B - Continuous attribute data unsupervised discretization method based on information entropy - Google Patents
Continuous attribute data unsupervised discretization method based on information entropy
- Publication number: CN108073553B (application CN201711450629.6A)
- Authority: CN (China)
- Prior art keywords: attribute, value, discrete, continuous, information entropy
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
Abstract
The invention relates to the technical field of big-data continuous attribute discretization, and in particular to an unsupervised discretization method for continuous attribute data based on information entropy. The method comprises the following steps. Step 1: traverse all value records of any continuous attribute n_j, count the discrete granularity |n_j| of the attribute and the probabilities q_ji of the different values, and record the maximum value n_j^max and minimum value n_j^min. Step 2: derive, from the information entropy formula, an expression for the value-confusion degree of any continuous attribute n_j, and compute the confusion degree of the attribute from that expression. Step 3: round the confusion degree down to obtain the number of breakpoints. Step 4: compute the width of each interval by the equal-width interval method and determine the position of each breakpoint. Step 5: discretize the continuous attribute n_j. This new way of determining the number of breakpoints adapts better to the original data; the discretization of each attribute does not affect the others, is independent of other attributes, and is computationally efficient.
Description
Technical Field
The invention relates to the technical field of big data continuous attribute discretization, in particular to an information entropy-based continuous attribute data unsupervised discretization method.
Background
Continuous attribute discretization is the process of dividing the value range of a continuous attribute into several intervals, each corresponding to a unique discrete value, and converting the original values into those discrete values. Researchers in China and abroad have proposed a large number of methods for discretizing continuous (numerical) attributes, which can be classified from different angles: top-down vs. bottom-up, supervised vs. unsupervised, global vs. local, static vs. dynamic, single-attribute vs. multi-attribute, and so on. The essence of continuous attribute discretization is to determine the number of discrete values (intervals) and the positions of the breakpoints; from the viewpoint of how the number of discrete values is determined, the main methods are as follows.
First, methods in which the user subjectively specifies the number of discrete values. Typical examples include the equal-width interval method (EWD), the equal-frequency interval method (EFD), clustering-based discretization, and the representative CADD (Class-Attribute Dependent Discretization); all of these require the user to specify the number of discrete values K in advance.
Second, methods that assume a relationship between the number of discrete values K and the number of records f in an interval or the interval width d. Typical examples are the PD (Proportional Discretizer) method and the FIMUS method (Rahman and Islam, 2014). PD assumes that the number of discrete values K equals the number of records per interval f, with K × f = |D| (the total number of records), which is essentially an equal-frequency method; FIMUS assumes that K equals the width d of each interval, with K × d = n_max - n_min (the value range), which is essentially an equal-width method. Both methods avoid user-specified parameters, but they rest on the prior assumption that the product of the per-interval record count (or interval width) and the number of intervals equals the total record count (or the value range), which lacks a theoretical basis.
Third, methods that determine the number of discrete values K from the classification attribute or from relationships among the discrete attributes. Many methods follow this idea; representative ones include CAIM (Class-Attribute Interdependence Maximization), LFD (Low Frequency Discretization), and MDLP (Minimum Description Length Principle). CAIM improves on CADD: it uses the CAIM value as the discretization criterion to maximize the class-attribute interdependence, and generates as few intervals as possible through a heuristic class-attribute correlation criterion. LFD improves on CAIM-like methods, which consider only the relationship between continuous attributes and the classification attribute, and on the majority of methods, which take high-frequency values as candidate breakpoints: when discretizing a continuous attribute it considers the relationships with all attributes (classification attributes, discrete attributes, and already-discretized attributes) and takes low-frequency values as candidate breakpoints. Methods such as CAIM and LFD thus create dependencies on the classification attribute or on other attributes during discretization. The MDLP method is a classical method based on information entropy and minimum description length: it selects breakpoints recursively, tries to minimize the information content of the model, and determines the appropriate number of discrete values by the MDL principle. Many other methods likewise discretize using the idea of information entropy, but most of them use some form of entropy only to decide whether to merge or split intervals.
Among the above methods, those in which the user subjectively specifies the number of discrete values lack adaptability to the original data; the methods that assume conditions lack a theoretical basis; the heuristic methods make the discretization process depend on other attributes; and the methods that employ information entropy do not determine the number of discrete values from the entropy of the continuous attribute itself, and are computationally expensive.
Disclosure of Invention
To solve these problems, the invention provides an information entropy-based unsupervised discretization method for continuous attribute data. It treats the different values of a continuous attribute as discrete events, computes the confusion degree of the attribute's values by means of the information entropy formula, and uses this confusion degree as the basis for determining the number of breakpoints; that is, the information entropy of the values of the continuous attribute serves as the breakpoint criterion.
The specific technical scheme adopted by the invention is as follows.
an information entropy-based continuous attribute data unsupervised discretization method comprises the following steps:
step 1, attribute value analysis: traversing any consecutive attribute n in a datasetjAll value records of (2) and statistics of the discrete granularity | n of the attributejL and calculating the probability of each different value, and recording the maximum value nj maxAnd a minimum value nj min;
Step 2, computing the confusion degree of the attribute: from the information entropy formula, the confusion degree of any continuous attribute n_j in the dataset is

    T_j = -Σ_{i=1}^{|n_j|} q_ji log2(q_ji)    (1)

where T_j is the confusion degree of the j-th continuous attribute, |n_j| is the discrete granularity of the j-th continuous attribute, and q_ji is the probability of the i-th value of the attribute; the value-confusion degree of the attribute is computed according to formula (1);
step 3, computing the number of breakpoints: round the value-confusion degree down to obtain the number of breakpoints, i.e.

    NumP_j = ⌊T_j⌋    (2)

where NumP_j is the number of breakpoints of the j-th attribute;
step 4, determining the breakpoint positions: the width of each interval is computed by the equal-width interval method, i.e.

    w_j = (n_j^max - n_j^min) / (NumP_j + 1)    (3)

where w_j is the width of each interval and n_j^max - n_j^min is the value range of the j-th attribute; the number of intervals is NumP_j + 1. The position of each breakpoint is determined according to formula (4):

    DoC_j = { p_th^j | p_th^j = n_j^min + th·w_j, th = 1, 2, …, NumP_j }    (4)

where DoC_j is the set of breakpoints of the attribute, p_th^j denotes each breakpoint, th is the breakpoint index, and n_j^min + th·w_j is the position of the th-th breakpoint. The ranges of the intervals determined by the breakpoint positions are: [n_j^min, n_j^min + w_j), [n_j^min + w_j, n_j^min + 2·w_j), …, [n_j^min + NumP_j·w_j, n_j^max];
Step 5, discretizing the continuous attribute n_j: traverse all value records of n_j and, according to the interval in which each value lies, replace it with the discrete value assigned to that interval.
Further, the discrete granularity |n_j| in step 1 is the number of distinct values.
Further, in step 1 the probability q_ji of each distinct value is computed as q_ji = n_ji / n_j^Total, where i = 1, 2, …, |n_j|, n_ji is the number of occurrences of the i-th distinct value, and n_j^Total is the total number of values; n_ji and n_j^Total are counted while traversing the data.
Further, in step 5 each interval is assigned a unique discrete value, and the discrete values of different intervals differ.
According to information entropy theory, the different values of a continuous attribute are treated as discrete events, the confusion degree of the values is computed, and this confusion degree is used as the basis for determining the number of breakpoints. This new way of determining the number of breakpoints adapts better to the original data; the discretization of each attribute does not affect the others and is independent of the other attributes, and the method is computationally efficient.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
First, the terminology and discretization process involved in the present invention will be briefly described to facilitate understanding of the method of the present invention.
Discrete granularity: for a given data set, the number of distinct values of any attribute is called the discrete granularity of that attribute. The discrete granularity of a discrete attribute is denoted |c|, and that of a continuous attribute is denoted |n|.
Information entropy is a measure of the uncertainty of information, defined over the occurrence probabilities of discrete random events. Similarly, each distinct value of a continuous attribute may be understood as a discrete random event, with the discrete granularity of the attribute equal to the number of such events. By computing the information entropy over the different values of a continuous attribute, its uncertainty, i.e., the confusion degree of its values, can be characterized, and the number of discrete breakpoints of the attribute can be determined from this confusion degree.
The information entropy is computed as:

    E = -Σ_{i=1}^{m} p_i log2(p_i)

where m is the number of distinct values of the source symbol and p_i is the probability of occurrence of the i-th value.
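This formula can be computed directly from a list of value records. The sketch below is illustrative only (the helper name `information_entropy` is ours, not the patent's); it treats each distinct value as a source symbol:

```python
from collections import Counter
from math import log2

def information_entropy(values):
    """E = -sum(p_i * log2(p_i)) over the m distinct values in `values` (bits)."""
    counts = Counter(values)          # occurrences of each distinct value
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

For example, four records with two equally frequent values yield an entropy of 1 bit, and sixteen equiprobable values yield 4 bits.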
Discretizing any continuous attribute of the data set means dividing its value range [n_min, n_max] (where n_min is the minimum value of attribute n and n_max its maximum) into k intervals; the position of each division is called a breakpoint. With the value range divided into k intervals, each interval is represented by a discrete value 1, 2, …, k; p denotes a breakpoint, and there are k - 1 of them. Discretization of a continuous attribute is thus a problem of determining intervals and breakpoints, and can be expressed as DoC = {K, P}, where K = {1, 2, …, k} is the set of intervals and P = {p_1, p_2, …, p_{k-1}} is the set of breakpoints.
The total number of intervals equals the number of breakpoints plus 1, i.e., |K| = |P| + 1, so the essence of continuous attribute discretization can be understood as determining the number of breakpoints and the position of each breakpoint, i.e., DoC = { p_th^s }, where th denotes the breakpoint index and s denotes the breakpoint position.
Next, the steps of the method of the present invention will be described in detail.
An information entropy-based continuous attribute data unsupervised discretization method comprises the following steps:
Step 1, attribute value analysis: traverse all value records of any continuous attribute n_j in the dataset, counting the total number of values n_j^Total, the discrete granularity |n_j| (the number of distinct values), and the number of occurrences n_ji of each distinct value; compute the probability of each value, q_ji = n_ji / n_j^Total, i = 1, 2, …, |n_j|; and record the maximum value n_j^max and minimum value n_j^min.
Step 2, computing the confusion degree of the attribute: from the information entropy formula, the confusion degree of any continuous attribute n_j in the dataset is

    T_j = -Σ_{i=1}^{|n_j|} q_ji log2(q_ji)    (1)

where T_j is the confusion degree of the j-th continuous attribute, |n_j| is the discrete granularity of the j-th continuous attribute (i.e., the number of distinct values), and q_ji is the probability of the i-th value of the attribute; the value-confusion degree of the attribute is computed according to formula (1).
For example, when the discrete granularity of a continuous attribute is 1, its information entropy is minimal, with value 0, meaning 0 breakpoints are inserted, i.e., all values fall in the same interval; when the discrete granularity is D and the values are equiprobable, the information entropy is maximal, with value log2(D), meaning ⌊log2(D)⌋ breakpoints are inserted and the values are divided into ⌊log2(D)⌋ + 1 intervals.
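These two boundary cases can be checked numerically. The sketch below is an illustration under our own naming (`confusion_degree` is a hypothetical helper, not defined in the patent):

```python
from collections import Counter
from math import floor, log2

def confusion_degree(values):
    # Formula (1): T_j = -sum(q_ji * log2(q_ji)) over the distinct values.
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Discrete granularity 1: entropy 0, so floor(0) = 0 breakpoints, one interval.
single = confusion_degree([3.14] * 100)

# Discrete granularity D = 8, equiprobable: entropy log2(8) = 3,
# so 3 breakpoints and 4 intervals.
uniform = confusion_degree(list(range(8)))
```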
Step 3, computing the number of breakpoints: in most cases the result of formula (1) is not an integer, whereas the number of breakpoints must be. To keep the number of intervals, i.e., discrete values, small, the value-confusion degree is rounded down to obtain the number of breakpoints, i.e.

    NumP_j = ⌊T_j⌋    (2)

where NumP_j is the number of breakpoints of the j-th attribute.
Step 4, determining the breakpoint positions: for simplicity, ease of understanding, and low computational cost, the equal-width interval method is chosen to compute the width of each interval, i.e.

    w_j = (n_j^max - n_j^min) / (NumP_j + 1)    (3)

where w_j is the width of each interval and n_j^max - n_j^min is the value range of the j-th attribute; the total number of intervals is NumP_j + 1. The position of each breakpoint is determined according to formula (4):

    DoC_j = { p_th^j | p_th^j = n_j^min + th·w_j, th = 1, 2, …, NumP_j }    (4)

where DoC_j is the set of breakpoints of the attribute, p_th^j denotes each breakpoint, th is the breakpoint index, and n_j^min + th·w_j is the position of the th-th breakpoint.
The ranges of the intervals determined by the breakpoint positions are: [n_j^min, n_j^min + w_j), [n_j^min + w_j, n_j^min + 2·w_j), …, [n_j^min + NumP_j·w_j, n_j^max].
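Formulas (3) and (4) amount to an equal-width cut of the value range. A minimal sketch (the function name `breakpoints` is ours, for illustration only):

```python
def breakpoints(n_min, n_max, num_p):
    """Equal-width breakpoint positions n_min + th * w_j for th = 1..NumP_j."""
    w = (n_max - n_min) / (num_p + 1)          # formula (3)
    return [n_min + th * w for th in range(1, num_p + 1)]   # formula (4)
```

For instance, with n_j^min = 0, n_j^max = 10, and NumP_j = 4, the width is 2 and the breakpoints are 2, 4, 6, 8, giving five intervals.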
Step 5, discretizing the continuous attribute n_j: assign a different discrete value to each of the divided intervals; then traverse all value records of n_j and, according to the interval in which each value lies, replace it with the discrete value assigned to that interval.
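Putting steps 1 to 5 together, the whole procedure can be sketched as follows. This is an illustrative Python sketch under our own naming; the patent itself prescribes no particular implementation:

```python
from collections import Counter
from math import floor, log2

def discretize(values):
    """Steps 1-5: entropy -> breakpoint count -> equal-width cuts -> discrete codes."""
    # Step 1: value probabilities, minimum, maximum.
    counts = Counter(values)
    total = len(values)
    n_min, n_max = min(values), max(values)
    # Step 2: confusion degree T_j, formula (1).
    t = -sum((c / total) * log2(c / total) for c in counts.values())
    # Step 3: number of breakpoints NumP_j = floor(T_j), formula (2).
    num_p = floor(t)
    if num_p == 0 or n_max == n_min:
        return [1] * total                     # a single interval
    # Step 4: equal-width interval width and breakpoint positions, formulas (3)-(4).
    w = (n_max - n_min) / (num_p + 1)
    cuts = [n_min + th * w for th in range(1, num_p + 1)]
    # Step 5: replace each value with the discrete value of its interval (1-based).
    def code(v):
        for k, cut in enumerate(cuts, start=1):
            if v < cut:
                return k
        return num_p + 1                       # last interval is closed on the right
    return [code(v) for v in values]
```

Values falling in the last interval [n_j^min + NumP_j·w_j, n_j^max] receive the discrete value NumP_j + 1, matching the closed right endpoint of the final interval above.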
The above are preferred embodiments of the present invention and should not be used to limit the scope of the present invention. Similar modifications of the embodiments will occur to those skilled in the art and are intended to be included within the scope of the present invention.
Claims (4)
1. An information entropy-based continuous attribute data unsupervised discretization method, characterized by comprising the following steps:
step 1, attribute value analysis: traverse all value records of any continuous attribute n_j in the dataset, count the discrete granularity |n_j| of the attribute, compute the probability of each distinct value, and record the maximum value n_j^max and minimum value n_j^min among all values of the continuous attribute n_j;
step 2, calculating the confusion degree of the attribute: from the information entropy formula, the confusion degree of any continuous attribute n_j in the dataset is

    T_j = -Σ_{i=1}^{|n_j|} q_ji log2(q_ji)    (1)

where T_j is the confusion degree of the j-th continuous attribute, |n_j| is the discrete granularity of the j-th continuous attribute, and q_ji is the probability of the i-th value of the attribute; the value-confusion degree of the attribute is calculated according to formula (1);
step 3, calculating the number of breakpoints: the value-confusion degree is rounded down to obtain the number of breakpoints, i.e.

    NumP_j = ⌊T_j⌋    (2)

where NumP_j is the number of breakpoints of the j-th attribute;
step 4, determining the breakpoint positions: the width of each interval is computed by the equal-width interval method, i.e.

    w_j = (n_j^max - n_j^min) / (NumP_j + 1)    (3)

where w_j is the width of each interval and n_j^max - n_j^min is the value range of the j-th attribute; the number of intervals is NumP_j + 1; the position of each breakpoint is determined according to formula (4):

    DoC_j = { p_th^j | p_th^j = n_j^min + th·w_j, th = 1, 2, …, NumP_j }    (4)

where DoC_j is the set of breakpoints of the attribute, p_th^j denotes each breakpoint, th is the breakpoint index with th ∈ N* (the positive integers), and n_j^min + th·w_j is the position of the th-th breakpoint; the ranges of the intervals determined by the breakpoint positions are: [n_j^min, n_j^min + w_j), [n_j^min + w_j, n_j^min + 2·w_j), …, [n_j^min + NumP_j·w_j, n_j^max];
step 5, discretizing the continuous attribute n_j: traverse all value records of n_j and replace each value with the discrete value assigned to the interval in which it lies.
2. The information entropy-based continuous attribute data unsupervised discretization method according to claim 1, wherein the discrete granularity |n_j| in step 1 is the number of distinct values.
3. The information entropy-based continuous attribute data unsupervised discretization method according to claim 1, wherein in step 1 the probability q_ji of each distinct value is computed as q_ji = n_ji / n_j^Total, where i = 1, 2, …, |n_j|, n_ji is the number of occurrences of the i-th distinct value, and n_j^Total is the total number of values; n_ji and n_j^Total are counted while traversing the data.
4. The information entropy-based continuous attribute data unsupervised discretization method according to claim 1, wherein in step 5 each interval is assigned a unique discrete value, and the discrete values of the intervals are pairwise different.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711450629.6A CN108073553B (en) | 2017-12-27 | 2017-12-27 | Continuous attribute data unsupervised discretization method based on information entropy |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108073553A CN108073553A (en) | 2018-05-25 |
CN108073553B true CN108073553B (en) | 2021-02-12 |
Family
ID=62155501
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711450629.6A Expired - Fee Related CN108073553B (en) | 2017-12-27 | 2017-12-27 | Continuous attribute data unsupervised discretization method based on information entropy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108073553B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109447163B (en) * | 2018-11-01 | 2022-03-22 | 中南大学 | Radar signal data-oriented moving object detection method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6026399A (en) * | 1997-05-30 | 2000-02-15 | Silicon Graphics, Inc. | System and method for selection of important attributes |
CN101777039A (en) * | 2009-11-20 | 2010-07-14 | 大连理工大学 | Continuous attribute discretization method based on Chi2 statistics |
CN102096672A (en) * | 2009-12-09 | 2011-06-15 | 西安邮电学院 | Method for extracting classification rule based on fuzzy-rough model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170344890A1 (en) * | 2016-05-26 | 2017-11-30 | Arun Kumar Parayatham | Distributed algorithm to find reliable, significant and relevant patterns in large data sets |
- 2017-12-27: CN application CN201711450629.6A filed; granted as CN108073553B, now not active (Expired - Fee Related)
Non-Patent Citations (2)
Title |
---|
Ma Shengjun, "Research on anomaly detection for targeted poverty alleviation data in the cloud environment," China Master's Theses Full-text Database, Information Science and Technology, No. 06, 2019-06-15, pp. 19-27 * |
Zhang Yusha, "A survey of continuous attribute discretization algorithms," Computer Applications and Software, Vol. 31, No. 8, August 2014, pp. 6-8, 140 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Vreeken et al. | Krimp: mining itemsets that compress | |
Shin et al. | Sweg: Lossless and lossy summarization of web-scale graphs | |
Chen et al. | Mining frequent patterns in a varying-size sliding window of online transactional data streams | |
JP6503679B2 (en) | Filter rule creation device, filter rule creation method, and program | |
WO2013086931A1 (en) | Event mining in social networks | |
Kim et al. | Mining high utility itemsets based on the time decaying model | |
US20220114181A1 (en) | Fingerprints for compressed columnar data search | |
CN108073553B (en) | Continuous attribute data unsupervised discretization method based on information entropy | |
Yun et al. | Efficient mining of robust closed weighted sequential patterns without information loss | |
Sukhija et al. | Topic modeling and visualization for big data in social sciences | |
Wang et al. | Negative sequence analysis: A review | |
CN110928925A (en) | Frequent item set mining method and device, storage medium and electronic equipment | |
Gajawada et al. | Projected clustering using particle swarm optimization | |
Oreški et al. | Comparison of feature selection techniques in knowledge discovery process | |
Li et al. | An analysis of network performance degradation induced by workload fluctuations | |
Vahdat et al. | On the application of GP to streaming data classification tasks with label budgets | |
CN111723089A (en) | Method and device for processing data based on columnar storage format | |
Li et al. | A single-scan algorithm for mining sequential patterns from data streams | |
Goyal et al. | AnyFI: An anytime frequent itemset mining algorithm for data streams | |
Sorgente et al. | The reaction of a network: Exploring the relationship between the bitcoin network structure and the bitcoin price | |
Manike et al. | Modified GUIDE (LM) algorithm for mining maximal high utility patterns from data streams | |
Cao et al. | An algorithm for outlier detection on uncertain data stream | |
Ham et al. | MBiS: an efficient method for mining frequent weighted utility itemsets from quantitative databases | |
Ikonomovska et al. | Algorithmic techniques for processing data streams | |
Qin et al. | An improved genetic clustering algorithm for categorical data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210212; Termination date: 20211227 |