CN108073553B - Continuous attribute data unsupervised discretization method based on information entropy - Google Patents

Continuous attribute data unsupervised discretization method based on information entropy

Info

Publication number
CN108073553B
CN108073553B CN201711450629.6A CN201711450629A
Authority
CN
China
Prior art keywords
attribute
value
discrete
continuous
information entropy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201711450629.6A
Other languages
Chinese (zh)
Other versions
CN108073553A (en
Inventor
马生俊
陈旺虎
郭宏乐
乔保民
李新田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest Normal University filed Critical Northwest Normal University
Priority to CN201711450629.6A priority Critical patent/CN108073553B/en
Publication of CN108073553A publication Critical patent/CN108073553A/en
Application granted granted Critical
Publication of CN108073553B publication Critical patent/CN108073553B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Algebra (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of big data continuous attribute discretization, in particular to an information entropy-based unsupervised discretization method for continuous attribute data. The method comprises the following steps: Step 1, traverse all value records of any continuous attribute n_j, count the discrete granularity |n_j| and the probabilities q_ji of the different values, and record the maximum value n_j^max and the minimum value n_j^min; Step 2, obtain from the information entropy formula an expression for the value confusion degree of any continuous attribute n_j, and compute the confusion degree of the attribute from it; Step 3, round the confusion degree down to obtain the number of breakpoints; Step 4, compute the width of each divided interval by the equal-width interval method and determine the position of each breakpoint; Step 5, discretize the continuous attribute n_j. This new way of determining the number of breakpoints adapts better to the original data; the discretization of each attribute does not affect the others, is independent of the other attributes, and has higher computational efficiency.

Description

Continuous attribute data unsupervised discretization method based on information entropy
Technical Field
The invention relates to the technical field of big data continuous attribute discretization, in particular to an information entropy-based continuous attribute data unsupervised discretization method.
Background
Continuous attribute discretization is the process of dividing the value range of a continuous attribute into several intervals, each corresponding to a unique discrete value, and converting the original values into those discrete values. Researchers in China and abroad have proposed a great number of methods for discretizing continuous (numerical) attributes, which can be classified from different angles: top-down versus bottom-up, supervised versus unsupervised, global versus local, static versus dynamic, single-attribute versus multi-attribute, and so on. The essence of continuous attribute discretization is to determine the number of discrete values (intervals) and the positions of the breakpoints; from the viewpoint of how the number of discrete values is determined, the main methods are the following.
First, methods in which the user subjectively specifies the number of discrete values. Typical examples include the equal-width interval method (EWD), the equal-frequency interval method (EFD), clustering-based discretization, and the representative CADD (Class-Attribute Dependent Discretizer); all of these require the user to specify the number of discrete values K in advance.
Second, methods that assume a relationship between the number of discrete values K and either the number of records f in an interval or the interval width d. Typical examples are the PD (Proportional Discretizer) method and the FIMUS method (Rahman and Islam, 2014). PD assumes that the number of discrete values K equals the number of records per interval f, with K · f = |D| (the total number of records), which is essentially an equal-frequency interval method; FIMUS assumes that K equals the width d of each interval, with K · d = max − min (the width of the value range), which is essentially an equal-width interval method. These two methods avoid requiring the user to input parameters, but the condition must be assumed in advance (by default, the product of the number of records in a single interval and the number of intervals equals the total number of records or the value range), and this assumption lacks theoretical basis.
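To make the assumed relationships concrete, here is a small arithmetic sketch (not from the patent; the dataset size and value range are illustrative numbers): under PD's assumption K = f and K · f = |D|, K is the square root of the record count; under FIMUS's assumption K = d and K · d = max − min, K is the square root of the value range.

```python
import math

# PD: K == f and K * f == |D|, so K = sqrt(|D|).
records = 10_000                 # illustrative dataset size
k_pd = math.isqrt(records)
print(k_pd)                      # → 100

# FIMUS: K == d and K * d == max - min, so K = sqrt(max - min).
value_range = 64.0               # illustrative attribute range
k_fimus = math.sqrt(value_range)
print(k_fimus)                   # → 8.0
```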
Third, methods that determine the number of discrete values K from the classification attribute or from relationships among the discrete attributes. Many methods follow this idea; representative ones include CAIM (Class-Attribute Interdependence Maximization), LFD (Low Frequency Discretizer), and MDLP (Minimum Description Length Principle). CAIM improves on CADD: it uses the CAIM value as the discretization criterion to maximize the interdependence between classes and attributes, and generates as few intervals as possible through a heuristic class-attribute interdependence standard. LFD improves on CAIM-like methods, which consider only the relationship between the continuous attribute and the classification attribute, and on the majority of methods, which use high-frequency values as candidate breakpoints: when discretizing a continuous attribute it considers the relationships with all attributes (classification attributes, discrete attributes, and already-discretized attributes) and uses low-frequency values as candidate breakpoints. Methods such as CAIM and LFD thus create dependencies on the classification attribute or on other attributes during discretization. The MDLP method is a classical method based on information entropy and minimum description length: it selects breakpoints recursively, tries to minimize the information content of the model, and determines an appropriate number of discrete values by the MDL principle. Many other methods likewise use the idea of information entropy for discretization, but most of them decide whether to merge or split intervals according to some form of entropy.
Among the above methods, those in which the user subjectively specifies the number of discrete values lack adaptability to the original data; those that assume conditions lack theoretical basis; the heuristic methods make the discretization process depend on other attributes; and the methods that use information entropy do not determine the number of discrete values from the entropy of the continuous attribute itself, and their computational cost is high.
Disclosure of Invention
To solve these problems, the invention provides an information entropy-based unsupervised discretization method for continuous attribute data, which treats the different values of a continuous attribute as discrete events, calculates the confusion degree of the attribute's values by computing their information entropy, and uses the confusion degree as the basis for determining the number of breakpoints; that is, the information entropy of the values of the continuous attribute serves as the breakpoint basis.
The invention adopts the specific technical scheme that:
an information entropy-based continuous attribute data unsupervised discretization method comprises the following steps:
Step 1, attribute value analysis: traverse all value records of any continuous attribute n_j in the dataset, count the discrete granularity |n_j|, calculate the probability of each distinct value, and record the maximum value n_j^max and the minimum value n_j^min.
Step 2, calculate the confusion degree of the attribute: from the information entropy formula, obtain the expression for the confusion degree of any continuous attribute n_j in the dataset, i.e.

T_j = −∑_{i=1}^{|n_j|} q_ji · log₂ q_ji    (1)

where T_j denotes the confusion degree of the j-th continuous attribute, |n_j| denotes the discrete granularity of the j-th continuous attribute, and q_ji denotes the probability of the i-th value of the attribute; calculate the value confusion degree of the attribute according to formula (1).
Step 3, calculate the number of breakpoints: round the value confusion degree down to obtain the number of breakpoints, i.e.

NumP_j = ⌊T_j⌋    (2)

where NumP_j denotes the number of breakpoints of the j-th attribute.
Step 4, determine the breakpoint positions: calculate the width of each divided interval by the equal-width interval method, i.e.

w_j = (n_j^max − n_j^min) / (NumP_j + 1)    (3)

where w_j denotes the width of each interval and [n_j^min, n_j^max] denotes the value range of the j-th attribute; the number of divided intervals is NumP_j + 1. The position of each breakpoint is determined according to formula (4):

DoC_j = { p_j^th | p_j^th = n_j^min + th · w_j, th = 1, 2, …, NumP_j }    (4)

where DoC_j is the set of breakpoints of the attribute, p_j^th denotes a breakpoint, th denotes the breakpoint index, and n_j^min + th · w_j is the position of the th-th breakpoint. Divide the intervals according to the breakpoint positions; the ranges of the intervals are: [n_j^min, n_j^min + w_j), [n_j^min + w_j, n_j^min + 2w_j), …, [n_j^min + NumP_j · w_j, n_j^max].
Step 5, for continuous attribute njCarrying out discretization: traverse the continuous attribute njAnd replacing all the value records with discrete values given to the intervals according to the intervals where each value is located.
Further, the discrete granularity |n_j| in step 1 is the number of distinct values.
Further, in step 1, the probability q_ji of each distinct value is calculated as q_ji = n_ji / n_j^Total, where i = 1, 2, …, |n_j|, n_ji is the number of occurrences of each distinct value, and n_j^Total is the total number of values; n_ji and n_j^Total are counted while traversing the data.
Further, in step 5, each interval is assigned a unique discrete value, and the discrete values of different intervals differ.
According to information entropy theory, the different values of a continuous attribute are regarded as discrete events, the confusion degree of the values is calculated, and the confusion degree is used as the basis for determining the number of breakpoints. This new way of determining the number of breakpoints adapts better to the original data; the discretization of each attribute does not affect the others, is independent of the other attributes, and has higher computational efficiency.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
First, the terminology and discretization process involved in the present invention will be briefly described to facilitate understanding of the method of the present invention.
Discrete particle size: for a given data set, the number of different values of any attribute is referred to as the discrete granularity of the attribute. The discrete granularity of a discrete attribute is denoted as | c |, and the discrete granularity of a continuous attribute is denoted as | n |.
Information entropy is a measure of the uncertainty of information, defined in terms of the probabilities of occurrence of discrete random events. Similarly, each distinct value of a continuous attribute may be understood as a discrete random event, with the discrete granularity of the attribute equal to the number of such events. By calculating the information entropy of the distinct values of a continuous attribute, its uncertainty, i.e. the confusion degree of its values, can be characterized, and the number of discretization breakpoints of the attribute can be determined from that confusion degree.
The information entropy is calculated as:

H = −∑_{i=1}^{m} p_i · log₂ p_i

where m is the number of distinct values of the source symbol and p_i denotes the probability of occurrence of the i-th value.
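As an illustration (not part of the patent text), the entropy formula can be computed directly from a list of probabilities; the function name `shannon_entropy` is an assumption of this sketch:

```python
import math

def shannon_entropy(probabilities):
    """H = -sum(p_i * log2(p_i)); zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A source with four equally likely symbols carries log2(4) = 2 bits of entropy.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # → 2.0
```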
Discretizing any continuous attribute of the dataset means dividing its value range [n^min, n^max] (where n^min denotes the minimum value of the attribute n and n^max its maximum) into k intervals; each division position is called a breakpoint. After the value range of the continuous attribute is divided into k intervals, each interval is represented by one of the discrete values 1, 2, …, k; p denotes a breakpoint, of which there are k − 1 in total. Discretization of a continuous attribute is therefore the problem of determining intervals and breakpoints, expressed as DoC = {K, P}, where K is the set of intervals, K = {1, 2, …, k}, and P is the set of breakpoints, P = {p_1, p_2, …, p_{k−1}}.

Since the total number of intervals equals the number of breakpoints plus one, i.e. |K| = |P| + 1, the essence of continuous attribute discretization can be understood as the problem of determining the number of breakpoints and the position of each breakpoint, i.e.

DoC = { th, p_s | s = 1, 2, …, th }

where th denotes the number of breakpoints and s indexes the breakpoint positions.
Next, the steps of the method of the present invention will be described in detail.
An information entropy-based continuous attribute data unsupervised discretization method comprises the following steps:
Step 1, attribute value analysis: traverse all value records of any continuous attribute n_j in the dataset, counting the total number of values n_j^Total, the discrete granularity |n_j| (the number of distinct values), and the number of occurrences n_ji of each distinct value; calculate the probability of each value, q_ji = n_ji / n_j^Total, i = 1, 2, …, |n_j|; record the maximum value n_j^max and the minimum value n_j^min.
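Step 1 can be carried out in a single pass over the attribute's values; the following Python sketch (function and variable names are illustrative, not from the patent) gathers all the quantities the step requires:

```python
from collections import Counter

def analyze_attribute(values):
    """Single pass: occurrence counts, granularity, probabilities, and value range."""
    counts = Counter(values)                            # n_ji for each distinct value
    total = len(values)                                 # n_j^Total
    granularity = len(counts)                           # |n_j|: number of distinct values
    probs = {v: c / total for v, c in counts.items()}   # q_ji = n_ji / n_j^Total
    return granularity, probs, min(values), max(values)

gran, probs, lo, hi = analyze_attribute([1.2, 3.4, 1.2, 5.0])
print(gran, lo, hi)  # → 3 1.2 5.0
```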
Step 2, calculating the chaos degree of the attributes: obtaining any continuous attribute n in the data set according to a calculation formula of information entropyjA formula for calculating the degree of confusion, i.e.
Figure BDA0001528420170000062
Wherein, TjRepresents the degree of disorder, | n, of the jth consecutive attributejI represents the discrete granularity of the jth continuous attribute, namely the number of different values, qjiRepresenting the probability of the ith value of the attribute; and calculating the value confusion degree of the attribute according to the formula (1).
For example, when the discrete granularity of a continuous attribute is 1, its information entropy is minimal, with value 0, meaning that 0 breakpoints are inserted, i.e. all values lie in the same interval. When the discrete granularity of a continuous attribute is D, its information entropy is maximal, with value log₂ D, meaning that ⌊log₂ D⌋ breakpoints are inserted, i.e. all values are divided among ⌊log₂ D⌋ + 1 intervals.
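The two boundary cases above can be checked numerically; this is an illustrative verification, not part of the patent text:

```python
import math

def confusion_degree(probs):
    """T = -sum(q_i * log2(q_i)) over the distinct-value probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Granularity 1: a single value has entropy 0, so floor(T) = 0 breakpoints.
print(math.floor(confusion_degree([1.0])))          # → 0

# Granularity D with a uniform distribution: entropy log2(D), so
# floor(log2(D)) breakpoints. For D = 10, log2(10) ≈ 3.32.
D = 10
print(math.floor(confusion_degree([1.0 / D] * D)))  # → 3
```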
Step 3, calculating the number of the fault points: in most cases, the result obtained by equation (1) is not an integer, but the number of breakpoints must be an integer. In order to reduce the number of the divided regions, i.e. discrete values, the number of the broken points is obtained by rounding the disordered degree of the values downwards, i.e.
Figure BDA0001528420170000074
Wherein, NumPjThe number of broken points representing the jth attribute.
Step 4, determine the breakpoint positions: for simplicity, ease of understanding, and low computational cost, the equal-width interval method is chosen to calculate the width of each divided interval, i.e.

w_j = (n_j^max − n_j^min) / (NumP_j + 1)    (3)

where w_j denotes the width of each interval and [n_j^min, n_j^max] denotes the value range of the j-th attribute; the total number of divided intervals is NumP_j + 1. The position of each breakpoint is determined according to formula (4):

DoC_j = { p_j^th | p_j^th = n_j^min + th · w_j, th = 1, 2, …, NumP_j }    (4)

where DoC_j is the set of breakpoints of the attribute, p_j^th denotes a breakpoint, th denotes the breakpoint index, and n_j^min + th · w_j is the position of the th-th breakpoint.

Divide the intervals according to the breakpoint positions; the ranges of the intervals are: [n_j^min, n_j^min + w_j), [n_j^min + w_j, n_j^min + 2w_j), …, [n_j^min + NumP_j · w_j, n_j^max].
Step 5, discretize the continuous attribute n_j: assign a different discrete value to each of the divided intervals; then traverse all value records of n_j and, according to the interval in which each value lies, replace it with the discrete value assigned to that interval.
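Steps 1 to 5 can be sketched end to end as follows (an illustrative Python implementation under the assumptions above; names such as `discretize` and the choice of 0-based interval labels are not from the patent):

```python
import math
from collections import Counter

def discretize(values):
    """Entropy-based unsupervised discretization of one continuous attribute.

    Step 1: distinct-value probabilities; step 2: confusion degree T (formula 1);
    step 3: NumP = floor(T) (formula 2); step 4: equal-width breakpoints
    (formulas 3 and 4); step 5: map each value to its interval's discrete value.
    """
    counts = Counter(values)
    total = len(values)
    t = -sum((c / total) * math.log2(c / total) for c in counts.values())
    num_p = math.floor(t)
    lo, hi = min(values), max(values)
    width = (hi - lo) / (num_p + 1)
    breakpoints = [lo + th * width for th in range(1, num_p + 1)]

    def label(v):
        # Left-closed, right-open intervals; the last interval also includes hi.
        if width == 0:
            return 0
        return min(int((v - lo) / width), num_p)

    return [label(v) for v in values], breakpoints

# Eight distinct uniform values: T = log2(8) = 3, so 3 breakpoints, 4 intervals.
labels, bps = discretize([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
print(bps)     # → [1.75, 3.5, 5.25]
print(labels)  # → [0, 0, 1, 1, 2, 2, 3, 3]
```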
The above are preferred embodiments of the present invention and should not be taken to limit its scope. Modifications similar to these embodiments will occur to those skilled in the art and are intended to fall within the scope of the present invention.

Claims (4)

1. An information entropy-based continuous attribute data unsupervised discretization method, characterized by comprising the following steps:
Step 1, attribute value analysis: traverse all value records of any continuous attribute n_j in the dataset, count the discrete granularity |n_j|, calculate the probability of each distinct value, and record the maximum value n_j^max and the minimum value n_j^min among all values of the continuous attribute n_j;
Step 2, calculate the confusion degree of the attribute: from the information entropy formula, obtain the expression for the confusion degree of any continuous attribute n_j in the dataset, i.e.

T_j = −∑_{i=1}^{|n_j|} q_ji · log₂ q_ji    (1)

where T_j denotes the confusion degree of the j-th continuous attribute, |n_j| denotes the discrete granularity of the j-th continuous attribute, and q_ji denotes the probability of the i-th value of the attribute; calculate the value confusion degree of the attribute according to formula (1);
Step 3, calculate the number of breakpoints: round the value confusion degree down to obtain the number of breakpoints, i.e.

NumP_j = ⌊T_j⌋    (2)

where NumP_j denotes the number of breakpoints of the j-th attribute;
Step 4, determine the breakpoint positions: calculate the width of each divided interval by the equal-width interval method, i.e.

w_j = (n_j^max − n_j^min) / (NumP_j + 1)    (3)

where w_j denotes the width of each interval and [n_j^min, n_j^max] denotes the value range of the j-th attribute; the number of divided intervals is NumP_j + 1; the position of each breakpoint is determined according to formula (4):

DoC_j = { p_j^th | p_j^th = n_j^min + th · w_j, th ∈ N*, th ≤ NumP_j }    (4)

where DoC_j is the set of breakpoints of the attribute, p_j^th denotes a breakpoint, th denotes the breakpoint index, n_j^min + th · w_j is the position of the th-th breakpoint, and N* denotes the positive integers; divide the intervals according to the breakpoint positions, the ranges of the intervals being: [n_j^min, n_j^min + w_j), [n_j^min + w_j, n_j^min + 2w_j), …, [n_j^min + NumP_j · w_j, n_j^max];
Step 5, discretize the continuous attribute n_j: traverse all value records of n_j and, according to the interval in which each value lies, replace it with the discrete value assigned to that interval.
2. The information entropy-based continuous attribute data unsupervised discretization method according to claim 1, characterized in that: the discrete granularity |n_j| in step 1 is the number of distinct values.
3. The information entropy-based continuous attribute data unsupervised discretization method according to claim 1, characterized in that: in step 1, the probability q_ji of each distinct value is calculated as q_ji = n_ji / n_j^Total, where i = 1, 2, …, |n_j|, n_ji is the number of occurrences of each distinct value, and n_j^Total is the total number of values; n_ji and n_j^Total are counted while traversing the data.
4. The information entropy-based continuous attribute data unsupervised discretization method according to claim 1, characterized in that: in step 5, each interval is assigned a unique discrete value, and the discrete values of different intervals differ.
CN201711450629.6A 2017-12-27 2017-12-27 Continuous attribute data unsupervised discretization method based on information entropy Expired - Fee Related CN108073553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711450629.6A CN108073553B (en) 2017-12-27 2017-12-27 Continuous attribute data unsupervised discretization method based on information entropy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711450629.6A CN108073553B (en) 2017-12-27 2017-12-27 Continuous attribute data unsupervised discretization method based on information entropy

Publications (2)

Publication Number Publication Date
CN108073553A CN108073553A (en) 2018-05-25
CN108073553B true CN108073553B (en) 2021-02-12

Family

ID=62155501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711450629.6A Expired - Fee Related CN108073553B (en) 2017-12-27 2017-12-27 Continuous attribute data unsupervised discretization method based on information entropy

Country Status (1)

Country Link
CN (1) CN108073553B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447163B (en) * 2018-11-01 2022-03-22 中南大学 Radar signal data-oriented moving object detection method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6026399A (en) * 1997-05-30 2000-02-15 Silicon Graphics, Inc. System and method for selection of important attributes
CN101777039A (en) * 2009-11-20 2010-07-14 大连理工大学 Continuous attribute discretization method based on Chi2 statistics
CN102096672A (en) * 2009-12-09 2011-06-15 西安邮电学院 Method for extracting classification rule based on fuzzy-rough model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170344890A1 (en) * 2016-05-26 2017-11-30 Arun Kumar Parayatham Distributed algorithm to find reliable, significant and relevant patterns in large data sets

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6026399A (en) * 1997-05-30 2000-02-15 Silicon Graphics, Inc. System and method for selection of important attributes
CN101777039A (en) * 2009-11-20 2010-07-14 大连理工大学 Continuous attribute discretization method based on Chi2 statistics
CN102096672A (en) * 2009-12-09 2011-06-15 西安邮电学院 Method for extracting classification rule based on fuzzy-rough model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on anomaly detection for targeted poverty alleviation data in a cloud environment; 马生俊; China Masters' Theses Full-text Database, Information Science and Technology; 2019-06-15 (No. 06); pp. 19-27 *
A survey of continuous attribute discretization algorithms; 张钰莎; Computer Applications and Software; 2014-08-31; Vol. 31, No. 8; pp. 6-8, 140 *

Also Published As

Publication number Publication date
CN108073553A (en) 2018-05-25

Similar Documents

Publication Publication Date Title
Vreeken et al. Krimp: mining itemsets that compress
Shin et al. Sweg: Lossless and lossy summarization of web-scale graphs
Chen et al. Mining frequent patterns in a varying-size sliding window of online transactional data streams
JP6503679B2 (en) Filter rule creation device, filter rule creation method, and program
WO2013086931A1 (en) Event mining in social networks
Kim et al. Mining high utility itemsets based on the time decaying model
US20220114181A1 (en) Fingerprints for compressed columnar data search
CN108073553B (en) Continuous attribute data unsupervised discretization method based on information entropy
Yun et al. Efficient mining of robust closed weighted sequential patterns without information loss
Sukhija et al. Topic modeling and visualization for big data in social sciences
Wang et al. Negative sequence analysis: A review
CN110928925A (en) Frequent item set mining method and device, storage medium and electronic equipment
Gajawada et al. Projected clustering using particle swarm optimization
Oreški et al. Comparison of feature selection techniques in knowledge discovery process
Li et al. An analysis of network performance degradation induced by workload fluctuations
Vahdat et al. On the application of GP to streaming data classification tasks with label budgets
CN111723089A (en) Method and device for processing data based on columnar storage format
Li et al. A single-scan algorithm for mining sequential patterns from data streams
Goyal et al. AnyFI: An anytime frequent itemset mining algorithm for data streams
Sorgente et al. The reaction of a network: Exploring the relationship between the bitcoin network structure and the bitcoin price
Manike et al. Modified GUIDE (LM) algorithm for mining maximal high utility patterns from data streams
Cao et al. An algorithm for outlier detection on uncertain data stream
Ham et al. MBiS: an efficient method for mining frequent weighted utility itemsets from quantitative databases
Ikonomovska et al. Algorithmic techniques for processing data streams
Qin et al. An improved genetic clustering algorithm for categorical data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210212

Termination date: 20211227

CF01 Termination of patent right due to non-payment of annual fee