CN108073553B - Continuous attribute data unsupervised discretization method based on information entropy - Google Patents
Continuous attribute data unsupervised discretization method based on information entropy
- Publication number: CN108073553B (application CN201711450629.6A)
- Authority: CN (China)
- Prior art keywords: attribute, value, discrete, continuous, information entropy
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
Abstract
The invention relates to the technical field of big-data continuous attribute discretization, and in particular to an unsupervised discretization method for continuous attribute data based on information entropy. The method comprises the following steps. Step 1: traverse all value records of any continuous attribute n_j, count the discrete granularity |n_j| of the attribute and the probabilities q_ji of the different values, and record the maximum value n_j^max and minimum value n_j^min. Step 2: derive, from the information entropy formula, an expression for the value-confusion degree of any continuous attribute n_j, and compute the confusion degree of the attribute from that expression. Step 3: round the confusion degree down to obtain the number of breakpoints. Step 4: compute the width of each interval by the equal-width interval method and determine the position of each breakpoint. Step 5: discretize the continuous attribute n_j. This new way of determining the number of breakpoints adapts better to the original data; the discretization of each attribute does not affect the others, is independent of other attributes, and is computationally efficient.
Description
Technical Field
The invention relates to the technical field of big data continuous attribute discretization, in particular to an information entropy-based continuous attribute data unsupervised discretization method.
Background
Continuous attribute discretization is the process of dividing the value range of a continuous attribute into several intervals, each corresponding to a unique discrete value, and converting the original values into those discrete values. Researchers in China and abroad have proposed a large number of methods for discretizing continuous (numerical) attributes, which can be classified from different angles: top-down vs. bottom-up, supervised vs. unsupervised, global vs. local, static vs. dynamic, single-attribute vs. multi-attribute, and so on. The essence of continuous attribute discretization is to determine the number of discrete values (intervals) and the positions of the breakpoints; from the viewpoint of how the number of discrete values is determined, the main methods are as follows.
First, methods in which the user subjectively specifies the number of discrete values. Typical examples include the equal-width interval method (EWD), the equal-frequency interval method (EFD), clustering-based discretization, and the representative CADD (Class-Attribute Dependent Discretization); all of these require the user to specify the number of discrete values K in advance.
Second, methods that assume a relationship between the number of discrete values K and the number of records f in an interval or the interval width d. Typical examples are the PD (Proportional Discretizer) method and the FIMUS method (Rahman and Islam, 2014). PD assumes that the number of discrete values K equals the number of records per interval f, with K × f = |D| (the total number of records), which is essentially an equal-frequency method; FIMUS assumes that K equals the width d of each interval, with K × d = n_max - n_min (the value range), which is essentially an equal-width method. Both methods avoid user-specified parameters, but they rest on the prior assumption that the product of the per-interval record count (or interval width) and the number of intervals equals the total record count (or the value range), which lacks a theoretical basis.
Third, methods that determine the number of discrete values K from the classification attribute or from relationships among the discrete attributes. Many methods follow this idea; representative ones include CAIM (Class-Attribute Interdependence Maximization), LFD (Low Frequency Discretization), and MDLP (Minimum Description Length Principle). CAIM improves on CADD: it uses the CAIM value as the discretization criterion to maximize the class-attribute interdependence, and generates as few intervals as possible through a heuristic class-attribute correlation criterion. LFD improves on CAIM-like methods, which consider only the relationship between continuous attributes and the classification attribute, and on the majority of methods, which take high-frequency values as candidate breakpoints: when discretizing a continuous attribute it considers the relationships with all attributes (classification attributes, discrete attributes, and already-discretized attributes) and takes low-frequency values as candidate breakpoints. Methods such as CAIM and LFD thus create dependencies on the classification attribute or on other attributes during discretization. The MDLP method is a classical method based on information entropy and minimum description length: it selects breakpoints recursively, tries to minimize the information content of the model, and determines the appropriate number of discrete values by the MDL principle. Many other methods likewise discretize using the idea of information entropy, but most of them use some form of entropy only to decide whether to merge or split intervals.
Among the above methods, those in which the user subjectively specifies the number of discrete values lack adaptability to the original data; the methods that assume conditions lack a theoretical basis; the heuristic methods make the discretization process depend on other attributes; and the methods that employ information entropy do not determine the number of discrete values from the entropy of the continuous attribute itself, and are computationally expensive.
Disclosure of Invention
To solve these problems, the invention provides an information entropy-based unsupervised discretization method for continuous attribute data. It treats the different values of a continuous attribute as discrete events, computes the confusion degree of the attribute's values by means of the information entropy formula, and uses this confusion degree as the basis for determining the number of breakpoints; that is, the information entropy of the values of the continuous attribute serves as the breakpoint criterion.
The specific technical scheme adopted by the invention is as follows.
an information entropy-based continuous attribute data unsupervised discretization method comprises the following steps:
step 1, attribute value analysis: traversing any consecutive attribute n in a datasetjAll value records of (2) and statistics of the discrete granularity | n of the attributejL and calculating the probability of each different value, and recording the maximum value nj maxAnd a minimum value nj min;
Step 2, computing the confusion degree of the attribute: from the information entropy formula, the confusion degree of any continuous attribute n_j in the dataset is

    T_j = -Σ_{i=1}^{|n_j|} q_ji log2(q_ji)    (1)

where T_j is the confusion degree of the j-th continuous attribute, |n_j| is the discrete granularity of the j-th continuous attribute, and q_ji is the probability of the i-th value of the attribute; the value-confusion degree of the attribute is computed according to formula (1);
step 3, computing the number of breakpoints: round the value-confusion degree down to obtain the number of breakpoints, i.e.

    NumP_j = ⌊T_j⌋    (2)

where NumP_j is the number of breakpoints of the j-th attribute;
step 4, determining the breakpoint positions: the width of each interval is computed by the equal-width interval method, i.e.

    w_j = (n_j^max - n_j^min) / (NumP_j + 1)    (3)

where w_j is the width of each interval and n_j^max - n_j^min is the value range of the j-th attribute; the number of intervals is NumP_j + 1. The position of each breakpoint is determined according to formula (4):

    DoC_j = { p_th^j | p_th^j = n_j^min + th·w_j, th = 1, 2, …, NumP_j }    (4)

where DoC_j is the set of breakpoints of the attribute, p_th^j denotes each breakpoint, th is the breakpoint index, and n_j^min + th·w_j is the position of the th-th breakpoint. The ranges of the intervals determined by the breakpoint positions are: [n_j^min, n_j^min + w_j), [n_j^min + w_j, n_j^min + 2·w_j), …, [n_j^min + NumP_j·w_j, n_j^max];
Step 5, discretizing the continuous attribute n_j: traverse all value records of n_j and, according to the interval in which each value lies, replace it with the discrete value assigned to that interval.
Further, the discrete granularity |n_j| in step 1 is the number of distinct values.
Further, in step 1 the probability q_ji of each distinct value is computed as q_ji = n_ji / n_j^Total, where i = 1, 2, …, |n_j|, n_ji is the number of occurrences of the i-th distinct value, and n_j^Total is the total number of values; n_ji and n_j^Total are counted while traversing the data.
Further, in step 5 each interval is assigned a unique discrete value, and the discrete values of different intervals differ.
According to information entropy theory, the different values of a continuous attribute are treated as discrete events, the confusion degree of the values is computed, and this confusion degree is used as the basis for determining the number of breakpoints. This new way of determining the number of breakpoints adapts better to the original data; the discretization of each attribute does not affect the others and is independent of the other attributes, and the method is computationally efficient.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
First, the terminology and discretization process involved in the present invention will be briefly described to facilitate understanding of the method of the present invention.
Discrete granularity: for a given data set, the number of distinct values of any attribute is called the discrete granularity of that attribute. The discrete granularity of a discrete attribute is denoted |c|, and that of a continuous attribute is denoted |n|.
Information entropy is a measure of the uncertainty of information, defined over the occurrence probabilities of discrete random events. Similarly, each distinct value of a continuous attribute may be understood as a discrete random event, with the discrete granularity of the attribute equal to the number of such events. By computing the information entropy over the different values of a continuous attribute, its uncertainty, i.e., the confusion degree of its values, can be characterized, and the number of discrete breakpoints of the attribute can be determined from this confusion degree.
The information entropy is computed as:

    E = -Σ_{i=1}^{m} p_i log2(p_i)

where m is the number of distinct values of the source symbol and p_i is the probability of occurrence of the i-th value.
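This formula can be computed directly from a list of value records. The sketch below is illustrative only (the helper name `information_entropy` is ours, not the patent's); it treats each distinct value as a source symbol:

```python
from collections import Counter
from math import log2

def information_entropy(values):
    """E = -sum(p_i * log2(p_i)) over the m distinct values in `values` (bits)."""
    counts = Counter(values)          # occurrences of each distinct value
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

For example, four records with two equally frequent values yield an entropy of 1 bit, and sixteen equiprobable values yield 4 bits.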
Discretizing any continuous attribute of the data set means dividing its value range [n_min, n_max] (where n_min is the minimum value of attribute n and n_max its maximum) into k intervals; the position of each division is called a breakpoint. With the value range divided into k intervals, each interval is represented by a discrete value 1, 2, …, k; p denotes a breakpoint, and there are k - 1 of them. Discretization of a continuous attribute is thus a problem of determining intervals and breakpoints, and can be expressed as DoC = {K, P}, where K = {1, 2, …, k} is the set of intervals and P = {p_1, p_2, …, p_{k-1}} is the set of breakpoints.
The total number of intervals equals the number of breakpoints plus 1, i.e., |K| = |P| + 1, so the essence of continuous attribute discretization can be understood as determining the number of breakpoints and the position of each breakpoint, i.e., DoC = { p_th^s }, where th denotes the breakpoint index and s denotes the breakpoint position.
Next, the steps of the method of the present invention will be described in detail.
An information entropy-based continuous attribute data unsupervised discretization method comprises the following steps:
Step 1, attribute value analysis: traverse all value records of any continuous attribute n_j in the dataset, counting the total number of values n_j^Total, the discrete granularity |n_j| (the number of distinct values), and the number of occurrences n_ji of each distinct value; compute the probability of each value, q_ji = n_ji / n_j^Total, i = 1, 2, …, |n_j|; and record the maximum value n_j^max and minimum value n_j^min.
Step 2, computing the confusion degree of the attribute: from the information entropy formula, the confusion degree of any continuous attribute n_j in the dataset is

    T_j = -Σ_{i=1}^{|n_j|} q_ji log2(q_ji)    (1)

where T_j is the confusion degree of the j-th continuous attribute, |n_j| is the discrete granularity of the j-th continuous attribute (i.e., the number of distinct values), and q_ji is the probability of the i-th value of the attribute; the value-confusion degree of the attribute is computed according to formula (1).
For example, when the discrete granularity of a continuous attribute is 1, its information entropy is minimal, with value 0, meaning 0 breakpoints are inserted, i.e., all values fall in the same interval; when the discrete granularity is D and the values are equiprobable, the information entropy is maximal, with value log2(D), meaning ⌊log2(D)⌋ breakpoints are inserted and the values are divided into ⌊log2(D)⌋ + 1 intervals.
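These two boundary cases can be checked numerically. The sketch below is an illustration under our own naming (`confusion_degree` is a hypothetical helper, not defined in the patent):

```python
from collections import Counter
from math import floor, log2

def confusion_degree(values):
    # Formula (1): T_j = -sum(q_ji * log2(q_ji)) over the distinct values.
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Discrete granularity 1: entropy 0, so floor(0) = 0 breakpoints, one interval.
single = confusion_degree([3.14] * 100)

# Discrete granularity D = 8, equiprobable: entropy log2(8) = 3,
# so 3 breakpoints and 4 intervals.
uniform = confusion_degree(list(range(8)))
```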
Step 3, computing the number of breakpoints: in most cases the result of formula (1) is not an integer, whereas the number of breakpoints must be. To keep the number of intervals, i.e., discrete values, small, the value-confusion degree is rounded down to obtain the number of breakpoints, i.e.

    NumP_j = ⌊T_j⌋    (2)

where NumP_j is the number of breakpoints of the j-th attribute.
Step 4, determining the breakpoint positions: for simplicity, ease of understanding, and low computational cost, the equal-width interval method is chosen to compute the width of each interval, i.e.

    w_j = (n_j^max - n_j^min) / (NumP_j + 1)    (3)

where w_j is the width of each interval and n_j^max - n_j^min is the value range of the j-th attribute; the total number of intervals is NumP_j + 1. The position of each breakpoint is determined according to formula (4):

    DoC_j = { p_th^j | p_th^j = n_j^min + th·w_j, th = 1, 2, …, NumP_j }    (4)

where DoC_j is the set of breakpoints of the attribute, p_th^j denotes each breakpoint, th is the breakpoint index, and n_j^min + th·w_j is the position of the th-th breakpoint.
The ranges of the intervals determined by the breakpoint positions are: [n_j^min, n_j^min + w_j), [n_j^min + w_j, n_j^min + 2·w_j), …, [n_j^min + NumP_j·w_j, n_j^max].
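Formulas (3) and (4) amount to an equal-width cut of the value range. A minimal sketch (the function name `breakpoints` is ours, for illustration only):

```python
def breakpoints(n_min, n_max, num_p):
    """Equal-width breakpoint positions n_min + th * w_j for th = 1..NumP_j."""
    w = (n_max - n_min) / (num_p + 1)          # formula (3)
    return [n_min + th * w for th in range(1, num_p + 1)]   # formula (4)
```

For instance, with n_j^min = 0, n_j^max = 10, and NumP_j = 4, the width is 2 and the breakpoints are 2, 4, 6, 8, giving five intervals.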
Step 5, discretizing the continuous attribute n_j: assign a different discrete value to each of the divided intervals; then traverse all value records of n_j and, according to the interval in which each value lies, replace it with the discrete value assigned to that interval.
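Putting steps 1 to 5 together, the whole procedure can be sketched as follows. This is an illustrative Python sketch under our own naming; the patent itself prescribes no particular implementation:

```python
from collections import Counter
from math import floor, log2

def discretize(values):
    """Steps 1-5: entropy -> breakpoint count -> equal-width cuts -> discrete codes."""
    # Step 1: value probabilities, minimum, maximum.
    counts = Counter(values)
    total = len(values)
    n_min, n_max = min(values), max(values)
    # Step 2: confusion degree T_j, formula (1).
    t = -sum((c / total) * log2(c / total) for c in counts.values())
    # Step 3: number of breakpoints NumP_j = floor(T_j), formula (2).
    num_p = floor(t)
    if num_p == 0 or n_max == n_min:
        return [1] * total                     # a single interval
    # Step 4: equal-width interval width and breakpoint positions, formulas (3)-(4).
    w = (n_max - n_min) / (num_p + 1)
    cuts = [n_min + th * w for th in range(1, num_p + 1)]
    # Step 5: replace each value with the discrete value of its interval (1-based).
    def code(v):
        for k, cut in enumerate(cuts, start=1):
            if v < cut:
                return k
        return num_p + 1                       # last interval is closed on the right
    return [code(v) for v in values]
```

Values falling in the last interval [n_j^min + NumP_j·w_j, n_j^max] receive the discrete value NumP_j + 1, matching the closed right endpoint of the final interval above.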
The above are preferred embodiments of the present invention and should not be used to limit the scope of the present invention. Similar modifications of the embodiments will occur to those skilled in the art and are intended to be included within the scope of the present invention.
Claims (4)
1. An information entropy-based continuous attribute data unsupervised discretization method, characterized by comprising the following steps:
step 1, attribute value analysis: traverse all value records of any continuous attribute n_j in the dataset, count the discrete granularity |n_j| of the attribute, compute the probability of each distinct value, and record the maximum value n_j^max and minimum value n_j^min among all values of the continuous attribute n_j;
step 2, calculating the confusion degree of the attribute: from the information entropy formula, the confusion degree of any continuous attribute n_j in the dataset is

    T_j = -Σ_{i=1}^{|n_j|} q_ji log2(q_ji)    (1)

where T_j is the confusion degree of the j-th continuous attribute, |n_j| is the discrete granularity of the j-th continuous attribute, and q_ji is the probability of the i-th value of the attribute; the value-confusion degree of the attribute is calculated according to formula (1);
step 3, calculating the number of breakpoints: the value-confusion degree is rounded down to obtain the number of breakpoints, i.e.

    NumP_j = ⌊T_j⌋    (2)

where NumP_j is the number of breakpoints of the j-th attribute;
step 4, determining the breakpoint positions: the width of each interval is computed by the equal-width interval method, i.e.

    w_j = (n_j^max - n_j^min) / (NumP_j + 1)    (3)

where w_j is the width of each interval and n_j^max - n_j^min is the value range of the j-th attribute; the number of intervals is NumP_j + 1; the position of each breakpoint is determined according to formula (4):

    DoC_j = { p_th^j | p_th^j = n_j^min + th·w_j, th = 1, 2, …, NumP_j }    (4)

where DoC_j is the set of breakpoints of the attribute, p_th^j denotes each breakpoint, th is the breakpoint index with th ∈ N* (the positive integers), and n_j^min + th·w_j is the position of the th-th breakpoint; the ranges of the intervals determined by the breakpoint positions are: [n_j^min, n_j^min + w_j), [n_j^min + w_j, n_j^min + 2·w_j), …, [n_j^min + NumP_j·w_j, n_j^max];
step 5, discretizing the continuous attribute n_j: traverse all value records of n_j and replace each value with the discrete value assigned to the interval in which it lies.
2. The information entropy-based continuous attribute data unsupervised discretization method according to claim 1, wherein the discrete granularity |n_j| in step 1 is the number of distinct values.
3. The information entropy-based continuous attribute data unsupervised discretization method according to claim 1, wherein in step 1 the probability q_ji of each distinct value is computed as q_ji = n_ji / n_j^Total, where i = 1, 2, …, |n_j|, n_ji is the number of occurrences of the i-th distinct value, and n_j^Total is the total number of values; n_ji and n_j^Total are counted while traversing the data.
4. The information entropy-based continuous attribute data unsupervised discretization method according to claim 1, wherein in step 5 each interval is assigned a unique discrete value, and the discrete values of the intervals are pairwise different.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711450629.6A CN108073553B (en) | 2017-12-27 | 2017-12-27 | Continuous attribute data unsupervised discretization method based on information entropy |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108073553A CN108073553A (en) | 2018-05-25 |
CN108073553B true CN108073553B (en) | 2021-02-12 |
Family
ID=62155501
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711450629.6A Expired - Fee Related CN108073553B (en) | 2017-12-27 | 2017-12-27 | Continuous attribute data unsupervised discretization method based on information entropy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108073553B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109447163B (en) * | 2018-11-01 | 2022-03-22 | 中南大学 | Radar signal data-oriented moving object detection method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6026399A (en) * | 1997-05-30 | 2000-02-15 | Silicon Graphics, Inc. | System and method for selection of important attributes |
CN101777039A (en) * | 2009-11-20 | 2010-07-14 | 大连理工大学 | Continuous attribute discretization method based on Chi2 statistics |
CN102096672A (en) * | 2009-12-09 | 2011-06-15 | 西安邮电学院 | Method for extracting classification rule based on fuzzy-rough model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170344890A1 (en) * | 2016-05-26 | 2017-11-30 | Arun Kumar Parayatham | Distributed algorithm to find reliable, significant and relevant patterns in large data sets |
- 2017-12-27: CN application CN201711450629.6A filed; granted as CN108073553B, now not active (Expired - Fee Related)
Non-Patent Citations (2)
Title |
---|
Ma Shengjun, "Research on anomaly detection for targeted poverty alleviation data in the cloud environment," China Master's Theses Full-text Database, Information Science and Technology, No. 06, 2019-06-15, pp. 19-27 * |
Zhang Yusha, "A survey of continuous attribute discretization algorithms," Computer Applications and Software, Vol. 31, No. 8, August 2014, pp. 6-8, 140 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Vreeken et al. | Krimp: mining itemsets that compress | |
Shin et al. | Sweg: Lossless and lossy summarization of web-scale graphs | |
Chen et al. | Mining frequent patterns in a varying-size sliding window of online transactional data streams | |
JP6503679B2 (en) | Filter rule creation device, filter rule creation method, and program | |
WO2013086931A1 (en) | Event mining in social networks | |
Kim et al. | Mining high utility itemsets based on the time decaying model | |
US20220114181A1 (en) | Fingerprints for compressed columnar data search | |
CN108073553B (en) | Continuous attribute data unsupervised discretization method based on information entropy | |
Yun et al. | Efficient mining of robust closed weighted sequential patterns without information loss | |
Sukhija et al. | Topic modeling and visualization for big data in social sciences | |
Wang et al. | Negative sequence analysis: A review | |
CN110928925A (en) | Frequent item set mining method and device, storage medium and electronic equipment | |
Gajawada et al. | Projected clustering using particle swarm optimization | |
Oreški et al. | Comparison of feature selection techniques in knowledge discovery process | |
Li et al. | An analysis of network performance degradation induced by workload fluctuations | |
Vahdat et al. | On the application of GP to streaming data classification tasks with label budgets | |
CN111723089A (en) | Method and device for processing data based on columnar storage format | |
Li et al. | A single-scan algorithm for mining sequential patterns from data streams | |
Goyal et al. | AnyFI: An anytime frequent itemset mining algorithm for data streams | |
Sorgente et al. | The reaction of a network: Exploring the relationship between the bitcoin network structure and the bitcoin price | |
Manike et al. | Modified GUIDE (LM) algorithm for mining maximal high utility patterns from data streams | |
Cao et al. | An algorithm for outlier detection on uncertain data stream | |
Ham et al. | MBiS: an efficient method for mining frequent weighted utility itemsets from quantitative databases | |
Ikonomovska et al. | Algorithmic techniques for processing data streams | |
Qin et al. | An improved genetic clustering algorithm for categorical data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210212; Termination date: 20211227 |