CN108073553A - Unsupervised discretization method for continuous attribute data based on information entropy - Google Patents


Info

Publication number
CN108073553A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711450629.6A
Other languages
Chinese (zh)
Other versions
CN108073553B (en)
Inventor
马生俊
陈旺虎
郭宏乐
乔保民
李新田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest Normal University
Priority to CN201711450629.6A
Publication of CN108073553A
Application granted
Publication of CN108073553B
Expired - Fee Related
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Algebra (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the technical field of continuous attribute discretization for big data, and in particular to an unsupervised discretization method for continuous attribute data based on information entropy. The steps are as follows. Step 1: traverse all value records of any continuous attribute, count the discrete granularity $|n_j|$ of the attribute and the probability $q_{ji}$ of each distinct value, and record the maximum $n_j^{\max}$ and the minimum $n_j^{\min}$. Step 2: derive, from the calculation formula of information entropy, the formula for the value confusion degree of any continuous attribute $n_j$, and compute the value confusion degree of the attribute with it. Step 3: round the value confusion degree down to obtain the number of breakpoints. Step 4: compute the width of each interval with the equal-width interval method and determine the position of each breakpoint. Step 5: discretize the continuous attribute $n_j$. This new way of determining the number of breakpoints adapts better to the original data; each attribute is discretized independently, without relying on other attributes, and the computational efficiency is higher.

Description

Unsupervised discretization method for continuous attribute data based on information entropy
Technical field
The present invention relates to the technical field of continuous attribute discretization for big data, and in particular to an unsupervised discretization method for continuous attribute data based on information entropy.
Background art
Continuous attribute discretization is the process of dividing the value range of a continuous attribute into several intervals, each corresponding to a unique discrete value, so that the original numeric values are converted into discrete values. Researchers at home and abroad have proposed a great many methods for discretizing continuous (numeric) attributes. Viewed from different angles, these methods can be classified as top-down versus bottom-up, supervised versus unsupervised, global versus local, static versus dynamic, and single-attribute versus multi-attribute. In essence, continuous attribute discretization is the problem of determining the number of discrete values (the number of intervals) and the breakpoint positions. From the standpoint of how the number of discrete values is determined, the existing methods fall mainly into the following classes.
First, methods in which the user subjectively specifies the number of discrete values. Typical examples are the equal-width interval method (EWD), the equal-frequency interval method (EFD), clustering-based discretization methods, and the representative CADD (Class-Attribute Dependent Discretizer) method. All of these require the user to specify the number of discrete values K in advance. Although they are effective to some degree, they lack a theoretical basis and K is hard to choose accurately. The main drawback of this class is that the user must specify K.
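To make the dependence on a user-chosen K concrete, the following is a minimal sketch of the two simplest methods named above, EWD and EFD. The function names, the quantile-position rule used for EFD, and the sample data are illustrative assumptions, not taken from the patent.

```python
def equal_width_breakpoints(values, k):
    """EWD: the k-1 breakpoints that split [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_breakpoints(values, k):
    """EFD: k-1 breakpoints chosen so each interval holds roughly len(values)/k records."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed rule: the i-th breakpoint is the value at the i-th k-quantile position.
    return [ordered[(i * n) // k] for i in range(1, k)]


# Hypothetical data; K = 4 must be supplied by the user in both cases.
data = [1, 2, 2, 3, 4, 10, 11, 12, 50, 100]
ewd = equal_width_breakpoints(data, 4)
efd = equal_frequency_breakpoints(data, 4)
```

On skewed data such as this, the two methods place their breakpoints very differently, and neither gives any guidance on whether 4 was a sensible choice of K.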
Second, methods that assume a relationship between the number of discrete values K and either the number of records f in an interval or the interval width d. Typical examples are the PD (Proportional Discretizer) method and the FIMUS method (Rahman and Islam, 2014). PD assumes that K equals the number of records f in each subinterval, with K*f=D, where D is the total number of records; in essence it is an equal-frequency interval method. FIMUS assumes that K equals the width d of each subinterval, with K*d=[min, max]; in essence it is an equal-width interval method. Both methods avoid user-supplied parameters, but they require a prior assumption, namely that the product of the per-interval record count (or width) and the number of intervals equals the total number of records (or the value range), and this assumption lacks a theoretical basis.
Third, methods that determine K from the relationship with the class attribute or with categorical attributes. Many methods are based on this idea; representative ones include CAIM (Class-Attribute Interdependence Maximization), LFD (Low Frequency Discretizer) and MDLP (Minimum Description Length Principle). CAIM is an improvement on CADD. It uses the caim value as the discretization criterion, aims to maximize the interdependence between class and attribute, and uses a heuristic rule based on the class-attribute relationship to generate as few intervals as possible. LFD improves on methods such as CAIM, which consider only the relationship between the continuous attribute and the class attribute and mostly use high-frequency values as candidate breakpoints. When discretizing a continuous attribute, LFD considers the relationship with all attributes, such as the class attribute, categorical attributes and already-discretized attributes, and uses low-frequency values as candidate breakpoints. Methods like CAIM and LFD therefore depend on the class attribute or other attributes during discretization. MDLP is a classical method based on information entropy and minimum description length. It selects breakpoints recursively, tries to minimize the information content of the model, and uses the MDLP criterion to determine a suitable number of discrete values. Similarly, there are many other methods that discretize using information entropy, but most of them use some form of entropy to decide whether to merge or split intervals.
Among the above methods, those in which the user subjectively specifies the number of discrete values lack adaptability to the original data; those based on assumed conditions lack a theoretical basis; the heuristic methods make the discretization process depend on other attributes; and the methods that use information entropy do not determine the number of discrete values from the information entropy of the continuous attribute itself, and their computational cost is high.
Summary of the invention
To solve the above problems, the present invention provides an unsupervised discretization method for continuous attribute data based on information entropy. Each distinct value of a continuous attribute is regarded as a discrete event, the confusion degree of the attribute's values is obtained by computing the information entropy, and the confusion degree is used as the basis for determining the number of breakpoints. In other words, the number of breakpoints is based on the information entropy of the continuous attribute's values.
The specific technical solution adopted by the present invention is as follows:
An unsupervised discretization method for continuous attribute data based on information entropy, comprising the following steps:
Step 1, analysis of attribute values: traverse all value records of any continuous attribute $n_j$ in the data set, count the discrete granularity $|n_j|$ of the attribute, compute the probability of each distinct value, and record the maximum $n_j^{\max}$ and the minimum $n_j^{\min}$.
Step 2, computing the confusion degree of the attribute: from the calculation formula of information entropy, derive the formula for the value confusion degree of any continuous attribute $n_j$ in the data set, i.e.
$$T_j = \sum_{i=1}^{|n_j|} -q_{ji} \log_2 q_{ji} \qquad (1)$$
where $T_j$ is the confusion degree of the j-th continuous attribute, $|n_j|$ is the discrete granularity of the j-th continuous attribute, and $q_{ji}$ is the probability of the i-th value of the attribute; the value confusion degree of the attribute is computed according to formula (1).
Step 3, computing the number of breakpoints: round the value confusion degree down to obtain the number of breakpoints, i.e.
$$NumP_j = \lfloor T_j \rfloor \qquad (2)$$
where $NumP_j$ is the number of breakpoints of the j-th attribute.
Step 4, determining the breakpoint positions: compute the width of each interval with the equal-width interval method, i.e.
$$w_j = \frac{n_j^{\max} - n_j^{\min}}{NumP_j + 1} \qquad (3)$$
where $w_j$ is the width of each interval, $n_j^{\max} - n_j^{\min}$ is the extent of the value range $[n_j^{\min}, n_j^{\max}]$ of the j-th attribute, and the number of intervals is $NumP_j + 1$. The position of each breakpoint is determined according to formula (4),
$$DoC_j = \{\, P^{th}_{n_j^{\min} + (th \cdot w_j)} \mid 1 \le th \le NumP_j,\ th \in N^* \,\} \qquad (4)$$
where $DoC_j$ is the set of breakpoints of the attribute, $P^{th}_{n_j^{\min} + (th \cdot w_j)}$ denotes an individual breakpoint, $th$ is the index of the breakpoint, and $n_j^{\min} + (th \cdot w_j)$ is the position of the th-th breakpoint. The intervals are delimited by the breakpoint positions as follows: $[n_j^{\min}, n_j^{\min}+w_j)$, $[n_j^{\min}+w_j, n_j^{\min}+2w_j)$, ..., $[n_j^{\min}+(NumP_j \cdot w_j), n_j^{\max}]$.
Step 5, discretizing the continuous attribute $n_j$: traverse all value records of the continuous attribute $n_j$ and replace each value with the discrete value assigned to the interval in which it falls.
Further, the discrete granularity $|n_j|$ in Step 1 is the number of distinct values.
Further, in Step 1, the probability $q_{ji}$ of each distinct value is computed as
$q_{ji} = n_{ji} / n_{j,\mathrm{total}}$, where $i = 1, 2, \ldots, |n_j|$, $n_{ji}$ is the number of occurrences of each distinct value, and $n_{j,\mathrm{total}}$ is the total number of values; $n_{ji}$ and $n_{j,\mathrm{total}}$ are counted while traversing the data.
Further, in Step 5 each interval is assigned a unique discrete value, and the discrete values of different intervals differ.
The present invention regards each distinct value of a continuous attribute as a discrete event, computes the confusion degree of the values according to information entropy theory, and uses the confusion degree as the basis for determining the number of breakpoints. This new way of determining the number of breakpoints adapts better to the original data; each attribute is discretized independently, without relying on other attributes, and the computational efficiency is higher.
Description of the drawings
Fig. 1 is a flow diagram of the method of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawing.
First, the terms used in the invention and the discretization process are briefly explained to make the method easier to understand.
Discrete granularity: for a given data set, the number of distinct values of any attribute is called the discrete granularity of that attribute. The discrete granularity of a categorical attribute is denoted $|c|$, and that of a continuous attribute is denoted $|n|$.
Information entropy is a measure of the uncertainty of information and is defined in terms of the probabilities of occurrence of discrete random events. Similarly, each distinct value of a continuous attribute can be regarded as a discrete random event, and the discrete granularity of the attribute equals the number of discrete random events. Computing the information entropy over the distinct values of a continuous attribute characterizes the uncertainty of the attribute, i.e. the confusion degree of its values, and the number of breakpoints of the attribute can then be determined from this confusion degree.
The calculation formula of information entropy is as follows:
$$H = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where $m$ is the number of distinct values of the source symbol and $p_i$ is the probability that the i-th value occurs.
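The entropy formula can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `shannon_entropy` and the sample inputs are our own and do not appear in the patent.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Base-2 Shannon entropy of the empirical distribution of `values`."""
    total = len(values)
    # p_i is the relative frequency of the i-th distinct value.
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(values).values())


two_equal_values = shannon_entropy([1, 1, 2, 2])   # two values, each with p = 0.5
single_value = shannon_entropy([5, 5, 5])          # one value with p = 1
```

Two equally likely values give one bit of entropy, while a constant attribute gives zero, which matches the reading of entropy as the "confusion degree" of the values.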
Discretizing any continuous attribute of a data set is the problem of dividing its value range $[n^{\min}, n^{\max}]$ ($n^{\min}$ is the minimum of attribute $n$ and $n^{\max}$ is its maximum) into $k$ intervals; each dividing position is called a breakpoint. The value range of the continuous attribute is divided into $k$ intervals, which are represented by the discrete values $1, 2, \ldots, k$; $P$ denotes the breakpoints, of which there are $k-1$. Continuous attribute discretization is therefore the problem of determining the intervals and the breakpoints, which can be written $DoC = \{K, P\}$, where $K$ is the set of intervals, $K = \{1, 2, \ldots, k\}$, and $P$ is the set of breakpoints, $P = \{p_1, p_2, \ldots, p_{k-1}\}$.
The total number of intervals equals the number of breakpoints plus 1, i.e. $|K| = |P| + 1$, so the essence of continuous attribute discretization can be understood as determining the number of breakpoints and the position of each breakpoint, i.e. $DoC = \{P_s^{th}\}$, where $th$ is the index of the breakpoint and $s$ is its position.
Next, the steps of the method of the present invention are described in detail.
An unsupervised discretization method for continuous attribute data based on information entropy, comprising the following steps:
Step 1, analysis of attribute values: traverse all value records of any continuous attribute $n_j$ in the data set, and count the total number of values $n_{j,\mathrm{total}}$, the discrete granularity $|n_j|$ (the number of distinct values) and the number of occurrences $n_{ji}$ of each distinct value. Compute the probability of each value, $q_{ji} = n_{ji} / n_{j,\mathrm{total}}$, $i = 1, 2, \ldots, |n_j|$, and record the maximum $n_j^{\max}$ and the minimum $n_j^{\min}$.
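Step 1 amounts to a single counting pass over the column. The sketch below assumes the attribute is available as an in-memory list; the function name `analyze_attribute` and the dictionary keys are hypothetical names chosen for illustration.

```python
from collections import Counter


def analyze_attribute(values):
    """Collect the Step 1 statistics for one continuous attribute."""
    counts = Counter(values)          # n_ji: occurrences of each distinct value
    n_total = len(values)             # n_j,total: total number of value records
    return {
        "n_total": n_total,
        "granularity": len(counts),   # |n_j|: number of distinct values
        "probabilities": {v: c / n_total for v, c in counts.items()},  # q_ji
        "min": min(values),           # n_j^min
        "max": max(values),           # n_j^max
    }


stats = analyze_attribute([1.5, 2.0, 2.0, 4.0])
```

Because only per-attribute counts are needed, each attribute can be processed without looking at any other column, which is the independence property the text emphasizes.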
Step 2, computing the confusion degree of the attribute: from the calculation formula of information entropy, derive the formula for the value confusion degree of any continuous attribute $n_j$ in the data set, i.e.
$$T_j = \sum_{i=1}^{|n_j|} -q_{ji} \log_2 q_{ji} \qquad (1)$$
where $T_j$ is the confusion degree of the j-th continuous attribute, $|n_j|$ is the discrete granularity of the j-th continuous attribute, i.e. the number of distinct values, and $q_{ji}$ is the probability of the i-th value of the attribute; the value confusion degree of the attribute is computed according to formula (1).
For example, the information entropy is at its minimum when the discrete granularity of a continuous attribute is 1: its value is 0, which means that 0 breakpoints are inserted, i.e. all values fall in the same interval. The information entropy is at its maximum when the discrete granularity of a continuous attribute is D: its value is $\log_2 D$, which means that $\lfloor \log_2 D \rfloor$ breakpoints are inserted, i.e. all values are divided into $\lfloor \log_2 D \rfloor + 1$ intervals.
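As a worked illustration of the upper bound, assume a hypothetical attribute whose D = 8 records are all distinct, so that every value has probability 1/8 (the choice of 8 is ours and is not taken from the patent):

```latex
\begin{aligned}
T_j &= \sum_{i=1}^{8} -\tfrac{1}{8}\log_2 \tfrac{1}{8}
     = 8 \cdot \tfrac{1}{8} \cdot 3 = 3 = \log_2 8, \\
NumP_j &= \lfloor T_j \rfloor = 3, \qquad
\text{number of intervals} = NumP_j + 1 = 4 .
\end{aligned}
```

The number of intervals therefore grows only logarithmically with the number of distinct values, even in the most disordered case.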
Step 3, computing the number of breakpoints: in most cases the result of formula (1) is not an integer, but the number of breakpoints must be an integer. To keep the number of intervals, i.e. the number of discrete values, small, the number of breakpoints is obtained by rounding the value confusion degree down, i.e.
$$NumP_j = \lfloor T_j \rfloor \qquad (2)$$
where $NumP_j$ is the number of breakpoints of the j-th attribute.
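The rounding step can be illustrated with a case where the confusion degree is not an integer. This is a sketch under our own naming (`breakpoint_count`) and with made-up data. Note that a confusion degree that is mathematically an integer may land marginally below it in floating point; the source does not address this, so no tolerance is applied here.

```python
import math
from collections import Counter


def breakpoint_count(values):
    """NumP_j = floor(T_j), with T_j the base-2 entropy of the value frequencies."""
    total = len(values)
    confusion = sum(-(c / total) * math.log2(c / total)
                    for c in Counter(values).values())
    return math.floor(confusion)


# Probabilities 0.5, 0.25, 0.25 give T_j = 1.5, so one breakpoint and two intervals.
num_breakpoints = breakpoint_count([10, 10, 20, 30])
```

Rounding down rather than up keeps the number of discrete values as small as the confusion degree allows, which is the stated motivation in the text.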
Step 4, determining the breakpoint positions: for simplicity, ease of understanding and low computational cost, the equal-width interval method is chosen to compute the width of each interval, i.e.
$$w_j = \frac{n_j^{\max} - n_j^{\min}}{NumP_j + 1} \qquad (3)$$
where $w_j$ is the width of each interval, $n_j^{\max} - n_j^{\min}$ is the extent of the value range $[n_j^{\min}, n_j^{\max}]$ of the j-th attribute, and the total number of intervals is $NumP_j + 1$. The position of each breakpoint is determined according to formula (4),
$$DoC_j = \{\, P^{th}_{n_j^{\min} + (th \cdot w_j)} \mid 1 \le th \le NumP_j,\ th \in N^* \,\} \qquad (4)$$
where $DoC_j$ is the set of breakpoints of the attribute, $P^{th}_{n_j^{\min} + (th \cdot w_j)}$ denotes an individual breakpoint, $th$ is the index of the breakpoint, and $n_j^{\min} + (th \cdot w_j)$ is the position of the th-th breakpoint.
The intervals are delimited by the breakpoint positions as follows: $[n_j^{\min}, n_j^{\min}+w_j)$, $[n_j^{\min}+w_j, n_j^{\min}+2w_j)$, ..., $[n_j^{\min}+(NumP_j \cdot w_j), n_j^{\max}]$.
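The breakpoint positions follow directly from the minimum, the maximum and the breakpoint count. The sketch below assumes the width is the value range divided by the number of intervals, as implied by the interval list above; the function name and the sample numbers are illustrative.

```python
def breakpoint_positions(n_min, n_max, num_breakpoints):
    """Return (w_j, [n_min + th * w_j for th = 1..NumP_j])."""
    # NumP_j breakpoints split the range into NumP_j + 1 equal-width intervals.
    width = (n_max - n_min) / (num_breakpoints + 1)
    positions = [n_min + th * width for th in range(1, num_breakpoints + 1)]
    return width, positions


# Hypothetical attribute spanning [0, 12] with two breakpoints.
width, positions = breakpoint_positions(0.0, 12.0, 2)
```

With these inputs the intervals are [0, 4), [4, 8) and [8, 12]; when the breakpoint count is 0 the list of positions is empty and the whole range forms a single interval.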
Step 5, discretizing the continuous attribute $n_j$: according to the number of intervals, assign a different discrete value to each interval. Traverse all value records of the continuous attribute $n_j$ and replace each value with the discrete value assigned to the interval in which it falls.
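Putting the five steps together, a minimal end-to-end sketch might look as follows. It assumes the discrete values are the interval indices 1, 2, ..., k (the text only requires that they be unique per interval), uses `bisect_right` from the standard library to realize the left-closed intervals, and the function name `discretize` is our own.

```python
import math
from bisect import bisect_right
from collections import Counter


def discretize(values):
    """Return (labels, breakpoints) for one continuous attribute."""
    total = len(values)
    # Steps 1-2: value frequencies and confusion degree T_j.
    confusion = sum(-(c / total) * math.log2(c / total)
                    for c in Counter(values).values())
    # Step 3: number of breakpoints NumP_j = floor(T_j).
    num_breakpoints = math.floor(confusion)
    # Step 4: equal-width breakpoint positions.
    n_min, n_max = min(values), max(values)
    width = (n_max - n_min) / (num_breakpoints + 1)
    breakpoints = [n_min + th * width for th in range(1, num_breakpoints + 1)]
    # Step 5: a value equal to a breakpoint goes to the interval on its right,
    # matching the left-closed intervals; labels start at 1.
    labels = [bisect_right(breakpoints, v) + 1 for v in values]
    return labels, breakpoints


labels, breakpoints = discretize([0, 1, 2, 3, 4, 5, 6, 7])
```

Eight distinct, equally frequent values give a confusion degree of 3, hence three breakpoints and four intervals of two values each. A constant attribute yields no breakpoints and a single label.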
The above is the preferred embodiment of the present invention and is not intended to limit its scope of protection. Similar variations made by those skilled in the art on the basis of the embodiment all fall within the scope of protection of the present invention.

Claims (4)

1. An unsupervised discretization method for continuous attribute data based on information entropy, characterized by comprising the following steps:
Step 1, analysis of attribute values: traverse all value records of any continuous attribute $n_j$ in the data set, count the discrete granularity $|n_j|$ of the attribute, compute the probability of each distinct value, and record the maximum $n_j^{\max}$ and the minimum $n_j^{\min}$;
Step 2, computing the confusion degree of the attribute: from the calculation formula of information entropy, derive the formula for the value confusion degree of any continuous attribute $n_j$ in the data set, i.e.
$$T_j = \sum_{i=1}^{|n_j|} -q_{ji} \log_2 q_{ji} \qquad (1)$$
where $T_j$ is the confusion degree of the j-th continuous attribute, $|n_j|$ is the discrete granularity of the j-th continuous attribute, and $q_{ji}$ is the probability of the i-th value of the attribute; the value confusion degree of the attribute is computed according to formula (1);
Step 3, computing the number of breakpoints: round the value confusion degree down to obtain the number of breakpoints, i.e.
$$NumP_j = \lfloor T_j \rfloor \qquad (2)$$
where $NumP_j$ is the number of breakpoints of the j-th attribute;
Step 4, determining the breakpoint positions: compute the width of each interval with the equal-width interval method, i.e.
$$w_j = \frac{n_j^{\max} - n_j^{\min}}{NumP_j + 1} \qquad (3)$$
where $w_j$ is the width of each interval, $n_j^{\max} - n_j^{\min}$ is the extent of the value range $[n_j^{\min}, n_j^{\max}]$ of the j-th attribute, and the number of intervals is $NumP_j + 1$; the position of each breakpoint is determined according to formula (4),
$$DoC_j = \{\, P^{th}_{n_j^{\min} + (th \cdot w_j)} \mid 1 \le th \le NumP_j,\ th \in N^* \,\} \qquad (4)$$
where $DoC_j$ is the set of breakpoints of the attribute, $P^{th}_{n_j^{\min} + (th \cdot w_j)}$ denotes an individual breakpoint, $th$ is the index of the breakpoint, and $n_j^{\min} + (th \cdot w_j)$ is the position of the th-th breakpoint; the intervals are delimited by the breakpoint positions as follows: $[n_j^{\min}, n_j^{\min}+w_j)$, $[n_j^{\min}+w_j, n_j^{\min}+2w_j)$, ..., $[n_j^{\min}+(NumP_j \cdot w_j), n_j^{\max}]$;
Step 5, discretizing the continuous attribute $n_j$: traverse all value records of the continuous attribute $n_j$ and replace each value with the discrete value assigned to the interval in which it falls.
2. The unsupervised discretization method for continuous attribute data based on information entropy according to claim 1, characterized in that: the discrete granularity $|n_j|$ in Step 1 is the number of distinct values.
3. The unsupervised discretization method for continuous attribute data based on information entropy according to claim 1, characterized in that: in Step 1, the probability $q_{ji}$ of each distinct value is computed as $q_{ji} = n_{ji} / n_{j,\mathrm{total}}$, where $i = 1, 2, \ldots, |n_j|$, $n_{ji}$ is the number of occurrences of each distinct value, and $n_{j,\mathrm{total}}$ is the total number of values; $n_{ji}$ and $n_{j,\mathrm{total}}$ are counted while traversing the data.
4. The unsupervised discretization method for continuous attribute data based on information entropy according to claim 1, characterized in that: in Step 5 each interval is assigned a unique discrete value, and the discrete values of different intervals differ.
CN201711450629.6A 2017-12-27 2017-12-27 Continuous attribute data unsupervised discretization method based on information entropy Expired - Fee Related CN108073553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711450629.6A CN108073553B (en) 2017-12-27 2017-12-27 Continuous attribute data unsupervised discretization method based on information entropy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711450629.6A CN108073553B (en) 2017-12-27 2017-12-27 Continuous attribute data unsupervised discretization method based on information entropy

Publications (2)

Publication Number Publication Date
CN108073553A (en) 2018-05-25
CN108073553B (en) 2021-02-12

Family

ID=62155501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711450629.6A Expired - Fee Related CN108073553B (en) 2017-12-27 2017-12-27 Continuous attribute data unsupervised discretization method based on information entropy

Country Status (1)

Country Link
CN (1) CN108073553B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447163A (en) * 2018-11-01 2019-03-08 中南大学 Moving object detection method for radar signal data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6026399A (en) * 1997-05-30 2000-02-15 Silicon Graphics, Inc. System and method for selection of important attributes
CN101777039A (en) * 2009-11-20 2010-07-14 大连理工大学 Continuous attribute discretization method based on Chi2 statistics
CN102096672A (en) * 2009-12-09 2011-06-15 西安邮电学院 Method for extracting classification rule based on fuzzy-rough model
US20170344890A1 (en) * 2016-05-26 2017-11-30 Arun Kumar Parayatham Distributed algorithm to find reliable, significant and relevant patterns in large data sets

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
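张钰莎 (Zhang Yusha): "连续属性离散化算法研究综述" [A survey of continuous attribute discretization algorithms], 《计算机应用与软件》 [Computer Applications and Software] *
马生俊 (Ma Shengjun): "云环境下精准扶贫数据的异常检测研究" [Research on anomaly detection for targeted poverty alleviation data in a cloud environment], 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology] *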

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447163A (en) * 2018-11-01 2019-03-08 中南大学 Moving object detection method for radar signal data
CN109447163B (en) * 2018-11-01 2022-03-22 中南大学 Radar signal data-oriented moving object detection method

Also Published As

Publication number Publication date
CN108073553B (en) 2021-02-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210212

Termination date: 20211227