WO2019136799A1 - Data discretisation method and apparatus, computer device and storage medium - Google Patents

Data discretisation method and apparatus, computer device and storage medium

Info

Publication number
WO2019136799A1
Authority
WO
WIPO (PCT)
Prior art keywords
data set
loss rate
data
entropy
interval
Prior art date
Application number
PCT/CN2018/077137
Other languages
French (fr)
Chinese (zh)
Inventor
晏存
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019136799A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G06F16/212 Schema design and management with details for data modelling support

Definitions

  • the present application relates to the field of data processing technologies, and in particular, to a data discretization method, apparatus, computer device, and storage medium.
  • In the era of big data, databases keep growing, and there is an urgent need to mine these huge databases for valuable information. Because most collected data is continuous, data discretization has become a key technique for knowledge discovery and rule extraction. The discretization of continuous attributes is also an important pre-processing step in data mining and machine learning, and it directly affects the learning result.
  • In classification algorithms, discretizing the training sample set as a preprocessing step has a dual benefit. On the one hand, it can effectively reduce the complexity of the learning algorithm, speed up learning, and even improve classification accuracy; on the other hand, it can simplify and generalize the acquired knowledge and make the classification results easier to understand. The discretization problem has therefore been studied extensively and in depth.
  • the application provides a data discretization method, device, computer device and storage medium to improve the training effect of machine learning.
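  • The present application provides a data discretization method, the method comprising:
  • discretizing the value range of the service data by entropy-based data discretization to generate a corresponding discrete data set, and calculating an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;
  • pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and calculating an information entropy of the pre-merged data intervals;
  • merging the pre-merged data interval having the largest information entropy in the discrete data set to form a target data set, and calculating an information entropy and an interval loss rate of the target data set;
  • calculating an entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set;
  • determining whether the entropy loss rate is greater than the interval loss rate;
  • if the entropy loss rate is greater than the interval loss rate, outputting the target data set to complete data discretization of the value range of the service data.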
  • The present application provides a data discretization apparatus, the apparatus comprising:
  • a discrete generation calculation unit configured to discretize the value range of the service data by entropy-based data discretization to generate a corresponding discrete data set, and to calculate an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;
  • a first merge calculation unit configured to pre-merge the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and to calculate an information entropy of the pre-merged data intervals;
  • a second merge calculation unit configured to merge the pre-merged data interval having the largest information entropy in the discrete data set to form a target data set, and to calculate an information entropy and an interval loss rate of the target data set;
  • an entropy loss rate calculation unit configured to calculate an entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set;
  • a loss rate determining unit configured to determine whether the entropy loss rate is greater than the interval loss rate;
  • a data set output unit configured to output the target data set to complete data discretization of the value range of the service data if the entropy loss rate is greater than the interval loss rate.
  • The present application also provides a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements any of the data discretization methods provided by the present application.
  • The present application also provides a storage medium, wherein the storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform any of the data discretization methods provided by the present application.
  • The embodiment of the present application discretizes the value range of the service data into a discrete data set by entropy-based data discretization, wherein the discrete data set includes multiple data intervals, and merges the data intervals according to a preset merge rule until the entropy loss rate of the merged data set is greater than the interval loss rate. The merged data set thus has as few discrete intervals as possible while its entropy stays as large as possible, which improves the effect of data discretization and benefits data mining and machine learning.
  • FIG. 1 is a schematic flow chart of a data discretization method according to an embodiment of the present application.
  • FIG. 2 is a schematic flow chart of a data discretization method according to another embodiment of the present application.
  • FIG. 3 is a schematic block diagram of a data discretization apparatus according to an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a data discretization apparatus according to another embodiment of the present application.
  • FIG. 5 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • FIG. 1 is a schematic flow chart of a data discretization method according to an embodiment of the present application. As shown in FIG. 1, the data discretization method includes steps S101 to S107.
  • the attribute of the service data is a continuous attribute. Based on entropy-based data discretization, the continuous range of values is divided into multiple cells. These cells are data intervals, and multiple data intervals form a discrete data set.
  • To discretize the value range of the service data into a corresponding discrete data set, a split point may first be determined and the continuous values discretized according to that split point. For example, for an attribute A to be discretized, the value of A that yields the smallest entropy is selected as the split point, and the data intervals are divided recursively to obtain the discrete data set.
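  • The split-point selection described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes each value carries a class label, that "smallest entropy" means the smallest size-weighted class entropy of the two resulting parts, and that entropy uses base-2 logarithms. The names `class_entropy` and `best_split` are hypothetical, and only a single split is shown; recursion would apply the same function to each part.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy (base 2, an assumption) of the class labels in one part."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_split(values, labels):
    """Return (split_value, score): the value of the attribute whose two-way split
    (values < split_value versus values >= split_value) has the smallest
    size-weighted class entropy."""
    pairs = sorted(zip(values, labels))
    best_value, best_score = None, float("inf")
    for k in range(1, len(pairs)):
        if pairs[k][0] == pairs[k - 1][0]:
            continue  # equal values cannot be separated by a split point
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        score = (len(left) * class_entropy(left)
                 + len(right) * class_entropy(right)) / len(pairs)
        if score < best_score:
            best_value, best_score = pairs[k][0], score
    return best_value, best_score

# Hypothetical sample data: two well-separated classes.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
split_value, score = best_split(values, labels)
```

  • With this sample, the split falls at the first value of the second class, because that split leaves both parts pure.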
  • The information entropy of the discrete data set is calculated using the information entropy formula (Expression 1-1):
  • $$H(p) = -\sum_{i=1}^{n} p_i \log p_i \tag{1-1}$$
  • where n is a positive integer greater than 1;
  • i is a positive integer between 1 and n;
  • $p_i$ is the probability of occurrence of the i-th data;
  • H(p) is the information entropy.
  • To calculate the information entropy of the discrete data set with this formula, the data intervals are first arranged in ascending order and the number of occurrences in each data interval is counted; the probability distribution of the data intervals can then be calculated from these counts.
  • the information entropy of the discrete data set can be calculated by using the expression 1-1 according to the probability of the data interval, and is recorded as G0.
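  • As a minimal sketch of this calculation, the snippet below counts how often each data interval occurs, converts the counts to probabilities, and evaluates the entropy. The base-2 logarithm, the function name `information_entropy`, and the sample `labels` list are assumptions made for illustration; the text does not specify the logarithm base.

```python
import math
from collections import Counter

def information_entropy(counts):
    """Shannon entropy -sum(p_i * log2(p_i)) of a list of occurrence counts.
    Empty intervals (count 0) contribute nothing to the sum."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical sample: the interval index that each record falls into.
labels = [0, 0, 1, 1, 1, 2, 2, 2, 2, 3]

# Arrange intervals in ascending order and count occurrences per interval.
counts = [n for _, n in sorted(Counter(labels).items())]

# Entropy of the initial discrete data set, recorded as G0 in the text.
G0 = information_entropy(counts)
```

  • Two equally likely intervals give an entropy of exactly 1 bit under this choice of base, which is a convenient sanity check.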
  • The preset merge rule merges the data intervals in the discrete data set in a preset manner, for example by merging two adjacent data intervals in the discrete data set, or by merging two alternating data intervals in the discrete data set. Note that within one embodiment only one preset merge rule is used: if two adjacent data intervals are merged in the first round, the subsequent rounds of the merging loop also merge two adjacent data intervals of the merged discrete data set.
  • the discrete data set is S0, which includes a plurality of data intervals denoted as S00, S01, S02...S0n.
  • For example, S00 and S01, or S01 and S02, are two adjacent data intervals, whereas S00 and S02, or S01 and S03, are two alternating data intervals.
  • Merging each pair of adjacent data intervals in the discrete data set generates new data intervals, such as (S00, S01), (S01, S02), ..., (S0n-1, S0n); these new data intervals are the pre-merged data intervals.
  • the information entropy corresponding to these pre-combined data intervals is calculated by the calculation formula of information entropy.
  • The information entropies of these pre-merged data intervals differ in size, and the pre-merged data interval with the largest information entropy is identified.
  • For example, suppose the pre-merged data interval with the largest information entropy in the discrete data set is (S02, S03), that is, its information entropy is larger than that of every other pre-merged data interval. This pre-merged data interval is then actually merged and recorded as AS0203; in other words, the pre-merged data interval with the largest information entropy in the discrete data set is merged to form the target data set. The target data set therefore contains the data intervals S00, S01, AS0203, S04, ..., S0n.
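  • The pre-merge and selection steps can be sketched as below. The text does not say exactly how the information entropy of a pre-merged data interval is evaluated, so this sketch assumes it is the entropy of the data set that would result from that merge; each interval is represented only by its occurrence count. The helper names `entropy` and `merge_best_adjacent` are hypothetical, and the adjacent-pair rule is the one illustrated.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2, an assumption) of a list of interval counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def merge_best_adjacent(counts):
    """Pre-merge every pair of adjacent intervals and keep the candidate whose
    resulting data set has the largest entropy.
    Returns (merged_counts, merged_entropy, index_of_left_interval)."""
    if len(counts) < 2:
        raise ValueError("need at least two intervals to merge")
    best = None
    for i in range(len(counts) - 1):
        # Candidate set in which intervals i and i+1 are combined into one.
        candidate = counts[:i] + [counts[i] + counts[i + 1]] + counts[i + 2:]
        h = entropy(candidate)
        if best is None or h > best[1]:
            best = (candidate, h, i)
    return best

# Hypothetical counts for four intervals S00..S03.
merged, merged_entropy, index = merge_best_adjacent([5, 1, 1, 5])
```

  • In this sample the two sparsely populated middle intervals are merged, since that candidate keeps the distribution closest to uniform and therefore retains the most entropy.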
  • Relative to the original discrete data set, the target data set loses data intervals and information entropy, so the interval loss rate of the target data set can also be calculated.
  • The interval loss rate of the target data set may be calculated using a preset interval loss rate formula (Expression 1-2):
  • $$L_q = \frac{x}{N} \tag{1-2}$$
  • where $L_q$ is the interval loss rate;
  • x is the number of data intervals lost after each merge;
  • N is the number of data intervals of the discrete data set.
  • The interval loss rate of the target data set is denoted as $L_1$.
  • From the preset interval loss rate formula, the interval loss rate of the target data set is $L_1 = 1/N$.
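  • A short worked example of the interval loss rate follows. It assumes the rate is the number of lost intervals x divided by the interval count N, which is consistent with the value 1/N stated for a single merge; the function name and the sample figures are hypothetical.

```python
def interval_loss_rate(lost_intervals, total_intervals):
    """Fraction of data intervals lost by a merge: x / N."""
    if total_intervals <= 0:
        raise ValueError("the discrete data set must contain at least one interval")
    return lost_intervals / total_intervals

# Merging one adjacent pair removes exactly one interval (x = 1).
# With N = 8 intervals in the discrete data set, the loss rate is 1/8.
L1 = interval_loss_rate(1, 8)
```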
  • The entropy loss rate of the target data set is calculated from the information entropy of the discrete data set and the information entropy of the target data set using a preset entropy loss rate formula (Expression 1-3):
  • $$H_q = \frac{G_0 - G}{G_0} \tag{1-3}$$
  • where $H_q$ is the entropy loss rate;
  • $G_0$ is the information entropy of the discrete data set;
  • G is the information entropy of the target data set.
  • The preset entropy loss rate formula is tied to the preset interval loss rate formula: if N in the interval loss rate formula is updated after each merge of data intervals, then $G_0$ in the entropy loss rate formula must also be updated after each merge, to keep the calculation accurate.
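  • The entropy loss rate and its comparison with the interval loss rate can be illustrated as follows. This sketch takes the entropy loss rate to be the relative drop (G0 - G) / G0, which is an assumption based on the symbol definitions in the text; the function name and the numeric values are hypothetical.

```python
def entropy_loss_rate(g0, g):
    """Relative entropy lost by a merge, assumed to be (G0 - G) / G0."""
    if g0 == 0:
        raise ValueError("G0 must be non-zero")
    return (g0 - g) / g0

# Hypothetical entropies before (G0) and after (G) one merge.
H1 = entropy_loss_rate(2.0, 1.5)   # a quarter of the entropy is lost

# Interval loss rate for one merge out of N = 8 intervals.
L1 = 1 / 8

# Merging stops once entropy is lost faster than intervals are saved.
stop_merging = H1 > L1
```

  • Both rates are dimensionless fractions of the pre-merge state, which is why N and G0 should be taken from the same data set (both fixed, or both updated after each merge).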
  • In step S105, it is specifically determined whether the entropy loss rate $H_1$ of the target data set is greater than the interval loss rate $L_1$ of the target data set. If the entropy loss rate is greater than the interval loss rate, step S106 is performed; if the entropy loss rate is not greater than the interval loss rate, step S107 is performed.
  • The target data set is output to complete data discretization of the value range of the service data. Specifically, the target data set may be saved and the storage address information sent to the user, so that the user can retrieve the target data set as needed, for example for model training in data mining or machine learning.
  • Otherwise, the target data set is taken as the discrete data set, and steps S102 to S105 are performed again to carry out the next round of data interval merging.
  • the cycle is continued until the entropy loss rate is greater than the interval loss rate, and the continuous loop merging is stopped, wherein the target data set corresponding to the entropy loss rate being greater than the interval loss rate is the result of the last required data discretization.
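  • The full merging loop can be sketched as below, under several stated assumptions: intervals are represented by their occurrence counts, only adjacent intervals are merged, entropy uses base-2 logarithms, the entropy loss rate is (G0 - G) / G0, the interval loss rate is 1/N, and both G0 and N are refreshed from the current data set in each round. The function names are hypothetical, and this is an illustration rather than the applicant's implementation.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of interval counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def discretize_by_merging(counts):
    """Repeatedly merge the adjacent pair that keeps the most entropy, and
    return the first merged set whose entropy loss rate exceeds its interval
    loss rate."""
    current = list(counts)
    while len(current) >= 2:
        g0 = entropy(current)
        if g0 == 0:
            break  # nothing left to lose; avoid dividing by zero
        # Pre-merge: one candidate set per adjacent pair.
        candidates = [
            current[:i] + [current[i] + current[i + 1]] + current[i + 2:]
            for i in range(len(current) - 1)
        ]
        # Merge: keep the candidate with the largest entropy.
        target = max(candidates, key=entropy)
        entropy_loss = (g0 - entropy(target)) / g0
        interval_loss = 1 / len(current)
        if entropy_loss > interval_loss:
            return target  # stopping condition met: output the target set
        current = target   # otherwise loop with the target as the new set
    return current

result = discretize_by_merging([50, 1, 1, 50, 1, 1, 50])
```

  • The loop always terminates: once only two intervals remain, merging them drops the entropy to zero, so the entropy loss rate is 1 and exceeds the interval loss rate of 1/2.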
  • The above embodiment discretizes the value range of the service data into a discrete data set by entropy-based data discretization, wherein the discrete data set includes a plurality of data intervals, and merges the data intervals according to the preset merge rule until the entropy loss rate of the merged data set is greater than the interval loss rate. The merged data set thus has as few discrete intervals as possible while its entropy stays as large as possible, which improves the effect of data discretization and benefits data mining and machine learning.
  • FIG. 2 is a schematic flowchart of a data discretization method according to another embodiment of the present application.
  • the data discretization method is specifically based on entropy-based data discretization, and can be run in a terminal or a server to discretize continuous attributes of data.
  • the data discretization method includes steps S201 to S209.
  • The value range of the service data may be determined according to a range intercepted by the user, or according to a preset intercept window; the preset intercept window can be set by the user according to actual needs.
  • the value range is the valid range of the business data and can reflect certain characteristics of the business data.
  • Processing the value range of the service data according to the preset processing rule includes filtering and noise-reduction processing, normalization processing, or the like, so that the data discretized afterwards can be better applied to data mining or machine learning.
  • The filtering and noise-reduction processing and the normalization processing use existing methods and are not described in detail here.
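  • As one example of such an existing method, min-max normalization rescales the values to the range [0, 1] before discretization. The text does not name a specific normalization, so this choice and the function name `min_max_normalize` are assumptions for illustration only.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

normalized = min_max_normalize([2, 4, 6])
```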
  • The value range of the service data is discretized to generate a corresponding discrete data set that includes a plurality of data intervals. The data intervals are sorted and the number of occurrences in each data interval is counted, after which the information entropy of the discrete data set can be calculated with the information entropy formula.
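  • Counting the occurrences per data interval can be sketched with the standard-library `bisect` module. The sketch assumes the intervals are defined by a sorted list of split points and are half-open on the right; the function name `interval_counts` and the sample data are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def interval_counts(values, boundaries):
    """Count how many values fall into each interval defined by sorted split
    points. Interval k covers boundaries[k-1] <= v < boundaries[k]; the first
    and last intervals are open-ended."""
    hits = Counter(bisect_right(boundaries, v) for v in values)
    return [hits.get(k, 0) for k in range(len(boundaries) + 1)]

# Hypothetical service data and two split points, giving three intervals.
counts = interval_counts([1, 2, 5, 7, 9, 12], [3, 8])
```

  • The resulting list of counts, in ascending interval order, is the input needed to compute the probability of each interval and hence the information entropy.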
  • the discrete data set is S0, which includes a plurality of data intervals denoted as S00, S01, S02, ..., S0n.
  • Merging each pair of adjacent data intervals in the discrete data set generates new data intervals, such as (S00, S01), (S01, S02), ..., (S0n-1, S0n); these new data intervals are the pre-merged data intervals.
  • the information entropy corresponding to these pre-combined data intervals is calculated by the calculation formula of information entropy.
  • The information entropies of these pre-merged data intervals differ in size, and the pre-merged data interval with the largest information entropy is identified.
  • Relative to the original discrete data set, the target data set loses data intervals, so the interval loss rate of the target data set is calculated using the interval loss rate formula of the above embodiment.
  • the entropy loss rate of the target data set is calculated according to the information entropy of the discrete data set and the information entropy of the target data set by using Expressions 1-3.
  • In step S207, it is determined whether the entropy loss rate is greater than the interval loss rate, which yields two possible results. Specifically, if the entropy loss rate is greater than the interval loss rate, step S208 is performed; if the entropy loss rate is not greater than the interval loss rate, step S209 is performed.
  • The target data set is output to complete data discretization of the value range of the service data. Specifically, the target data set may be saved and the storage address information sent to the user, so that the user can retrieve the target data set as needed, for example for model training in data mining or machine learning.
  • Otherwise, the target data set is taken as the discrete data set, and steps S204 to S207 are performed again to carry out the next round of data interval merging.
  • the cycle is continued until the entropy loss rate is greater than the interval loss rate, and the continuous loop merging is stopped, wherein the target data set corresponding to the entropy loss rate being greater than the interval loss rate is the result of the last required data discretization.
  • FIG. 3 is a schematic block diagram of a data discretization apparatus according to an embodiment of the present application.
  • the data discretization device 300 can be installed in a server or a terminal.
  • The data discretization apparatus 300 includes: a discrete generation calculation unit 301, a first merge calculation unit 302, a second merge calculation unit 303, an entropy loss rate calculation unit 304, a loss rate determination unit 305, a data set output unit 306, and a loop execution unit 307.
  • The discrete generation calculation unit 301 is configured to discretize the value range of the service data to generate a corresponding discrete data set, and to calculate an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals.
  • the first merge calculation unit 302 is configured to pre-merge the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and calculate an information entropy of the pre-merged data interval.
  • the second merge calculation unit 303 is configured to combine the pre-merged data intervals having the largest information entropy in the discrete data set as a target data set, and calculate an information entropy of the target data set and an interval loss rate.
  • the entropy loss rate calculation unit 304 is configured to calculate an entropy loss rate of the target data set according to information entropy of the discrete data set and information entropy of the target data set.
  • the loss rate determining unit 305 is configured to determine whether the entropy loss rate is greater than the interval loss rate.
  • If the loss rate determination unit 305 determines that the entropy loss rate is greater than the interval loss rate, the data set output unit 306 is invoked; if the loss rate determination unit 305 determines that the entropy loss rate is not greater than the interval loss rate, the loop execution unit 307 is invoked.
  • the data set output unit 306 is configured to output the target data set to complete data discretization of the value range of the service data.
  • The loop execution unit 307 is configured to take the target data set as the discrete data set and return to the step of pre-merging the data intervals in the discrete data set according to the preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
  • FIG. 4 is a schematic block diagram of a data discretization apparatus according to another embodiment of the present application.
  • the data discretization device 400 can be installed in a server or a terminal.
  • The data discretization apparatus 400 includes: a value range determining unit 401, a value range processing unit 402, a discrete generation calculation unit 403, a first merge calculation unit 404, a second merge calculation unit 405, an entropy loss rate calculation unit 406, a loss rate determining unit 407, a data set output unit 408, and a loop execution unit 409.
  • the value range determining unit 401 is configured to obtain service data of the target service and determine a value range of the service data.
  • the value range processing unit 402 is configured to process the value range of the service data according to a preset processing rule.
  • The discrete generation calculation unit 403 is configured to discretize the value range of the service data by entropy-based data discretization to generate a corresponding discrete data set, and to calculate an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals.
  • the first merge calculation unit 404 is configured to pre-merge two adjacent data segments in the discrete data set to obtain a plurality of pre-combined data intervals, and calculate information entropy of the pre-merged data interval.
  • the second merge calculation unit 405 is configured to combine the pre-merged data intervals having the largest information entropy in the discrete data set as a target data set, and calculate an information entropy of the target data set and an interval loss rate.
  • the entropy loss rate calculation unit 406 is configured to calculate an entropy loss rate of the target data set according to information entropy of the discrete data set and information entropy of the target data set.
  • the loss rate determining unit 407 is configured to determine whether the entropy loss rate is greater than the interval loss rate.
  • If the loss rate determining unit 407 determines that the entropy loss rate is greater than the interval loss rate, the data set output unit 408 is invoked; if the loss rate determining unit 407 determines that the entropy loss rate is not greater than the interval loss rate, the loop execution unit 409 is invoked.
  • the data set output unit 408 is configured to output the target data set to complete data discretization of the value range of the service data if the entropy loss rate is greater than the interval loss rate.
  • The above apparatus may be embodied in the form of a computer program that can be run on a computer device as shown in FIG. 5.
  • FIG. 5 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • The computer device 500 can be a terminal or a server.
  • the computer device 500 includes a processor 520, a memory and a network interface 550 connected by a system bus 510, wherein the memory can include a non-volatile storage medium 530 and an internal memory 540.
  • The non-volatile storage medium 530 can store an operating system 531 and a computer program 532. When executed, the computer program 532 can cause the processor 520 to perform a data discretization method.
  • the processor 520 is used to provide computing and control capabilities to support the operation of the entire computer device 500.
  • the internal memory 540 provides an environment for the operation of a computer program in a non-volatile storage medium that, when executed by the processor 520, causes the processor 520 to perform a data discretization method.
  • The network interface 550 is used for network communication, such as sending assigned tasks. It will be understood by those skilled in the art that the structure shown in FIG. 5 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine some components, or arrange the components differently.
  • the processor 520 is configured to run the program code stored in the memory to implement the process steps corresponding to the data discretization method in the foregoing embodiment.
  • the processor 520 may be a central processing unit (CPU), and the processor 520 may also be other general-purpose processors, a digital signal processor (DSP), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware component, etc.
  • the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • The architecture of the computer device 500 illustrated in FIG. 5 does not limit the computer device 500, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
  • The computer readable storage medium may be any medium that can store program code, such as a magnetic disk, an optical disk, a USB flash drive, a mobile hard disk, or a random access memory (RAM).
  • the units in the apparatus of the embodiment of the present application may be combined, divided, and deleted according to actual needs.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, can be stored in a computer readable storage medium.
  • The technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium. The software product includes a number of instructions for causing a computer device (which may be a personal computer, terminal, network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present application.

Abstract

Disclosed in the present application are a data discretisation method and apparatus, a computer device, and a storage medium, the method comprising: discretising the value range of service data into a discrete data set by entropy-based data discretisation, the discrete data set comprising a plurality of data intervals; using a preset merging rule to merge the data intervals until the entropy loss rate of the merged data set is greater than the interval loss rate; and outputting a target data set to complete the data discretisation of the service data.

Description

Data discretization method, apparatus, computer device, and storage medium
This application claims priority to Chinese Patent Application No. 201810031540.4, entitled "Data Discretization Method, Apparatus, Computer Device, and Storage Medium", filed with the Chinese Patent Office on January 12, 2018, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of data processing technologies, and in particular, to a data discretization method, apparatus, computer device, and storage medium.
Background
In the era of big data, databases keep growing, and there is an urgent need to mine these huge databases for valuable information. Because most collected data is continuous, data discretization has become a key technique for knowledge discovery and rule extraction. The discretization of continuous attributes is also an important pre-processing step in data mining and machine learning, and it directly affects the learning result. In classification algorithms, discretizing the training sample set as a preprocessing step has a dual benefit. On the one hand, it can effectively reduce the complexity of the learning algorithm, speed up learning, and even improve classification accuracy; on the other hand, it can simplify and generalize the acquired knowledge and make the classification results easier to understand. The discretization problem has therefore been studied extensively and in depth. Equal-width and equal-frequency interval methods are common discretization algorithms. Although they are easy to implement, they ignore the distribution of the samples, so it is difficult for them to place interval boundaries at the most suitable positions, and in most cases their performance is unsatisfactory.
Summary of the invention
The present application provides a data discretization method, apparatus, computer device, and storage medium to improve the training effect of machine learning.
In a first aspect, the present application provides a data discretization method, the method comprising:
discretizing the value range of the service data by entropy-based data discretization to generate a corresponding discrete data set, and calculating an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;
pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and calculating an information entropy of the pre-merged data intervals;
merging the pre-merged data interval having the largest information entropy in the discrete data set to form a target data set, and calculating an information entropy and an interval loss rate of the target data set;
calculating an entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set;
determining whether the entropy loss rate is greater than the interval loss rate;
if the entropy loss rate is greater than the interval loss rate, outputting the target data set to complete data discretization of the value range of the service data.
In a second aspect, the present application provides a data discretization apparatus, the apparatus comprising:
a discrete generation calculation unit, configured to discretize the value range of the service data by entropy-based data discretization to generate a corresponding discrete data set, and to calculate an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;
a first merge calculation unit, configured to pre-merge the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and to calculate an information entropy of the pre-merged data intervals;
a second merge calculation unit, configured to merge the pre-merged data interval having the largest information entropy in the discrete data set to form a target data set, and to calculate an information entropy and an interval loss rate of the target data set;
an entropy loss rate calculation unit, configured to calculate an entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set;
a loss rate determining unit, configured to determine whether the entropy loss rate is greater than the interval loss rate;
a data set output unit, configured to output the target data set to complete data discretization of the value range of the service data if the entropy loss rate is greater than the interval loss rate.
In a third aspect, the present application further provides a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any of the data discretization methods provided by the present application.
In a fourth aspect, the present application further provides a storage medium, wherein the storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform any of the data discretization methods provided by the present application.
In the embodiments of the present application, the value range of service data is discretized into a discrete data set by entropy-based data discretization, wherein the discrete data set includes a plurality of data intervals. The data intervals are merged according to a preset merge rule until the entropy loss rate of the merged data set is greater than the interval loss rate, so that the merged data set has as few discrete intervals as possible while its entropy remains as large as possible. This improves the effect of data discretization and benefits data mining and machine learning.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application, and a person of ordinary skill in the art may obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a data discretization method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a data discretization method according to another embodiment of the present application;
FIG. 3 is a schematic block diagram of a data discretization apparatus according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a data discretization apparatus according to another embodiment of the present application;
FIG. 5 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "include" and "comprise" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a data discretization method according to an embodiment of the present application. As shown in FIG. 1, the data discretization method includes steps S101 to S107.
S101. Perform entropy-based data discretization to discretize the value range of service data into a corresponding discrete data set, and calculate the information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals.
In this embodiment, the attribute of the service data is a continuous attribute. Entropy-based data discretization divides the continuous value range into a plurality of small intervals; these small intervals are the data intervals, and the plurality of data intervals form the discrete data set.
To discretize the value range of the service data into the corresponding discrete data set by entropy-based data discretization, a split point may first be determined, and the continuous values are then discretized according to the split point. For example, using the existing approach, to discretize an attribute A, the value of A having the minimum entropy is selected as the split point, and the data intervals are partitioned recursively to obtain the discrete data set.
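As a non-limiting illustration, the split-point selection described above can be sketched as follows. This is not the application's own implementation. It assumes that each sample carries a class label (the minimum-entropy criterion needs one, although the text does not spell this out), that entropy is taken in base 2, and that recursion stops at a hypothetical `max_depth`. The names `class_entropy`, `best_split`, and `recursive_cuts` are illustrative only.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (base 2, an assumption) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (split_point, weighted_entropy) with the minimum weighted class entropy."""
    pairs = sorted(zip(values, labels))
    best_point, best_score = None, float("inf")
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no boundary exists between equal attribute values
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        score = (len(left) * class_entropy(left)
                 + len(right) * class_entropy(right)) / len(pairs)
        if score < best_score:
            best_point = (pairs[k - 1][0] + pairs[k][0]) / 2  # midpoint as the cut
            best_score = score
    return best_point, best_score


def recursive_cuts(values, labels, max_depth=2):
    """Recursively collect split points; max_depth is a hypothetical stopping rule."""
    if max_depth == 0 or len(set(labels)) < 2:
        return []
    point, _ = best_split(values, labels)
    if point is None:
        return []
    cuts = [point]
    for keep in (lambda v: v <= point, lambda v: v > point):
        part = [(v, lab) for v, lab in zip(values, labels) if keep(v)]
        if part:
            vs, ls = zip(*part)
            cuts += recursive_cuts(list(vs), list(ls), max_depth - 1)
    return sorted(cuts)
```

The returned split points delimit the data intervals of the discrete data set; any other stopping rule could replace `max_depth` in this sketch.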
The information entropy of the discrete data set is calculated using the information entropy formula, which is:
$H(p) = -\sum_{i=1}^{n} p_i \log p_i \qquad (1\text{-}1)$
In expression 1-1, n is a positive integer greater than 1, i is a positive integer from 1 to n, $p_i$ is the probability of occurrence of the i-th data, and H(p) is the information entropy.
Specifically, to calculate the information entropy of the discrete data set using the information entropy formula, the data intervals are first arranged in ascending order and the number of occurrences of each data interval is counted; the probability distribution of the data intervals can be calculated from these occurrence counts. Using expression 1-1 with the probabilities of the data intervals, the information entropy of the discrete data set can be calculated and is denoted $G_0$.
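As a non-limiting illustration, the calculation of $G_0$ from occurrence counts can be sketched as follows. The function name `information_entropy`, the example counts, and the choice of base 2 for the logarithm are assumptions; the text does not fix the base.

```python
import math


def information_entropy(counts):
    """Information entropy of a discrete data set, given per-interval occurrence counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # p_i is the share of occurrences in interval i; empty intervals contribute nothing.
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical occurrence counts for four data intervals in ascending order.
g0 = information_entropy([5, 5, 5, 5])
print(g0)  # 2.0 for a uniform distribution over four intervals
```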
S102. Pre-merge the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and calculate the information entropy of each pre-merged data interval.
In this embodiment, the preset merge rule merges the data intervals in the discrete data set in a preset manner, for example by merging two adjacent data intervals in the discrete data set, or by merging two alternate data intervals in the discrete data set. It should be noted that only one preset merge rule is used within the same embodiment; for example, if merging two adjacent data intervals is used, all subsequent rounds of merging use the same approach.
For example, the discrete data set is S0, which includes a plurality of data intervals denoted S00, S01, S02, ..., S0n. S00 and S01, and S01 and S02, are each two adjacent data intervals, whereas two alternate data intervals are, for example, S00 and S02, or S01 and S03. Pre-merging two adjacent data intervals in the discrete data set produces new data intervals, such as (S00, S01), (S01, S02), ..., (S0n-1, S0n); these new data intervals are the pre-merged data intervals. The information entropy corresponding to each pre-merged data interval is calculated using the information entropy formula. These information entropies differ in magnitude, and the pre-merged data interval having the largest information entropy is identified.
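As a non-limiting illustration, pre-merging under the adjacent-interval rule can be sketched as follows. The sketch rests on one interpretive assumption: the information entropy "corresponding to" a pre-merged data interval is taken to be the entropy of the whole interval distribution after that pair is merged. The text does not state this explicitly, and the function names, base-2 logarithm, and counts are hypothetical.

```python
import math


def information_entropy(counts):
    """Entropy (base 2, an assumption) of per-interval occurrence counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def premerge_adjacent(counts):
    """Return (index, merged_counts, entropy) for every adjacent pair (i, i + 1)."""
    candidates = []
    for i in range(len(counts) - 1):
        merged = counts[:i] + [counts[i] + counts[i + 1]] + counts[i + 2:]
        candidates.append((i, merged, information_entropy(merged)))
    return candidates


def best_premerge(counts):
    """Pick the pre-merged candidate with the largest information entropy."""
    return max(premerge_adjacent(counts), key=lambda cand: cand[2])


# Hypothetical counts for S00..S03; merging S00 and S01 keeps the most entropy here.
index, merged, entropy = best_premerge([1, 1, 4, 6])
print(index, merged)  # 0 [2, 4, 6]
```

Only the candidate generation would change for the alternate-interval rule; the selection by largest entropy stays the same.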
S103. Merge the pre-merged data interval having the largest information entropy in the discrete data set to obtain a target data set, and calculate the information entropy and the interval loss rate of the target data set.
In this embodiment, suppose that in step S102 the pre-merged data interval having the largest information entropy in the discrete data set is found to be (S02, S03), that is, the information entropy corresponding to this pre-merged data interval is larger than that of every other pre-merged data interval. This pre-merged data interval is then actually merged and denoted AS0203, and the result is taken as the target data set. The target data set therefore includes the data intervals S00, S01, AS0203, S04, ..., S0n. Because the data intervals with the largest entropy have been merged, the information entropy of the target data set changes, so the information entropy corresponding to the target data set must be recalculated using the information entropy formula; it is denoted $G_1$.
Because two of the data intervals have actually been merged, the target data set loses data intervals and information entropy relative to the original discrete data set, and the interval loss rate of the target data set can therefore also be calculated.
Specifically, the interval loss rate of the target data set may be calculated using a preset interval loss rate formula, which is:
$L_q = x / N \qquad (1\text{-}2)$
where $L_q$ is the interval loss rate, x is the number of data intervals lost after each merge, and N is the number of data intervals in the discrete data set.
In this embodiment, because this is the first merge, the interval loss rate of the target data set is denoted $L_1$. From the preset interval loss rate formula, the interval loss rate of the target data set is $L_1 = 1/N$.
S104. Calculate the entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set.
In this embodiment, the entropy loss rate of the target data set is calculated from the information entropy of the discrete data set and the information entropy of the target data set using a preset entropy loss rate formula, which is:
$H_q = (G_0 - G) / G_0 \qquad (1\text{-}3)$
where $H_q$ is the entropy loss rate, $G_0$ is the information entropy of the discrete data set, and G is the information entropy of the target data set.
In this embodiment, the entropy loss rate of the target data set is denoted $H_1$; from the preset entropy loss rate formula, the entropy loss rate of the target data set is $H_1 = (G_0 - G_1)/G_0$.
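As a non-limiting illustration, the two loss rates after a first merge can be worked through as follows. The function names and the numeric values (four intervals, entropies of 2.0 and 1.75) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def interval_loss_rate(lost_intervals, total_intervals):
    """Expression 1-2: share of data intervals lost by a merge."""
    return lost_intervals / total_intervals


def entropy_loss_rate(g0, g):
    """Expression 1-3: relative drop in information entropy caused by a merge."""
    return (g0 - g) / g0


# Hypothetical first merge: N = 4 intervals, one interval lost, entropy falls 2.0 -> 1.75.
l1 = interval_loss_rate(1, 4)       # 0.25
h1 = entropy_loss_rate(2.0, 1.75)   # 0.125
print(h1 > l1)  # False: the entropy loss rate has not yet exceeded the interval loss rate
```

With these values the entropy loss rate is still below the interval loss rate, so merging would continue for another round.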
It should be noted that the preset entropy loss rate formula is associated with the preset interval loss rate formula: if N in the preset interval loss rate formula is chosen to change with each merge of data intervals, then $G_0$ in the preset entropy loss rate formula should likewise be chosen to change with each merge, so as to improve the accuracy of the calculation.
S105. Determine whether the entropy loss rate is greater than the interval loss rate.
In this embodiment, it is determined whether the entropy loss rate $H_1$ of the target data set is greater than the interval loss rate $L_1$ of the target data set. If the entropy loss rate is greater than the interval loss rate, step S106 is performed; if the entropy loss rate is not greater than the interval loss rate, step S107 is performed.
S106. Output the target data set to complete the data discretization of the value range of the service data.
In this embodiment, if the entropy loss rate is greater than the interval loss rate, the target data set is output to complete the data discretization of the value range of the service data. Specifically, the target data set may be saved and the storage address information sent to the user, so that the user can retrieve the target data set as needed, for example for data mining or for model training in machine learning.
S107. Set the target data set as the discrete data set and return to the step of pre-merging the data intervals in the discrete data set according to the preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
In this embodiment, if the entropy loss rate is not greater than the interval loss rate, the target data set is taken as the discrete data set and steps S102 to S105 are performed again for the next round of data interval merging. This loop repeats until the entropy loss rate is greater than the interval loss rate, at which point merging stops; the target data set for which the entropy loss rate is greater than the interval loss rate is the final data discretization result.
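As a non-limiting illustration, the loop formed by steps S101 to S107 can be sketched as follows under the adjacent-interval merge rule. Several points are assumptions rather than statements of the application: each data interval is represented only by its occurrence count, the logarithm is base 2, the entropy of a pre-merged data interval is read as the entropy of the whole distribution after that merge, and both N and the reference entropy are updated every round (the variant in which the target data set replaces the discrete data set). The function names are hypothetical.

```python
import math


def information_entropy(counts):
    """Entropy (base 2, an assumption) of per-interval occurrence counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def merge_until_loss_exceeds(counts):
    """Merge adjacent intervals until the entropy loss rate exceeds the interval loss rate."""
    current = list(counts)
    g_prev = information_entropy(current)       # S101: entropy of the discrete data set
    while len(current) > 1 and g_prev > 0:
        n = len(current)
        # S102: pre-merge every adjacent pair to build the candidate sets.
        candidates = [
            current[:i] + [current[i] + current[i + 1]] + current[i + 2:]
            for i in range(n - 1)
        ]
        # S103: actually merge the candidate with the largest information entropy.
        target = max(candidates, key=information_entropy)
        g = information_entropy(target)
        interval_loss = 1 / n                   # expression 1-2 with x = 1
        entropy_loss = (g_prev - g) / g_prev    # S104: expression 1-3
        if entropy_loss > interval_loss:        # S105
            return target                       # S106: output the target data set
        current, g_prev = target, g             # S107: target becomes the discrete set
    return current


# Hypothetical occurrence counts for four data intervals.
print(merge_until_loss_exceeds([1, 9, 10, 11]))  # [20, 11]
```

In this hypothetical run the first round merges the two leftmost intervals without triggering the stop condition, and the second round produces the output. Keeping N and the reference entropy fixed at their initial values would be the other variant permitted by the text and may stop at a different point.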
In the above embodiment, the value range of service data is discretized into a discrete data set by entropy-based data discretization, wherein the discrete data set includes a plurality of data intervals. The data intervals are merged according to a preset merge rule until the entropy loss rate of the merged data set is greater than the interval loss rate, so that the merged data set has as few discrete intervals as possible while its entropy remains as large as possible. This improves the effect of data discretization and benefits data mining and machine learning.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a data discretization method according to another embodiment of the present application. The data discretization method is specifically an entropy-based data discretization and can run in a terminal or a server to discretize continuous attributes of data. As shown in FIG. 2, the data discretization method includes steps S201 to S209.
S201. Acquire service data of a target service and determine the value range of the service data.
In this embodiment, the value range of the service data may be determined by, for example, interception according to the user's selection, or by interception using a preset interception window, which may be set by the user according to actual needs. The value range is the valid range of the service data and can reflect certain characteristics of the service data.
S202. Process the value range of the service data according to a preset processing rule.
In this embodiment, processing the value range of the service data according to the preset processing rule includes performing filtering and noise reduction, normalization, or the like on the value range of the service data, so that the subsequently discretized data can be better applied to data mining or machine learning. The filtering and noise reduction or normalization uses existing methods and is not described in detail here.
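As a non-limiting illustration of one existing normalization method, min-max scaling can be sketched as follows. The application does not specify which normalization is used, so the choice of min-max scaling and the name `min_max_normalize` are assumptions.

```python
def min_max_normalize(values):
    """Scale values linearly into [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical service data values.
print(min_max_normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]
```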
S203. Perform entropy-based data discretization to discretize the value range of the service data into a corresponding discrete data set, and calculate the information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals.
In this embodiment, entropy-based data discretization discretizes the value range of the service data into a corresponding discrete data set that includes a plurality of data intervals. The plurality of data intervals are sorted and their occurrence counts are tallied, and the information entropy of the discrete data set can then be calculated according to the information entropy formula.
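As a non-limiting illustration, tallying the occurrence counts of sorted data intervals can be sketched as follows. The representation of the intervals by a sorted list of split points, the half-open interval convention, and the name `interval_counts` are assumptions.

```python
from bisect import bisect_right


def interval_counts(values, split_points):
    """Count how many values fall into each interval delimited by sorted split points.

    A value equal to a split point is assigned to the interval on its right.
    """
    counts = [0] * (len(split_points) + 1)
    for v in values:
        counts[bisect_right(split_points, v)] += 1
    return counts


# Hypothetical values and split points: intervals (-inf, 5), [5, 15), [15, +inf).
print(interval_counts([1, 2, 3, 10, 11, 12, 20], [5, 15]))  # [3, 3, 1]
```

The resulting counts, already in ascending interval order, are the input that the information entropy formula needs.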
S204. Pre-merge two adjacent data intervals in the discrete data set to obtain a plurality of pre-merged data intervals, and calculate the information entropy of each pre-merged data interval.
In this embodiment, for example, the discrete data set is S0, which includes a plurality of data intervals denoted S00, S01, S02, ..., S0n. Pre-merging two adjacent data intervals in the discrete data set produces new data intervals, such as (S00, S01), (S01, S02), ..., (S0n-1, S0n); these new data intervals are the pre-merged data intervals. The information entropy corresponding to each pre-merged data interval is calculated using the information entropy formula. These information entropies differ in magnitude, and the pre-merged data interval having the largest information entropy is identified.
S205. Merge the pre-merged data interval having the largest information entropy in the discrete data set to obtain a target data set, and calculate the information entropy and the interval loss rate of the target data set.
In this embodiment, because two of the data intervals, namely the pre-merged data interval having the largest information entropy in the discrete data set, have actually been merged, the target data set loses data intervals and information entropy relative to the original discrete data set. The information entropy of the target data set and the corresponding interval loss rate must therefore be calculated, using the interval loss rate formula of the above embodiment.
S206. Calculate the entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set.
In this embodiment, the entropy loss rate of the target data set is calculated from the information entropy of the discrete data set and the information entropy of the target data set using expression 1-3.
S207. Determine whether the entropy loss rate is greater than the interval loss rate.
In this embodiment, determining whether the entropy loss rate is greater than the interval loss rate yields two possible results. Specifically, if the entropy loss rate is greater than the interval loss rate, step S208 is performed; if the entropy loss rate is not greater than the interval loss rate, step S209 is performed.
S208. If the entropy loss rate is greater than the interval loss rate, output the target data set to complete the data discretization of the value range of the service data.
In this embodiment, if the entropy loss rate is greater than the interval loss rate, the target data set is output to complete the data discretization of the value range of the service data. Specifically, the target data set may be saved and the storage address information sent to the user, so that the user can retrieve the target data set as needed, for example for data mining or for model training in machine learning.
S209. If the entropy loss rate is not greater than the interval loss rate, set the target data set as the discrete data set and return to the step of pre-merging the data intervals in the discrete data set according to the preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
In this embodiment, if the entropy loss rate is not greater than the interval loss rate, the target data set is taken as the discrete data set and steps S204 to S207 are performed again for the next round of data interval merging. This loop repeats until the entropy loss rate is greater than the interval loss rate, at which point merging stops; the target data set for which the entropy loss rate is greater than the interval loss rate is the final data discretization result.
Referring to FIG. 3, FIG. 3 is a schematic block diagram of a data discretization apparatus according to an embodiment of the present application. The data discretization apparatus 300 may be installed in a server or a terminal. As shown in FIG. 3, the data discretization apparatus 300 includes a discrete generation calculation unit 301, a first merge calculation unit 302, a second merge calculation unit 303, an entropy loss rate calculation unit 304, a loss rate determination unit 305, a data set output unit 306, and a return loop execution unit 307.
The discrete generation calculation unit 301 is configured to perform entropy-based data discretization to discretize the value range of service data into a corresponding discrete data set, and to calculate the information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals.
The first merge calculation unit 302 is configured to pre-merge the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and to calculate the information entropy of each pre-merged data interval.
The second merge calculation unit 303 is configured to merge the pre-merged data interval having the largest information entropy in the discrete data set to obtain a target data set, and to calculate the information entropy and the interval loss rate of the target data set.
The entropy loss rate calculation unit 304 is configured to calculate the entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set.
The loss rate determination unit 305 is configured to determine whether the entropy loss rate is greater than the interval loss rate.
Specifically, if the loss rate determination unit 305 determines that the entropy loss rate is greater than the interval loss rate, the data set output unit 306 is invoked; if the loss rate determination unit 305 determines that the entropy loss rate is not greater than the interval loss rate, the return loop execution unit 307 is invoked.
The data set output unit 306 is configured to output the target data set to complete the data discretization of the value range of the service data.
The return loop execution unit 307 is configured to set the target data set as the discrete data set and return to the step of pre-merging the data intervals in the discrete data set according to the preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
Referring to FIG. 4, FIG. 4 is a schematic block diagram of a data discretization apparatus according to another embodiment of the present application. The data discretization apparatus 400 may be installed in a server or a terminal. As shown in FIG. 4, the data discretization apparatus 400 includes a value range determination unit 401, a value range processing unit 402, a discrete generation calculation unit 403, a first merge calculation unit 404, a second merge calculation unit 405, an entropy loss rate calculation unit 406, a loss rate determination unit 407, a data set output unit 408, and a return loop execution unit 409.
The value range determination unit 401 is configured to acquire service data of a target service and determine the value range of the service data.
The value range processing unit 402 is configured to process the value range of the service data according to a preset processing rule.
The discrete generation calculation unit 403 is configured to perform entropy-based data discretization to discretize the value range of the service data into a corresponding discrete data set, and to calculate the information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals.
The first merge calculation unit 404 is configured to pre-merge two adjacent data intervals in the discrete data set to obtain a plurality of pre-merged data intervals, and to calculate the information entropy of each pre-merged data interval.
The second merge calculation unit 405 is configured to merge the pre-merged data interval having the largest information entropy in the discrete data set to obtain a target data set, and to calculate the information entropy and the interval loss rate of the target data set.
The entropy loss rate calculation unit 406 is configured to calculate the entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set.
The loss rate determination unit 407 is configured to determine whether the entropy loss rate is greater than the interval loss rate.
Specifically, if the loss rate determination unit 407 determines that the entropy loss rate is greater than the interval loss rate, the data set output unit 408 is invoked; if the loss rate determination unit 407 determines that the entropy loss rate is not greater than the interval loss rate, the return loop execution unit 409 is invoked.
The data set output unit 408 is configured to output the target data set, if the entropy loss rate is greater than the interval loss rate, to complete the data discretization of the value range of the service data.
The return loop execution unit 409 is configured to, if the entropy loss rate is not greater than the interval loss rate, set the target data set as the discrete data set and return to the step of pre-merging the data intervals in the discrete data set according to the preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the data discretization apparatus and units described above correspond to the processes in the foregoing data discretization method embodiments and are not repeated here.
The above apparatus may be implemented in the form of a computer program that can run on a computer device as shown in FIG. 5.
Referring to FIG. 5, FIG. 5 is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 may be a terminal or a server.
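By way of illustration only, the cooperation of units 403 to 409 can be sketched in Python as follows. The sketch is not part of the application and rests on several assumptions: each data interval is represented as a hypothetical (lower bound, upper bound, sample count) tuple; the probability of an interval is the fraction of samples falling in it; the logarithm is base 2; the "information entropy of a pre-merged data interval" is read as the entropy of the interval set that results from that candidate merge; and the interval loss rate uses x = 1 with N taken as the number of intervals before the merge. All function names are hypothetical.

```python
import math
from typing import List, Tuple

# Assumed representation: (lower bound, upper bound, sample count)
Interval = Tuple[float, float, int]


def information_entropy(intervals: List[Interval]) -> float:
    """Shannon entropy (base 2, an assumption) of the samples over the intervals."""
    total = sum(count for _, _, count in intervals)
    probs = [count / total for _, _, count in intervals if count > 0]
    return -sum(p * math.log2(p) for p in probs)


def pre_merge(intervals: List[Interval]) -> List[List[Interval]]:
    """Return every candidate set obtained by merging one pair of adjacent intervals."""
    candidates = []
    for i in range(len(intervals) - 1):
        (lo, _, c1), (_, hi, c2) = intervals[i], intervals[i + 1]
        candidates.append(intervals[:i] + [(lo, hi, c1 + c2)] + intervals[i + 2:])
    return candidates


def discretize(intervals: List[Interval]) -> List[Interval]:
    """Merge adjacent intervals until the entropy loss rate exceeds the interval loss rate."""
    current = list(intervals)
    while len(current) > 1:
        g0 = information_entropy(current)        # entropy of the discrete data set
        if g0 == 0:                              # guard added here to avoid division by zero
            break
        # Candidate with the largest entropy becomes the target data set
        target = max(pre_merge(current), key=information_entropy)
        g = information_entropy(target)          # entropy of the target data set
        interval_loss = 1 / len(current)         # L_q = x / N with x = 1 (assumption)
        entropy_loss = (g0 - g) / g0             # H_q = (G_0 - G) / G_0
        if entropy_loss > interval_loss:
            return target                        # output the target data set
        current = target                         # otherwise loop with the target set
    return current
```

For three equally populated intervals, `discretize([(0, 1, 1), (1, 2, 1), (2, 3, 1)])` stops after the first merge, because the entropy loss rate (about 0.42) already exceeds the interval loss rate (1/3). Other readings of x and N are possible and would change when the loop terminates.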
Referring to FIG. 5, the computer device 500 includes a processor 520, a memory, and a network interface 550 connected through a system bus 510, where the memory may include a non-volatile storage medium 530 and an internal memory 540.
The non-volatile storage medium 530 may store an operating system 531 and a computer program 532. When executed, the computer program 532 causes the processor 520 to perform a data discretization method.
The processor 520 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 540 provides an environment for running the computer program stored in the non-volatile storage medium; when executed by the processor 520, the computer program causes the processor 520 to perform a data discretization method.
The network interface 550 is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in FIG. 5 is merely a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 520 is configured to run program code stored in the memory to implement the process steps of the data discretization method in the foregoing embodiments.
It should be understood that, in the embodiments of the present application, the processor 520 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Those skilled in the art will understand that the structure of the computer device 500 shown in FIG. 5 does not limit the computer device 500, which may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
Those of ordinary skill in the art will understand that all or part of the processes of the methods in the foregoing embodiments may be completed by a computer program instructing related hardware. The program may be stored in a storage medium, which is a computer-readable storage medium. In the embodiments of the present application, the program may be stored in a storage medium of a computer system and executed by at least one processor of the computer system to implement the process steps of the foregoing method embodiments.
The computer-readable storage medium may be a magnetic disk, an optical disc, a USB flash drive, a removable hard disk, a random access memory (RAM), or any other medium that can store program code.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
The steps in the methods of the embodiments of the present application may be reordered, merged, or deleted according to actual needs.
The units in the apparatus of the embodiments of the present application may be merged, divided, or deleted according to actual needs.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
The foregoing describes only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be determined by the scope of protection of the claims.

Claims (20)

  1. A data discretization method, comprising:
    performing entropy-based data discretization by discretizing a value range of service data to generate a corresponding discrete data set, and calculating an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;
    pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and calculating an information entropy of the pre-merged data intervals;
    merging the pre-merged data interval having the largest information entropy in the discrete data set to form a target data set, and calculating an information entropy and an interval loss rate of the target data set;
    calculating an entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set;
    determining whether the entropy loss rate is greater than the interval loss rate;
    if the entropy loss rate is greater than the interval loss rate, outputting the target data set to complete the data discretization of the value range of the service data.
  2. The data discretization method according to claim 1, wherein, after the determining whether the entropy loss rate is greater than the interval loss rate, the method further comprises:
    if the entropy loss rate is not greater than the interval loss rate, setting the target data set as the discrete data set and returning to the step of pre-merging the data intervals in the discrete data set according to the preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
  3. The data discretization method according to claim 1, wherein the pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals comprises:
    pre-merging two adjacent data intervals in the discrete data set to obtain a plurality of pre-merged data intervals.
  4. The data discretization method according to claim 1, wherein the calculating the information entropy of the discrete data set and the calculating the information entropy of the pre-merged data intervals comprise:
    calculating the information entropy of the discrete data set and the information entropy of the pre-merged data intervals by using an information entropy formula, the information entropy formula being:
    Figure PCTCN2018077137-appb-100001
    where n is a positive integer greater than 1, i is a positive integer between 1 and n, p_i is the probability of occurrence of the i-th data item, and H(p) is the information entropy.
  5. The data discretization method according to claim 4, wherein the calculating the interval loss rate of the target data set comprises: calculating the interval loss rate of the target data set by using a preset interval loss rate formula, the preset interval loss rate formula being:
    L_q = x/N
    where L_q is the interval loss rate, x is the number of data intervals lost after each merge, and N is the number of data intervals in the discrete data set;
    the calculating an entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set comprises: calculating the entropy loss rate of the target data set from the information entropy of the discrete data set and the information entropy of the target data set by using a preset entropy loss rate formula, the preset entropy loss rate formula being:
    H_q = (G_0 - G)/G_0
    where H_q is the entropy loss rate, G_0 is the information entropy of the discrete data set, and G is the information entropy of the target data set.
  6. The data discretization method according to claim 1, wherein, before the performing entropy-based data discretization by discretizing the value range of the service data to generate the corresponding discrete data set, the method further comprises:
    acquiring service data of a target service and determining the value range of the service data;
    processing the value range of the service data according to a preset processing rule.
  7. The data discretization method according to claim 6, wherein the processing the value range of the service data according to a preset processing rule comprises: performing filtering and noise reduction processing or normalization processing on the value range of the service data.
  8. A data discretization apparatus, comprising:
    a discrete generation calculation unit, configured to perform entropy-based data discretization by discretizing a value range of service data to generate a corresponding discrete data set, and to calculate an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;
    a first merge calculation unit, configured to pre-merge the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and to calculate an information entropy of the pre-merged data intervals;
    a second merge calculation unit, configured to merge the pre-merged data interval having the largest information entropy in the discrete data set to form a target data set, and to calculate an information entropy and an interval loss rate of the target data set;
    an entropy loss rate calculation unit, configured to calculate an entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set;
    a loss rate determining unit, configured to determine whether the entropy loss rate is greater than the interval loss rate;
    a data set output unit, configured to, if the entropy loss rate is greater than the interval loss rate, output the target data set to complete the data discretization of the value range of the service data.
  9. The data discretization apparatus according to claim 8, further comprising:
    a return loop execution unit, configured to, if the entropy loss rate is not greater than the interval loss rate, set the target data set as the discrete data set and return to the step of pre-merging the data intervals in the discrete data set according to the preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
  10. The data discretization apparatus according to claim 8, wherein the first merge calculation unit is specifically configured to pre-merge two adjacent data intervals in the discrete data set to obtain a plurality of pre-merged data intervals.
  11. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    performing entropy-based data discretization by discretizing a value range of service data to generate a corresponding discrete data set, and calculating an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;
    pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and calculating an information entropy of the pre-merged data intervals;
    merging the pre-merged data interval having the largest information entropy in the discrete data set to form a target data set, and calculating an information entropy and an interval loss rate of the target data set;
    calculating an entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set;
    determining whether the entropy loss rate is greater than the interval loss rate;
    if the entropy loss rate is greater than the interval loss rate, outputting the target data set to complete the data discretization of the value range of the service data.
  12. The computer device according to claim 11, wherein the processor, when executing the computer program, implements the following steps:
    if the entropy loss rate is not greater than the interval loss rate, setting the target data set as the discrete data set and returning to the step of pre-merging the data intervals in the discrete data set according to the preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
  13. The computer device according to claim 11, wherein the processor, when executing the computer program, implements the following steps:
    pre-merging two adjacent data intervals in the discrete data set to obtain a plurality of pre-merged data intervals.
  14. The computer device according to claim 11, wherein the processor, when executing the computer program, implements the following steps:
    calculating the information entropy of the discrete data set and the information entropy of the pre-merged data intervals by using an information entropy formula, the information entropy formula being:
    Figure PCTCN2018077137-appb-100002
    where n is a positive integer greater than 1, i is a positive integer between 1 and n, p_i is the probability of occurrence of the i-th data item, and H(p) is the information entropy.
  15. The computer device according to claim 14, wherein the processor, when executing the computer program, implements the following steps:
    calculating the interval loss rate of the target data set by using a preset interval loss rate formula, the preset interval loss rate formula being:
    L_q = x/N
    where L_q is the interval loss rate, x is the number of data intervals lost after each merge, and N is the number of data intervals in the discrete data set;
    calculating the entropy loss rate of the target data set from the information entropy of the discrete data set and the information entropy of the target data set by using a preset entropy loss rate formula, the preset entropy loss rate formula being:
    H_q = (G_0 - G)/G_0
    where H_q is the entropy loss rate, G_0 is the information entropy of the discrete data set, and G is the information entropy of the target data set.
  16. A storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the following steps:
    performing entropy-based data discretization by discretizing a value range of service data to generate a corresponding discrete data set, and calculating an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;
    pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and calculating an information entropy of the pre-merged data intervals;
    merging the pre-merged data interval having the largest information entropy in the discrete data set to form a target data set, and calculating an information entropy and an interval loss rate of the target data set;
    calculating an entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set;
    determining whether the entropy loss rate is greater than the interval loss rate;
    if the entropy loss rate is greater than the interval loss rate, outputting the target data set to complete the data discretization of the value range of the service data.
  17. The storage medium according to claim 16, wherein the program instructions, when executed by a processor, cause the processor to perform the following steps:
    if the entropy loss rate is not greater than the interval loss rate, setting the target data set as the discrete data set and returning to the step of pre-merging the data intervals in the discrete data set according to the preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
  18. The storage medium according to claim 16, wherein the program instructions, when executed by a processor, cause the processor to perform the following steps:
    pre-merging two adjacent data intervals in the discrete data set to obtain a plurality of pre-merged data intervals.
  19. The storage medium according to claim 16, wherein the program instructions, when executed by a processor, cause the processor to perform the following steps:
    calculating the information entropy of the discrete data set and the information entropy of the pre-merged data intervals by using an information entropy formula, the information entropy formula being:
    Figure PCTCN2018077137-appb-100003
    where n is a positive integer greater than 1, i is a positive integer between 1 and n, p_i is the probability of occurrence of the i-th data item, and H(p) is the information entropy.
  20. The storage medium according to claim 19, wherein the program instructions, when executed by a processor, cause the processor to perform the following steps:
    calculating the interval loss rate of the target data set by using a preset interval loss rate formula, the preset interval loss rate formula being:
    L_q = x/N
    where L_q is the interval loss rate, x is the number of data intervals lost after each merge, and N is the number of data intervals in the discrete data set;
    calculating the entropy loss rate of the target data set from the information entropy of the discrete data set and the information entropy of the target data set by using a preset entropy loss rate formula, the preset entropy loss rate formula being:
    H_q = (G_0 - G)/G_0
    where H_q is the entropy loss rate, G_0 is the information entropy of the discrete data set, and G is the information entropy of the target data set.
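
As an illustrative worked example that is not part of the claims, the two loss rates can be evaluated by hand for a small case. The formula image referenced in the claims is not reproduced in this text, so the standard Shannon entropy with a base-2 logarithm is assumed here. It is further assumed that the discrete data set has N = 3 equally probable data intervals, that one merge of two adjacent intervals loses x = 1 interval, and that N counts the intervals before the merge.

```latex
% Assumed form of the information entropy (standard Shannon entropy, base 2)
H(p) = -\sum_{i=1}^{n} p_i \log_2 p_i

% Discrete data set: probabilities (1/3, 1/3, 1/3)
G_0 = \log_2 3 \approx 1.585

% Target data set after merging two adjacent intervals: probabilities (2/3, 1/3)
G = \log_2 3 - \tfrac{2}{3} \approx 0.918

% Entropy loss rate and interval loss rate
H_q = \frac{G_0 - G}{G_0} = \frac{2/3}{\log_2 3} \approx 0.421,
\qquad
L_q = \frac{x}{N} = \frac{1}{3} \approx 0.333
```

Under these assumptions H_q is greater than L_q, so the target data set with two intervals would be output rather than merged further.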
PCT/CN2018/077137 2018-01-12 2018-02-24 Data discretisation method and apparatus, computer device and storage medium WO2019136799A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810031540.4A CN108170837A (en) 2018-01-12 2018-01-12 Method of Data Discretization, device, computer equipment and storage medium
CN201810031540.4 2018-01-12

Publications (1)

Publication Number Publication Date
WO2019136799A1 true WO2019136799A1 (en) 2019-07-18

Family

ID=62514636

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/077137 WO2019136799A1 (en) 2018-01-12 2018-02-24 Data discretisation method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN108170837A (en)
WO (1) WO2019136799A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021251801A1 (en) * 2020-06-12 2021-12-16 한국전기연구원 Temperature discretization digital device

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10922139B2 (en) * 2018-10-11 2021-02-16 Visa International Service Association System, method, and computer program product for processing large data sets by balancing entropy between distributed data segments
CN112418258A (en) * 2019-08-22 2021-02-26 北京京东振世信息技术有限公司 Feature discretization method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120226709A1 (en) * 2011-01-05 2012-09-06 The Board Of Trustees Of The University Of Illinois Automated prostate tissue referencing for cancer detection and diagnosis
CN106407304A (en) * 2016-08-30 2017-02-15 北京大学 Mutual information-based data discretization and feature selection integrated method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QUE, XIA ET AL.: "A Method of Discrtization of Continuous Attributes Based on Interval Class-Entropy", PROGRESS OF COMPUTER TECHNOLOGY AND APPLICATON IN 2006, 1 July 2006 (2006-07-01) *

Also Published As

Publication number Publication date
CN108170837A (en) 2018-06-15

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15/10/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18899821

Country of ref document: EP

Kind code of ref document: A1