WO2019136799A1

WO2019136799A1 - Data discretisation method and apparatus, computer device and storage medium

Info

Publication number: WO2019136799A1
Application number: PCT/CN2018/077137
Authority: WO
Inventors: 晏存
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-01-12
Filing date: 2018-02-24
Publication date: 2019-07-18
Also published as: CN108170837A

Abstract

Disclosed in the present application are a data discretisation method and apparatus, a computer device, and a storage medium, the method comprising: discretising the value range of service data into a discrete data set by entropy-based data discretisation, the discrete data set comprising a plurality of data intervals; using a preset merging rule to merge the data intervals until the entropy loss rate of the merged data set is greater than the interval loss rate; and outputting a target data set to complete the data discretisation of the service data.

Description

Data discretization method, device, computer device and storage medium

This application claims priority to Chinese Patent Application No. 201810031540.4, filed on Jan. 12, 2018, entitled "Data Discretization Method, Apparatus, Computer Equipment, and Storage Media", the entire contents of which are incorporated into this application by reference.

Technical field

The present application relates to the field of data processing technologies, and in particular, to a data discretization method, apparatus, computer device, and storage medium.

Background technique

At present, in the era of big data, databases are growing ever larger, and people urgently need to mine huge databases for valuable information. Since most collected data is continuous, data discretization has become a key technique for knowledge discovery and rule extraction. The discretization of continuous attributes is an important preprocessing step in data mining and machine learning, and it directly affects the learning result. In classification algorithms, discretizing the training sample set has a dual significance. On the one hand, it can effectively reduce the complexity of the learning algorithm, speed up learning, and even improve classification accuracy; on the other hand, it can simplify and summarize the acquired knowledge and improve the comprehensibility of the classification results. The discretization problem has therefore been studied extensively and in depth. Equal-width and equal-frequency interval methods are common discretization algorithms. Although they are easy to implement, they neglect the distribution of the samples, so it is difficult to place the interval boundaries at the most suitable positions, and in most cases satisfactory results cannot be achieved.

Summary of the invention

The application provides a data discretization method, device, computer device and storage medium to improve the training effect of machine learning.

In a first aspect, the present application provides a data discretization method, the method comprising:

Discretizing the value range of service data by entropy-based data discretization to generate a corresponding discrete data set, and calculating an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;

Pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and calculating an information entropy of each pre-merged data interval;

Merging the pre-merged data interval having the largest information entropy in the discrete data set to obtain a target data set, and calculating an information entropy and an interval loss rate of the target data set;

Calculating an entropy loss rate of the target data set according to information entropy of the discrete data set and information entropy of the target data set;

Determining whether the entropy loss rate is greater than the interval loss rate;

If the entropy loss rate is greater than the interval loss rate, the target data set is output to complete data discretization of the value range of the service data.

In a second aspect, the present application provides a data discretization device, the device comprising:

a discrete generation calculation unit, configured to discretize a value range of service data by entropy-based data discretization to generate a corresponding discrete data set, and calculate an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;

a first merge calculation unit, configured to pre-merge the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and calculate an information entropy of each pre-merged data interval;

a second merge calculation unit, configured to merge the pre-merged data interval having the largest information entropy in the discrete data set to obtain a target data set, and calculate an information entropy and an interval loss rate of the target data set;

an entropy loss rate calculation unit, configured to calculate an entropy loss rate of the target data set according to the information entropy of the discrete data set and the information entropy of the target data set;

a loss rate determining unit, configured to determine whether the entropy loss rate is greater than the interval loss rate;

and a data set output unit, configured to output the target data set to complete data discretization of the value range of the service data if the entropy loss rate is greater than the interval loss rate.

In a third aspect, the present application also provides a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements any one of the data discretization methods provided in the present application.

In a fourth aspect, the present application also provides a storage medium, wherein the storage medium stores a computer program, the computer program comprising program instructions which, when executed by a processor, cause the processor to execute any one of the data discretization methods provided in the present application.

The embodiments of the present application discretize the value range of the service data into a discrete data set by entropy-based data discretization, wherein the discrete data set includes a plurality of data intervals, and merge the data intervals according to a preset merge rule until the entropy loss rate of the merged data set is greater than the interval loss rate. In this way the merged data set has as few discrete intervals as possible while retaining as much entropy as possible, which improves the effect of data discretization and facilitates data mining and machine learning.

DRAWINGS

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly described below. Obviously, the drawings in the following description show some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative work.

FIG. 1 is a schematic flow chart of a data discretization method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart of a data discretization method according to another embodiment of the present application;

FIG. 3 is a schematic block diagram of a data discretization apparatus according to an embodiment of the present application;

FIG. 4 is a schematic block diagram of a data discretization apparatus according to another embodiment of the present application;

FIG. 5 is a schematic block diagram of a computer device according to an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative work fall within the scope of protection of the present application.

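It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.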

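The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting. As used in the specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.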

It is further understood that the term "and/or" used in the specification and the appended claims means any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

Please refer to FIG. 1. FIG. 1 is a schematic flow chart of a data discretization method according to an embodiment of the present application. As shown in FIG. 1, the data discretization method includes steps S101 to S107.

S101. Discretize the value range of service data by entropy-based data discretization to generate a corresponding discrete data set, and calculate an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals.

In this embodiment, the attribute of the service data is a continuous attribute. Using entropy-based data discretization, the continuous value range is divided into multiple sub-intervals. These sub-intervals are the data intervals, and the data intervals together form the discrete data set.

When the value range of the service data is discretized by entropy-based data discretization to generate the corresponding discrete data set, a split point may be determined first, and the continuous values are then discretized according to the split point. For example, for an attribute A to be discretized, the value of A that gives the smallest entropy is selected as the split point, and the resulting data intervals are divided recursively to obtain the discrete data set.
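The split-point selection described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the present application: it assumes that each sample carries a class label, that entropy is measured on those labels with a base-2 logarithm, and that candidate split points lie midway between neighbouring sorted values. The function names and the `depth` stopping rule are hypothetical.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy (base 2, an assumption) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the candidate split point with the smallest weighted class entropy."""
    pairs = sorted(zip(values, labels))
    best_point, best_score = None, float("inf")
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no boundary can be placed between equal values
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        score = (len(left) * class_entropy(left)
                 + len(right) * class_entropy(right)) / len(pairs)
        if score < best_score:
            best_point = (pairs[k - 1][0] + pairs[k][0]) / 2
            best_score = score
    return best_point

def split_points(values, labels, depth=2):
    """Recursively collect split points; `depth` is a hypothetical stopping rule."""
    if depth == 0 or len(set(labels)) < 2:
        return []
    point = best_split(values, labels)
    if point is None:
        return []
    left = [(v, lab) for v, lab in zip(values, labels) if v <= point]
    right = [(v, lab) for v, lab in zip(values, labels) if v > point]
    left_values, left_labels = zip(*left)
    right_values, right_labels = zip(*right)
    return sorted(split_points(left_values, left_labels, depth - 1)
                  + [point]
                  + split_points(right_values, right_labels, depth - 1))

points = split_points([1, 2, 3, 10, 11, 12], list("aaabbb"))
```

The returned split points are the boundaries of the data intervals that make up the discrete data set.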

The information entropy of the discrete data set is calculated by using the calculation formula of information entropy, which is:

$$H(p) = -\sum_{i=1}^{n} p_i \log p_i \tag{1-1}$$

In Expression 1-1, n is a positive integer greater than 1, i is a positive integer between 1 and n, $p_i$ is the probability of occurrence of the i-th data, and H(p) is the information entropy.

Specifically, to calculate the information entropy of the discrete data set with the calculation formula of information entropy, the data intervals are first arranged in ascending order and the number of occurrences of each data interval is counted. The probability distribution of the data intervals is calculated from these counts, and the information entropy of the discrete data set is then calculated from the probabilities by using Expression 1-1 and recorded as $G_0$.
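The counting and entropy calculation just described can be sketched as follows. The sketch assumes a base-2 logarithm and half-open intervals in which a value equal to a boundary falls into the interval on its right; the source specifies neither. The function names are hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter

def interval_counts(values, boundaries):
    """Count the values falling into each interval defined by sorted boundaries."""
    counts = Counter(bisect_right(boundaries, v) for v in values)
    return [counts.get(i, 0) for i in range(len(boundaries) + 1)]

def information_entropy(counts):
    """Expression 1-1 applied to interval counts; empty intervals contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

counts = interval_counts([1, 2, 3, 10, 11, 12, 20, 21], [5, 15])
g0 = information_entropy(counts)
```

Here `counts` holds the number of occurrences per data interval in ascending order, and `g0` plays the role of the information entropy of the discrete data set.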

S102. Pre-merge the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and calculate an information entropy of each pre-merged data interval.

In this embodiment, the preset merge rule specifies the manner in which the data intervals in the discrete data set are merged, for example merging two adjacent data intervals in the discrete data set, or merging two alternate data intervals in the discrete data set. It should be noted that only one preset merge rule is used within the same embodiment. For example, if two adjacent data intervals in the discrete data set are merged in the first round, then the subsequent rounds of loop merging also merge two adjacent data intervals in the merged discrete data set.

For example, the discrete data set is S0, which includes a plurality of data intervals denoted S00, S01, S02, ..., S0n. S00 and S01, and S01 and S02, are pairs of adjacent data intervals; S00 and S02, and S01 and S03, are pairs of alternate data intervals. Merging each pair of adjacent data intervals in the discrete data set generates new data intervals such as (S00, S01), (S01, S02), ..., (S0n-1, S0n); these new data intervals are the pre-merged data intervals. The information entropy corresponding to each pre-merged data interval is calculated by the calculation formula of information entropy. These values differ in size, and the pre-merged data interval with the largest information entropy is identified.
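The adjacent-interval rule can be sketched as follows. The source does not state exactly how the information entropy "corresponding to" a pre-merged data interval is computed; this sketch assumes it is the entropy of the whole data set as it would look after that trial merge, with a base-2 logarithm. The function names are hypothetical.

```python
import math

def information_entropy(counts):
    """Expression 1-1 applied to interval counts (base-2 logarithm assumed)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def pre_merge_adjacent(counts):
    """Return (index, merged_counts, entropy) for every pair of adjacent intervals."""
    candidates = []
    for i in range(len(counts) - 1):
        merged = counts[:i] + [counts[i] + counts[i + 1]] + counts[i + 2:]
        candidates.append((i, merged, information_entropy(merged)))
    return candidates

def best_pre_merge(counts):
    """Pick the trial merge with the largest information entropy."""
    return max(pre_merge_adjacent(counts), key=lambda candidate: candidate[2])

index, merged_counts, entropy = best_pre_merge([1, 1, 4, 4])
```

For the counts `[1, 1, 4, 4]`, merging the two sparsely populated intervals keeps the most entropy, so the first pair is selected.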

S103. Merge the pre-merged data interval having the largest information entropy in the discrete data set to obtain a target data set, and calculate an information entropy and an interval loss rate of the target data set.

In this embodiment, suppose for example that the pre-merged data interval found in step S102 to have the largest information entropy in the discrete data set is (S02, S03), that is, the information entropy corresponding to this pre-merged data interval is larger than that of every other pre-merged data interval. This pre-merged data interval is then actually merged and recorded as AS0203; in other words, the pre-merged data interval having the largest information entropy in the discrete data set is merged to obtain the target data set. The data intervals included in the target data set are therefore S00, S01, AS0203, S04, ..., S0n. Because the pre-merged data interval with the largest information entropy has been merged, the information entropy of the target data set changes, so it must be recalculated with the calculation formula of information entropy and is recorded as $G_1$.

Because two data intervals are actually merged, the target data set loses both data intervals and information entropy relative to the original discrete data set, and the interval loss rate of the target data set can therefore also be calculated.

Specifically, the interval loss rate of the target data set may be calculated by using a preset interval loss rate formula, where the preset interval loss rate formula is:

$$L_q = \frac{x}{N} \tag{1-2}$$

where $L_q$ is the interval loss rate, x is the number of data intervals lost after each merge, and N is the number of data intervals of the discrete data set.

In this embodiment, since this is the first merge, the interval loss rate of the target data set is denoted $L_1$. From the preset interval loss rate formula, $L_1 = 1/N$.

S104. Calculate an entropy loss rate of the target data set according to information entropy of the discrete data set and information entropy of the target data set.

In this embodiment, the entropy loss rate of the target data set is calculated from the information entropy of the discrete data set and the information entropy of the target data set by using a preset entropy loss rate formula, which is:

$$H_q = \frac{G_0 - G}{G_0} \tag{1-3}$$

where $H_q$ is the entropy loss rate, $G_0$ is the information entropy of the discrete data set, and G is the information entropy of the target data set.

In this embodiment, the entropy loss rate of the target data set is recorded as $H_1$. From the preset entropy loss rate formula, $H_1 = (G_0 - G_1)/G_0$.

It should be noted that the preset entropy loss rate formula is tied to the preset interval loss rate formula: if N in the preset interval loss rate formula is updated after each merge of data intervals, then $G_0$ in the preset entropy loss rate formula must also be updated after each merge, to keep the calculation accurate.
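As a small worked illustration of Expressions 1-2 and 1-3, consider a hypothetical discrete data set of four equally frequent data intervals. The numbers below assume a base-2 logarithm, and the handling of a zero baseline entropy is an assumption of this sketch, since the source does not cover that case.

```python
def interval_loss_rate(lost_intervals, interval_count):
    """Expression 1-2: share of data intervals lost by a merge."""
    return lost_intervals / interval_count

def entropy_loss_rate(g0, g):
    """Expression 1-3; returns 0.0 when g0 is 0 (an assumption of this sketch)."""
    return (g0 - g) / g0 if g0 > 0 else 0.0

# Four equally frequent intervals have an entropy of 2.0 bits.
# Merging two of them leaves probabilities 0.5, 0.25, 0.25, that is 1.5 bits.
l1 = interval_loss_rate(1, 4)      # 0.25
h1 = entropy_loss_rate(2.0, 1.5)   # 0.25
should_output = h1 > l1            # False, so merging would continue
```

In this example the two loss rates are equal, so the entropy loss rate is not greater than the interval loss rate and another round of merging would follow.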

S105. Determine whether the entropy loss rate is greater than the interval loss rate.

In this embodiment, it is specifically determined whether the entropy loss rate $H_1$ of the target data set is greater than the interval loss rate $L_1$ of the target data set. If the entropy loss rate is greater than the interval loss rate, step S106 is performed; if the entropy loss rate is not greater than the interval loss rate, step S107 is performed.

S106. Output the target data set to complete data discretization of the value range of the service data.

In this embodiment, if the entropy loss rate is greater than the interval loss rate, the target data set is output to complete data discretization of the value range of the service data. Specifically, the target data set may be saved and its storage address sent to the user, so that the user can retrieve the target data set as needed, for example for data mining or for model training in machine learning.

S107. Set the target data set as the discrete data set and return to the step of pre-merging the data intervals in the discrete data set according to the preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.

In this embodiment, if the entropy loss rate is not greater than the interval loss rate, the target data set is taken as the discrete data set and steps S102 to S105 are performed again for the next round of data interval merging. The loop continues until the entropy loss rate is greater than the interval loss rate, at which point merging stops; the target data set for which the entropy loss rate is greater than the interval loss rate is the final result of the data discretization.
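Steps S102 to S107 can be put together in the following sketch, which operates on a list of per-interval counts. It is one possible reading rather than a definitive implementation: it assumes the adjacent-interval merge rule, a base-2 logarithm, that each trial merge is scored by the entropy of the data set it would produce, and that N and the baseline entropy are both refreshed every round, as step S107 suggests. The function names are hypothetical.

```python
import math

def information_entropy(counts):
    """Expression 1-1 applied to interval counts (base-2 logarithm assumed)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def merge_until_loss_exceeds(counts):
    """Merge adjacent intervals until the entropy loss rate exceeds the interval loss rate."""
    current = list(counts)
    while len(current) > 1:
        n = len(current)
        g0 = information_entropy(current)  # entropy of the current discrete data set
        # S102: trial-merge every adjacent pair
        candidates = [
            current[:i] + [current[i] + current[i + 1]] + current[i + 2:]
            for i in range(n - 1)
        ]
        # S103: keep the trial merge with the largest entropy as the target data set
        target = max(candidates, key=information_entropy)
        g = information_entropy(target)
        interval_loss = 1 / n                             # Expression 1-2 with x = 1
        entropy_loss = (g0 - g) / g0 if g0 > 0 else 0.0   # Expression 1-3
        # S105 and S106: output once the entropy loss rate is the greater one
        if entropy_loss > interval_loss:
            return target
        current = target  # S107: the target becomes the new discrete data set
    return current

result = merge_until_loss_exceeds([1, 1, 1, 1, 1])
```

Starting from five equally frequent data intervals, this sketch merges three times and stops with two data intervals, one holding two samples and the other three. The guard on the number of intervals is an addition of the sketch so that the loop always terminates.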

The above embodiment discretizes the value range of the service data into a discrete data set by entropy-based data discretization, wherein the discrete data set includes a plurality of data intervals, and merges the data intervals according to the preset merge rule until the entropy loss rate of the merged data set is greater than the interval loss rate. In this way the merged data set has as few discrete intervals as possible while retaining as much entropy as possible, which improves the effect of data discretization and facilitates data mining and machine learning.

Referring to FIG. 2, FIG. 2 is a schematic flowchart of a data discretization method according to another embodiment of the present application. The data discretization method is specifically based on entropy-based data discretization, and can be run in a terminal or a server to discretize continuous attributes of data. As shown in FIG. 2, the data discretization method includes steps S201 to S209.

S201. Obtain service data of the target service and determine a value range of the service data.

In this embodiment, the value range of the service data may be determined by interception specified by the user, or by interception according to a preset interception window, which the user can set according to actual needs. The value range is the valid range of the service data and can reflect certain characteristics of the service data.

S202. Process the value range of the service data according to a preset processing rule.

In this embodiment, processing the value range of the service data according to the preset processing rule includes filtering, noise reduction, or normalization of the value range of the service data, so that the subsequently discretized data can be better applied to data mining or machine learning. The filtering, noise reduction, and normalization use existing methods and are not described in detail herein.
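As one example of such an existing method, min-max normalization rescales the values to the range from 0 to 1. The source does not say which normalization is used, so this choice and the function name are assumptions made purely for illustration.

```python
def min_max_normalize(values):
    """Rescale values linearly so that the minimum maps to 0.0 and the maximum to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

normalized = min_max_normalize([10, 20, 30])
```

The normalized values can then be passed on to the discretization of step S203.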

S203. Discretize the value range of the service data by entropy-based data discretization to generate a corresponding discrete data set, and calculate an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals.

In this embodiment, the value range of the service data is discretized by entropy-based data discretization to generate a corresponding discrete data set, and the discrete data set includes a plurality of data intervals. The data intervals are sorted and the number of occurrences of each data interval is counted, after which the information entropy of the discrete data set can be calculated according to the calculation formula of information entropy.

S204. Pre-merge two adjacent data intervals in the discrete data set to obtain a plurality of pre-merged data intervals, and calculate an information entropy of each pre-merged data interval.

In this embodiment, for example, the discrete data set is S0, which includes a plurality of data intervals denoted S00, S01, S02, ..., S0n. Merging each pair of adjacent data intervals in the discrete data set generates new data intervals such as (S00, S01), (S01, S02), ..., (S0n-1, S0n); these new data intervals are the pre-merged data intervals. The information entropy corresponding to each pre-merged data interval is calculated by the calculation formula of information entropy. These values differ in size, and the pre-merged data interval with the largest information entropy is identified.

S205. Merge the pre-merged data interval having the largest information entropy in the discrete data set to obtain a target data set, and calculate an information entropy and an interval loss rate of the target data set.

In this embodiment, because two data intervals, namely those of the pre-merged data interval having the largest information entropy in the discrete data set, are actually merged, the target data set loses data intervals and information entropy relative to the original discrete data set. The information entropy of the target data set and the corresponding interval loss rate are therefore calculated. Specifically, the interval loss rate formula of the above embodiment (Expression 1-2) is used for the calculation.

S206. Calculate an entropy loss rate of the target data set according to information entropy of the discrete data set and information entropy of the target data set.

In this embodiment, specifically, the entropy loss rate of the target data set is calculated from the information entropy of the discrete data set and the information entropy of the target data set by using Expression 1-3.

S207. Determine whether the entropy loss rate is greater than the interval loss rate.

In this embodiment, it is determined whether the entropy loss rate is greater than the interval loss rate, which yields one of two results. Specifically, if the entropy loss rate is greater than the interval loss rate, step S208 is performed; if the entropy loss rate is not greater than the interval loss rate, step S209 is performed.

S208. If the entropy loss rate is greater than the interval loss rate, output the target data set to complete data discretization of the value range of the service data.

S209. If the entropy loss rate is not greater than the interval loss rate, set the target data set as the discrete data set and return to the step of pre-merging two adjacent data intervals in the discrete data set to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.

In this embodiment, if the entropy loss rate is not greater than the interval loss rate, the target data set is taken as the discrete data set and steps S204 to S207 are performed again for the next round of data interval merging. The loop continues until the entropy loss rate is greater than the interval loss rate, at which point merging stops; the target data set for which the entropy loss rate is greater than the interval loss rate is the final result of the data discretization.

Please refer to FIG. 3. FIG. 3 is a schematic block diagram of a data discretization apparatus according to an embodiment of the present application. The data discretization device 300 can be installed in a server or a terminal. As shown in FIG. 3, the data discretization apparatus 300 includes: a discrete generation calculation unit 301, a first merge calculation unit 302, a second merge calculation unit 303, an entropy loss rate calculation unit 304, a loss rate determining unit 305, a data set output unit 306, and a return loop execution unit 307.

The discrete generation calculation unit 301 is configured to discretize the value range of the service data by entropy-based data discretization to generate a corresponding discrete data set, and calculate an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals.

The first merge calculation unit 302 is configured to pre-merge the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and calculate an information entropy of the pre-merged data interval.

The second merge calculation unit 303 is configured to merge the pre-merged data interval having the largest information entropy in the discrete data set to obtain a target data set, and calculate an information entropy and an interval loss rate of the target data set.

The entropy loss rate calculation unit 304 is configured to calculate an entropy loss rate of the target data set according to information entropy of the discrete data set and information entropy of the target data set.

The loss rate determining unit 305 is configured to determine whether the entropy loss rate is greater than the interval loss rate.

Specifically, if the loss rate determining unit 305 determines that the entropy loss rate is greater than the interval loss rate, the data set output unit 306 is invoked; if the loss rate determining unit 305 determines that the entropy loss rate is not greater than the interval loss rate, the return loop execution unit 307 is invoked.

The data set output unit 306 is configured to output the target data set to complete data discretization of the value range of the service data.

The return loop execution unit 307 is configured to set the target data set as the discrete data set and return to the step of pre-merging the data intervals in the discrete data set according to the preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.

Please refer to FIG. 4. FIG. 4 is a schematic block diagram of a data discretization apparatus according to another embodiment of the present application. The data discretization device 400 can be installed in a server or a terminal. As shown in FIG. 4, the data discretization apparatus 400 includes: a value range determining unit 401, a value range processing unit 402, a discrete generation calculation unit 403, a first merge calculation unit 404, a second merge calculation unit 405, an entropy loss rate calculation unit 406, a loss rate determining unit 407, a data set output unit 408, and a return loop execution unit 409.

The value range determining unit 401 is configured to obtain service data of the target service and determine a value range of the service data.

The value range processing unit 402 is configured to process the value range of the service data according to a preset processing rule.

The discrete generation calculation unit 403 is configured to discretize the value range of the service data by entropy-based data discretization to generate a corresponding discrete data set, and calculate an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals.

The first merge calculation unit 404 is configured to pre-merge two adjacent data intervals in the discrete data set to obtain a plurality of pre-merged data intervals, and calculate an information entropy of each pre-merged data interval.

The second merge calculation unit 405 is configured to merge the pre-merged data interval having the largest information entropy in the discrete data set to obtain a target data set, and calculate an information entropy and an interval loss rate of the target data set.

The entropy loss rate calculation unit 406 is configured to calculate an entropy loss rate of the target data set according to information entropy of the discrete data set and information entropy of the target data set.

The loss rate determining unit 407 is configured to determine whether the entropy loss rate is greater than the interval loss rate.

Specifically, if the loss rate determining unit 407 determines that the entropy loss rate is greater than the interval loss rate, the data set output unit 408 is invoked; if the loss rate determining unit 407 determines that the entropy loss rate is not greater than the interval loss rate, the return loop execution unit 409 is invoked.

The data set output unit 408 is configured to output the target data set to complete data discretization of the value range of the service data if the entropy loss rate is greater than the interval loss rate.

The return loop execution unit 409 is configured to, if the entropy loss rate is not greater than the interval loss rate, set the target data set as the discrete data set and return to the step of pre-merging two adjacent data intervals in the discrete data set to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.

A person skilled in the art can clearly understand that, for convenience and brevity of description, the specific working process of the data discretization device and units described above can be found in the corresponding process of the foregoing data discretization method embodiments, and details are not repeated here.

The above apparatus may be embodied in the form of a computer program that can be run on a computer device as shown in FIG. 5.

Referring to FIG. 5, FIG. 5 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 can be a terminal or a server.

Referring to FIG. 5, the computer device 500 includes a processor 520, a memory and a network interface 550 connected by a system bus 510, wherein the memory can include a non-volatile storage medium 530 and an internal memory 540.

The non-volatile storage medium 530 can store an operating system 531 and a computer program 532. When the computer program 532 is executed, the processor 520 can be caused to perform a data discretization method.

The processor 520 is used to provide computing and control capabilities to support the operation of the entire computer device 500.

The internal memory 540 provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor 520, the processor 520 is caused to perform a data discretization method.

The network interface 550 is used for network communication, such as sending assigned tasks. It will be understood by those skilled in the art that the structure shown in FIG. 5 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine some components, or have a different arrangement of components.

The processor 520 is configured to run the program code stored in the memory to implement the process steps corresponding to the data discretization method in the foregoing embodiment.

It should be understood that, in the embodiment of the present application, the processor 520 may be a central processing unit (CPU), and the processor 520 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

Those skilled in the art will appreciate that the architecture of the computer device 500 illustrated in FIG. 5 does not constitute a limitation on the computer device 500, which may include more or fewer components than illustrated, combine certain components, or have a different component arrangement.

It will be understood by those skilled in the art that all or part of the processes in the above embodiments may be implemented by a computer program instructing related hardware, and the program may be stored in a storage medium, the storage medium being a computer readable storage medium. In the embodiments of the present application, the program may be stored in a storage medium of a computer system and executed by at least one processor in the computer system to implement the process steps of the method embodiments described above.

The computer readable storage medium may be any medium that can store program code, such as a magnetic disk, an optical disk, a USB flash drive, a mobile hard disk, or a random access memory (RAM).

Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the various examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the various examples have been described generally in terms of function in the above description. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the solution. A person skilled in the art can use different methods to implement the described functions for each particular application, but such implementation should not be considered to be beyond the scope of the present application.

The steps in the method of the embodiment of the present application may be sequentially adjusted, merged, and deleted according to actual needs.

The units in the apparatus of the embodiment of the present application may be combined, divided, and deleted according to actual needs.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, can be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including a number of instructions for causing a computer device (which may be a personal computer, terminal, or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present application.

The foregoing is only a specific embodiment of the present application, but the scope of protection of the present application is not limited thereto. Any equivalent modification or substitution that can be easily conceived by those skilled in the art within the technical scope disclosed in the present application is intended to be included within the scope of protection of the present application. Therefore, the scope of protection of this application should be determined by the scope of protection of the claims.

Claims

A data discretization method comprising:

Entropy-based data discretization, discretizing the value range of the service data to generate a corresponding discrete data set, and calculating an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;

Pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-combined data intervals, and calculating an information entropy of the pre-merged data interval;

Combining the pre-merged data intervals having the largest information entropy in the discrete data set as a target data set, and calculating an information entropy of the target data set and an interval loss rate;

Calculating an entropy loss rate of the target data set according to information entropy of the discrete data set and information entropy of the target data set;

Determining whether the entropy loss rate is greater than the interval loss rate;

If the entropy loss rate is greater than the interval loss rate, the target data set is output to complete data discretization of the value range of the service data.
The data discretization method according to claim 1, wherein the determining whether the entropy loss rate is greater than the interval loss rate further comprises:

If the entropy loss rate is not greater than the interval loss rate, setting the target data set as the discrete data set and returning to the step of pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
The data discretization method according to claim 1, wherein the pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-combined data intervals comprises:

Adjacent two data intervals in the discrete data set are pre-combined to obtain a plurality of pre-combined data intervals.
The data discretization method according to claim 1, wherein the calculating the information entropy of the discrete data set and calculating the information entropy of the pre-merged data interval comprises:

Calculating an information entropy of the discrete data set and calculating an information entropy of the pre-merged data interval by using a calculation formula of information entropy, wherein the calculation formula of the information entropy is:

$$H(p) = -\sum_{i=1}^{n} p_i \log p_i$$

Where n is a positive integer greater than 1, i is a positive integer between 1 and n, $p_i$ is the probability of occurrence of the i-th data, and $H(p)$ is the information entropy.
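
As a minimal illustration of the information entropy calculation recited in this claim, the following Python sketch computes H(p) from a list of probabilities. The function names `information_entropy` and `interval_probabilities` are hypothetical, the base-2 logarithm is an assumption (the claim does not state a base), and taking each probability as the share of data falling in an interval is one possible reading rather than something the claim fixes.

```python
import math


def information_entropy(probabilities):
    """H(p) = -sum(p_i * log2(p_i)); zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def interval_probabilities(counts):
    """Assumed reading: p_i is the share of all data that falls in interval i."""
    total = sum(counts)
    return [c / total for c in counts]


# Three intervals holding 2, 1 and 1 data points respectively.
probabilities = interval_probabilities([2, 1, 1])
print(information_entropy(probabilities))  # 1.5
```

A data set concentrated in a single interval gives an entropy of 0, while a set spread evenly over many intervals gives the largest value, which is why entropy can serve as a measure of how much information a merge discards.
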
The data discretization method according to claim 4, wherein the calculating the interval loss rate of the target data set comprises: calculating the interval loss rate of the target data set by using a preset interval loss rate formula, the preset interval loss rate formula being:

$$L_q = \frac{x}{N}$$

Where $L_q$ is the interval loss rate, x is the number of data intervals lost after each combination, and N is the number of data intervals of the discrete data set;

and the calculating, according to the information entropy of the discrete data set and the information entropy of the target data set, an entropy loss rate of the target data set comprises: calculating the entropy loss rate of the target data set from the information entropy of the discrete data set and the information entropy of the target data set by using a preset entropy loss rate formula, the preset entropy loss rate formula being:

$$H_q = \frac{G_0 - G}{G_0}$$

Where $H_q$ is the entropy loss rate, $G_0$ is the information entropy of the discrete data set, and G is the information entropy of the target data set.
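
As a rough illustration of how the steps of claims 1 to 5 might fit together, the following Python sketch merges adjacent intervals until the entropy loss rate exceeds the interval loss rate. It is a sketch under stated assumptions, not the applicant's implementation: intervals are represented as hypothetical `(low, high, count)` tuples, the entropy of a pre-merged interval is read as the entropy of the data set that results from that merge, and G_0 and N are recomputed from the current data set on every pass so that x is 1. The function names are likewise hypothetical.

```python
import math


def entropy(intervals):
    """Information entropy of a data set, from the share of data in each interval."""
    total = sum(count for _, _, count in intervals)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for _, _, c in intervals if c > 0)


def adjacent_merges(intervals):
    """Every data set obtainable by merging one pair of adjacent intervals."""
    candidates = []
    for i in range(len(intervals) - 1):
        (low, _, c1), (_, high, c2) = intervals[i], intervals[i + 1]
        merged = (low, high, c1 + c2)
        candidates.append(intervals[:i] + [merged] + intervals[i + 2:])
    return candidates


def discretize(intervals):
    """Greedy merging, stopping once the entropy loss rate exceeds the interval loss rate."""
    current = list(intervals)
    while len(current) >= 2:
        g0 = entropy(current)
        if g0 == 0:
            break  # no entropy left to lose; also avoids dividing by zero
        # Target data set: the pre-merged candidate with the largest entropy.
        target = max(adjacent_merges(current), key=entropy)
        h_q = (g0 - entropy(target)) / g0                   # H_q = (G_0 - G) / G_0
        l_q = (len(current) - len(target)) / len(current)   # L_q = x / N, x = 1 here
        if h_q > l_q:
            return target
        current = target  # the target becomes the new discrete data set
    return current


print(discretize([(0, 10, 1), (10, 20, 1), (20, 30, 1)]))
# [(0, 20, 2), (20, 30, 1)]
```

In the usage example, merging any adjacent pair of three equally populated intervals loses about 42% of the entropy while removing one third of the intervals, so the first target data set is output. Because H_q is a ratio of entropies, the result does not depend on the logarithm base chosen.
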
The data discretization method according to claim 1, wherein the entropy-based data discretization, before discretizing the value range of the service data to generate the corresponding discrete data set, further comprises:

Obtaining service data of the target service and determining a value range of the service data;

The value range of the service data is processed according to a preset processing rule.
The data discretization method according to claim 6, wherein the processing the value range of the service data according to a preset processing rule comprises: performing filtering and noise reduction, or normalization, on the value range of the service data.
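
The claim does not name a particular filter or normalization scheme. As one possible illustration, the following Python sketch uses a sliding median filter for noise reduction and min-max scaling for normalization; both techniques, the function names, and the `window` parameter are assumptions chosen for the example rather than details taken from the application.

```python
from statistics import median


def median_filter(values, window=3):
    """Replace each value by the median of its neighbourhood (window shrinks at the edges)."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def min_max_normalize(values):
    """Scale values linearly into [0, 1]; a constant series maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


raw = [1, 100, 2, 3, 4]          # 100 is an isolated spike
print(median_filter(raw)[1:4])   # [2, 3, 3]
print(min_max_normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]
```

Either step would run before the entropy-based discretization, so that outliers or differing scales do not distort the initial interval boundaries.
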
A data discretization device comprising:

a discrete generation calculation unit, configured to perform entropy-based data discretization, discretizing a value range of the service data to generate a corresponding discrete data set, and to calculate an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;

a first merge calculation unit, configured to pre-merge the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, and calculate an information entropy of the pre-merged data interval;

a second merge calculation unit, configured to combine the pre-merged data intervals having the largest information entropy in the discrete data set as a target data set, and calculate an information entropy of the target data set and an interval loss rate;

An entropy loss rate calculation unit, configured to calculate an entropy loss rate of the target data set according to information entropy of the discrete data set and information entropy of the target data set;

a loss rate determining unit, configured to determine whether the entropy loss rate is greater than the interval loss rate;

And a data set output unit, configured to output the target data set to complete data discretization of the value range of the service data if the entropy loss rate is greater than the interval loss rate.
The data discretization apparatus according to claim 8, further comprising:

a return loop execution unit, configured to, if the entropy loss rate is not greater than the interval loss rate, set the target data set as the discrete data set and return to the step of pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
The data discretization apparatus according to claim 8, wherein the first merge calculation unit is specifically configured to pre-merge two adjacent data intervals in the discrete data set to obtain a plurality of pre-combined data intervals.
A computer device comprising a memory, a processor, and a computer program stored on the memory and operative on the processor, the processor executing the computer program to implement the following steps:

Entropy-based data discretization, discretizing the value range of the service data to generate a corresponding discrete data set, and calculating an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;

Pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-combined data intervals, and calculating an information entropy of the pre-merged data interval;

Combining the pre-merged data intervals having the largest information entropy in the discrete data set as a target data set, and calculating an information entropy of the target data set and an interval loss rate;

Calculating an entropy loss rate of the target data set according to information entropy of the discrete data set and information entropy of the target data set;

Determining whether the entropy loss rate is greater than the interval loss rate;

If the entropy loss rate is greater than the interval loss rate, the target data set is output to complete data discretization of the value range of the service data.
The computer device of claim 11 wherein said processor, when executing said computer program, implements the following steps:

If the entropy loss rate is not greater than the interval loss rate, setting the target data set as the discrete data set and returning to the step of pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
The computer device of claim 11 wherein said processor, when executing said computer program, implements the following steps:

Adjacent two data intervals in the discrete data set are pre-combined to obtain a plurality of pre-combined data intervals.
The computer device of claim 11 wherein said processor, when executing said computer program, implements the following steps:

Calculating an information entropy of the discrete data set and calculating an information entropy of the pre-merged data interval by using a calculation formula of information entropy, wherein the calculation formula of the information entropy is:

$$H(p) = -\sum_{i=1}^{n} p_i \log p_i$$

Where n is a positive integer greater than 1, i is a positive integer between 1 and n, $p_i$ is the probability of occurrence of the i-th data, and $H(p)$ is the information entropy.
The computer device of claim 14, wherein the processor, when executing the computer program, implements the following steps:

Calculating an interval loss rate of the target data set by using a preset interval loss rate formula, where the preset interval loss rate formula is:

$$L_q = \frac{x}{N}$$

Where $L_q$ is the interval loss rate, x is the number of data intervals lost after each combination, and N is the number of data intervals of the discrete data set;

Calculating an entropy loss rate of the target data set according to an information entropy of the discrete data set and an information entropy of the target data set by using a preset entropy loss rate formula, where the preset entropy loss rate formula is:

$$H_q = \frac{G_0 - G}{G_0}$$

Where $H_q$ is the entropy loss rate, $G_0$ is the information entropy of the discrete data set, and G is the information entropy of the target data set.
A storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the following steps:

Entropy-based data discretization, discretizing the value range of the service data to generate a corresponding discrete data set, and calculating an information entropy of the discrete data set, wherein the discrete data set includes a plurality of data intervals;

Pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-combined data intervals, and calculating an information entropy of the pre-merged data interval;

Combining the pre-merged data intervals having the largest information entropy in the discrete data set as a target data set, and calculating an information entropy of the target data set and an interval loss rate;

Calculating an entropy loss rate of the target data set according to information entropy of the discrete data set and information entropy of the target data set;

Determining whether the entropy loss rate is greater than the interval loss rate;

If the entropy loss rate is greater than the interval loss rate, the target data set is output to complete data discretization of the value range of the service data.
The storage medium of claim 16 wherein said program instructions, when executed by a processor, cause said processor to perform the following steps:

If the entropy loss rate is not greater than the interval loss rate, setting the target data set as the discrete data set and returning to the step of pre-merging the data intervals in the discrete data set according to a preset merge rule to obtain a plurality of pre-merged data intervals, until the entropy loss rate is greater than the interval loss rate.
The storage medium of claim 16 wherein said program instructions, when executed by a processor, cause said processor to perform the following steps:

Adjacent two data intervals in the discrete data set are pre-combined to obtain a plurality of pre-combined data intervals.
The storage medium of claim 16 wherein said program instructions, when executed by a processor, cause said processor to perform the following steps:

Calculating an information entropy of the discrete data set and calculating an information entropy of the pre-merged data interval by using a calculation formula of information entropy, wherein the calculation formula of the information entropy is:

$$H(p) = -\sum_{i=1}^{n} p_i \log p_i$$

Where n is a positive integer greater than 1, i is a positive integer between 1 and n, $p_i$ is the probability of occurrence of the i-th data, and $H(p)$ is the information entropy.
The storage medium of claim 19, wherein the program instructions, when executed by a processor, cause the processor to perform the following steps:

Calculating an interval loss rate of the target data set by using a preset interval loss rate formula, where the preset interval loss rate formula is:

$$L_q = \frac{x}{N}$$

Where $L_q$ is the interval loss rate, x is the number of data intervals lost after each combination, and N is the number of data intervals of the discrete data set;

Calculating an entropy loss rate of the target data set according to an information entropy of the discrete data set and an information entropy of the target data set by using a preset entropy loss rate formula, where the preset entropy loss rate formula is:

$$H_q = \frac{G_0 - G}{G_0}$$

Where $H_q$ is the entropy loss rate, $G_0$ is the information entropy of the discrete data set, and G is the information entropy of the target data set.